Unicode® Standard Annex #31

Unicode Identifiers and Syntax

Version	Unicode 16.0.0
Editors	Mark Davis (mark@unicode.org) and Robin Leroy (eggrobin@unicode.org)
Date	2024-09-02
This Version	https://www.unicode.org/reports/tr31/tr31-41.html
Previous Version	https://www.unicode.org/reports/tr31/tr31-39.html
Latest Version	https://www.unicode.org/reports/tr31/
Latest Proposed Update	https://www.unicode.org/reports/tr31/proposed.html
Revision	41

Summary

This annex describes specifications for recommended defaults for the use of Unicode in the definitions of general-purpose identifiers, immutable identifiers, hashtag identifiers, and in pattern-based syntax. It also supplies guidelines for use of normalization with identifiers.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].

1 Introduction
- Figure 1. Code Point Categories for Identifier Parsing
- 1.1 Stability
  - Table 1. Permitted Changes in Future Versions
- 1.2 Customization
- 1.3 Display Format
- 1.4 Conformance
- 1.5 Notation
2 Default Identifiers
- Table 2. Properties for Lexical Classes for Identifiers
- 2.1 Combining Marks
- 2.2 Modifier Letters
- 2.3 Layout and Format Control Characters
- 2.4 Specific Character Adjustments
  - Table 3. Optional Characters for Start
  - Table 3a. Optional Characters for Medial
  - Table 3b. Optional Characters for Continue
  - Table 4. Excluded Scripts
  - Table 5. Recommended Scripts
  - Table 6. Aspirational Use Scripts (Withdrawn)
  - Table 7. Limited Use Scripts
- 2.5 Backward Compatibility
3 Immutable Identifiers
4 Whitespace and Syntax
- 4.1 Whitespace
  - 4.1.1 Bidirectional Ordering
  - 4.1.2 Required_Spaces
  - 4.1.3 Contexts for Ignorable Format Controls
- 4.2 Syntax
  - 4.2.1 User-Defined Operators
- 4.3 Pattern Syntax
5 Normalization and Case
- 5.1 NFKC Modifications
  - 5.1.1 Modifications for Characters that Behave Like Combining Marks
  - 5.1.2 Modifications for Irregularly Decomposing Characters
  - 5.1.3 Identifier Closure Under Normalization
    - Figure 5. Normalization Closure
    - Figure 6. Case Closure
    - Figure 7. Reverse Normalization Closure
    - Table 8. Compatibility Equivalents to Letters or Decimal Numbers
    - Table 9. Canonical Equivalence Exceptions Prior to Unicode 5.1
- 5.2 Case and Stability
  - 5.2.1 Edge Cases for Folding
6 Hashtag Identifiers
7 Standard Profiles
Acknowledgments
References
Migration
Modifications

1 Introduction

A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers, such as programming language variables or domain names. There are also realms where identifiers need to be defined with an extended set of characters to align better with what end users expect, such as in hashtags.

To assist in the standard treatment of identifiers in Unicode character-based parsers and lexical analyzers, a set of specifications is provided here as a basis for parsing identifiers that contain Unicode characters. These specifications include:

Default Identifiers: a recommended default for the definition of identifiers.
Immutable Identifiers: for environments that need a definition of identifiers that does not change across versions of Unicode.
Hashtag Identifiers: for identifiers that need a broader set of characters, principally for hashtags.

These guidelines follow the typical pattern of identifier syntax rules in common programming languages, by defining an ID_Start class and an ID_Continue class and using a simple BNF rule for identifiers based on those classes; however, the composition of those classes is more complex and contains additional types of characters, due to the universal scope of the Unicode Standard.

This annex also provides guidelines for the use of normalization and case insensitivity with identifiers, expanding on a section that was originally in Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15].

Lexical analysis of computer languages is also concerned with lexical elements other than identifiers, and with white space and line breaks that separate them. This annex provides guidelines for the sets of characters that have such lexical significance outside of identifiers.

The specification in this annex provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also allows any appropriate new Unicode characters to become available in identifiers. In addition, Unicode character properties for stable pattern syntax are provided. The resulting pattern syntax is backward compatible and forward compatible over future versions of the Unicode Standard. These properties can either be used alone or in conjunction with the identifier characters.

Figure 1 shows the disjoint categories of code points defined in this annex. (The sizes of the boxes are not to scale.)

Figure 1. Code Point Categories for Identifier Parsing

[Figure: disjoint boxes labeled Pattern_Syntax Characters, Pattern_White_Space Characters, Unassigned Code Points, ID_Start Characters, ID_Nonstart Characters, and Other Assigned Code Points]

The set consisting of the union of ID_Start and ID_Nonstart characters is known as Identifier Characters and has the property ID_Continue. The ID_Nonstart set is defined as the set difference ID_Continue minus ID_Start: it is not a formal Unicode property. While lexical rules are traditionally expressed in terms of the latter, the discussion here is simplified by referring to disjoint categories.

1.1 Stability

There are certain features that developers can depend on for stability:

Identifier characters, Pattern_Syntax characters, and Pattern_White_Space are disjoint: they will never overlap.
By definition, the Identifier characters are always a superset of the ID_Start characters.
The Pattern_Syntax characters and Pattern_White_Space characters are immutable and will not change over successive versions of Unicode.
The ID_Start and ID_Nonstart characters may grow over time, either by the addition of new characters provided in a future version of Unicode or (in rare cases) by the addition of characters that were in Other.

In successive versions of Unicode, the only allowed changes of characters from one of the above classes to another are those listed with a plus sign (+) in Table 1.

Table 1. Permitted Changes in Future Versions

	ID_Start	ID_Nonstart	Other Assigned
Unassigned	+	+	+
Other Assigned	+	+	
ID_Nonstart	+		

The Unicode Consortium has formally adopted a stability policy on identifiers. For more information, see [Stability].

1.2 Customization

Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters such as $, @, #, and _ in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, implementers may combine those specific rules with the syntax and properties provided here.

Each programming language can define its identifier syntax as relative to the Unicode identifier syntax, such as saying that identifiers are defined by the Unicode properties, with the addition of “$”. By addition or subtraction of a small set of language specific characters, a programming language standard can easily track a growing repertoire of Unicode characters in a compatible way. See also Section 2.5, Backward Compatibility.

Similarly, each programming language can define its own whitespace characters or syntax characters relative to the Unicode Pattern_White_Space or Pattern_Syntax characters, with some specified set of additions or subtractions.

Systems that want to extend identifiers to encompass words used in natural languages, or narrow identifiers for security may do so as described in Section 2.3, Layout and Format Control Characters, Section 2.4, Specific Character Adjustments, and Section 5, Normalization and Case.

To preserve the disjoint nature of the categories illustrated in Figure 1, any character added to one of the categories must be subtracted from the others.

Note: In many cases there are important security implications that may require additional constraints on identifiers. For more information, see [UTR36].

1.3 Display Format

Implementations may use a format for displaying identifiers that differs from the internal form used to compare identifiers. For example, an implementation might display what the user has entered, but use a normalized format for comparison. Examples of this include:

Case. The display format retains case differences, but the comparison format erases them by using Case_Folding. Thus “A” and its lowercase variant “a” would be treated as the same identifier internally, even though they may have been input differently and may display differently.
Variants. The display format retains variant distinctions, such as halfwidth versus fullwidth forms, or between variation sequences and their base characters, but the comparison format erases them by using NFKC_Case_Folding. Thus “A” and its full-width variant “Ａ” would be treated as the same identifier internally, even though they may have been input differently and may display differently.
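
As a rough illustration of the two cases above, the following Python sketch keeps the entered string for display and derives a separate key for comparison. The function name `comparison_key` is hypothetical, and the key only approximates NFKC_Case_Folding by combining the standard library's NFKC normalization with `str.casefold()`; in particular, it does not remove default-ignorable code points.

```python
import unicodedata

def comparison_key(identifier: str) -> str:
    """Approximate comparison form: NFKC plus full case folding.

    Hypothetical helper; not the exact NFKC_Casefold mapping from the UCD.
    """
    # Fold compatibility variants (for example fullwidth forms), then case.
    folded = unicodedata.normalize("NFKC", identifier).casefold()
    # Case folding can yield text that is no longer normalized, so
    # normalize once more before comparing.
    return unicodedata.normalize("NFKC", folded)

# The display form keeps what the user typed; only the keys are compared.
entered = ["A", "a", "\uFF21"]  # capital A, small a, FULLWIDTH CAPITAL A
keys = {comparison_key(s) for s in entered}
```

Under this sketch all three entered strings share one comparison key, while each can still be displayed exactly as it was typed.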

For an example of the use of display versus comparison formats see UTS #46: Unicode IDNA Compatibility Processing [UTS46]. For more information about normalization and case in identifiers see Section 5, Normalization and Case.

1.4 Conformance

The following describes the possible ways that an implementation can claim conformance to this specification.

UAX31-C1. An implementation claiming conformance to this specification shall identify the version of this specification.

Note: An implementation can make use of the property-based definitions from a specific version of this specification with property assignments from an unversioned reference to the Unicode Character Database. In this case, the implementation should specify a minimum version of Unicode for the properties.

UAX31-C2. An implementation claiming conformance to this specification shall describe which of the following requirements it observes:

Note: Requirement R1a has been removed. The characters that were added when meeting this requirement are now part of the default; the contextual checks required by this requirement remain as part of the General Security Profile in Unicode Technical Standard #39, “Unicode Security Mechanisms” [UTS39].

Note: Meeting requirement R3 is equivalent to meeting requirements R3a and R3b.

1.5 Notation

This annex uses UnicodeSet notation to illustrate the derivation of some properties or sets of characters. This notation is defined in the “Unicode Sets” section of UTS #35, Unicode Locale Data Markup Language [UTS35].

2 Default Identifiers

The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters.

The formulations allow for extensions, also known as profiles. That is, the particular set of code points or sequences of code points for each category used by the syntax can be customized according to the requirements of the environment. Profiles are described as additions to or removals from the categories used by the syntax. They can thus be combined, provided that there are no conflicts (whereby one profile adds a character and another removes it), or that the resolution of such conflicts is specified.

If such extensions include characters from Pattern_White_Space or Pattern_Syntax, then such identifiers do not conform to an unmodified UAX31-R3 Pattern_White_Space and Pattern_Syntax Characters. However, such extensions may often be necessary. For example, Java and C++ identifiers include ‘$’, which is a Pattern_Syntax character.

UAX31-D1. Default Identifier Syntax:

<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

Identifiers are defined by assigning the sets of lexical classes defined as properties in the Unicode Character Database [UAX44]. These properties are shown in Table 2. The first column shows the property name, whose values are defined in the UCD. The second column provides a general description of the coverage for the associated class, the derivational relationship between the ID properties and the XID properties, and an associated UnicodeSet notation for the class.

Table 2. Properties for Lexical Classes for Identifiers

Properties	General Description of Coverage
`ID_Start`	`ID_Start` characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points. In UnicodeSet notation: [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
`XID_Start`	`XID_Start` characters are derived from `ID_Start` as per Section 5.1, NFKC Modifications.
`ID_Continue`	`ID_Continue` characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points. In UnicodeSet notation: [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
`XID_Continue`	`XID_Continue` characters are derived from `ID_Continue` as per Section 5.1, NFKC Modifications. `XID_Continue` characters are also known simply as Identifier Characters, because they are a superset of the `XID_Start` characters.

Note that “other letters” includes ideographs. For more about the stability extensions, see Section 2.5, Backward Compatibility.

The innovations in the identifier syntax to cover the UnicodeStandard include the following:

Incorporation of proper handling of combining marks.
Allowance for layout and format control characters, which should be ignored when parsing identifiers.

The XID_Start and XID_Continue properties are improved lexical classes that incorporate the changes described in Section 5.1, NFKC Modifications. They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties.

UAX31-R1. Default Identifiers: To meet this requirement, to determine whether a string is an identifier an implementation shall choose either UAX31-R1-1 or UAX31-R1-2.

UAX31-R1-1. Use definition UAX31-D1, setting Start and Continue to the properties XID_Start and XID_Continue, respectively, and leaving Medial empty.
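
One way to experiment with UAX31-R1-1 is sketched below in Python. It relies on the documented behavior of `str.isidentifier()`, which follows XID_Start and XID_Continue but additionally accepts U+005F LOW LINE as a start character (a profile of Python's own). The helper names are hypothetical, each helper expects a single character, and the results reflect the Unicode version bundled with the running interpreter rather than any particular version of this annex.

```python
def is_xid_start(ch: str) -> bool:
    """True if ch is XID_Start (excluding Python's extra '_' start)."""
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch: str) -> bool:
    """True if ch is XID_Continue, probed behind a known start character."""
    return ("a" + ch).isidentifier()

def is_default_identifier(s: str) -> bool:
    """UAX31-D1 with Start=XID_Start, Continue=XID_Continue, Medial empty."""
    if not s:
        return False
    return is_xid_start(s[0]) and all(is_xid_continue(c) for c in s[1:])
```

With Medial left empty, the syntax reduces to one Start character followed by any number of Continue characters, so `x_1` is accepted while `_x`, `1abc`, and `a-b` are not.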

UAX31-R1-2. Declare that it uses a profile of UAX31-R1-1 and define that profile with a precise specification of the characters and character sequences that are added to or removed from Start, Continue, and Medial and/or provide a list of additional constraints on identifiers.

Note: Such a specification may incorporate a reference to one or more of the standard profiles described in Section 7, Standard Profiles.

One such profile may be to use the contents of ID_Start and ID_Continue in place of XID_Start and XID_Continue, for backward compatibility.

Another such profile would be to include some set of the optional characters, for example:

Start := XID_Start, plus some characters from Table 3
Continue := Start + XID_Continue, plus some characters from Table 3b
Medial := some characters from Table 3a

Note: Characters in the Medial class must not overlap with those in either the Start or Continue classes. Thus, any characters added to the Medial class from Table 3a must be checked to ensure they do not also occur in either the newly defined Start class or Continue class.
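
The profile shape above, together with the UAX31-D1 syntax, can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the names are hypothetical, the Medial set is only a subset of Table 3a chosen for the example, no Table 3b characters are added, and `str.isidentifier()` is used as a probe for XID_Start and XID_Continue.

```python
OPTIONAL_START = set("$_")            # Table 3
OPTIONAL_MEDIAL = set("'-.:\u2019")   # illustrative subset of Table 3a

def is_start(ch: str) -> bool:
    # XID_Start (str.isidentifier also admits '_') plus Table 3.
    return ch in OPTIONAL_START or ch.isidentifier()

def is_continue(ch: str) -> bool:
    # Start plus XID_Continue.
    return is_start(ch) or ("a" + ch).isidentifier()

def is_medial(ch: str) -> bool:
    return ch in OPTIONAL_MEDIAL

def check_profile() -> None:
    """Raise if Medial overlaps Start or Continue, as the note forbids."""
    overlap = {c for c in OPTIONAL_MEDIAL if is_continue(c)}
    if overlap:
        raise ValueError(f"Medial overlaps Start/Continue: {overlap!r}")

def is_identifier(s: str) -> bool:
    """<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*"""
    if not s or not is_start(s[0]):
        return False
    prev_medial = False
    for ch in s[1:]:
        if is_continue(ch):
            prev_medial = False
        elif is_medial(ch) and not prev_medial:
            prev_medial = True   # must be followed by a Continue character
        else:
            return False
    return not prev_medial       # a trailing Medial is not allowed
```

Because each Medial character must be followed by at least one Continue character, strings such as `a-b` and `don't` match, while `a-`, `a--b`, and `-a` do not.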

Beyond such minor modifications, profiles could also be used to significantly extend the character set available in identifiers. In so doing, care must be taken not to unintentionally include undesired characters, or to violate important invariants.

An implementation should be careful when adding a property-based set to a profile.

For example, consider a profile that adds subscript and superscript digits and operators in order to support technical notations, such as:

Context	Example Identifier
Assyriology	`dun₃⁺`
Chemistry	`Ca²⁺_concentration`
Mathematics	`xₖ₊₁` or `f⁽⁴⁾`
Phonetics	`daan⁶`

That profile may be described as adding the following set to XID_Continue:

[⁽₍⁾₎⁺₊⁼₌⁻₋⁰₀¹₁²₂³₃⁴₄⁵₅⁶₆⁷₇⁸₈⁹₉].

Note: The above list is for illustration only. A standard profile is provided to support the use of mathematical compatibility notation in identifiers. See Section 7.1, Mathematical Compatibility Notation Profile.

If, instead of listing these characters explicitly, the profile had chosen to use properties or combinations of properties, that might result in including undesired characters.

For example, \p{General_Category=Other_Number} is the general category set containing the subscript and superscript digits. But it also includes the compatibility characters [⑴ 🄂 ⒈], which are not needed for technical notations, and are very likely inappropriate for identifiers, on multiple counts.
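
The mismatch between the explicit list and the property-based set can be checked with a short Python sketch. The variable names are hypothetical, and the category values come from the Unicode data bundled with the running interpreter.

```python
import unicodedata

# The explicit set from the example profile.
EXPLICIT = set("⁽₍⁾₎⁺₊⁼₌⁻₋⁰₀¹₁²₂³₃⁴₄⁵₅⁶₆⁷₇⁸₈⁹₉")

def in_other_number(ch: str) -> bool:
    """True if ch has General_Category=Other_Number (No)."""
    return unicodedata.category(ch) == "No"

# Wanted characters that the property does cover (the digits).
digits = sorted(c for c in EXPLICIT if in_other_number(c))
# Wanted characters that the property misses (parentheses and operators).
missed = sorted(c for c in EXPLICIT if not in_other_number(c))
# Characters the property would pull in although the profile does not want them.
unwanted = [c for c in "⑴🄂⒈" if in_other_number(c) and c not in EXPLICIT]
```

In this sketch the property covers the twenty subscript and superscript digits, misses the ten parentheses and operators, and also matches all three compatibility characters cited above, so it both over-includes and under-includes relative to the intended set.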

On the other hand, a language that allows currency symbols in identifiers could have \p{General_Category=Currency_Symbol} as a profile, since that property matches the intent.

Similarly, a profile based on adding entire blocks is likely to include unintended characters, or to miss ones that are desired. For the use of blocks see Annex A, Character Blocks, in [UTS18].

Defining a profile by use of a property also needs to take account of the fact that unless the property is designed to be stable (such as XID_Continue), code points could be removed in a future version of Unicode. If the profile also needs stable identifiers (backwards compatible), then it must take additional measures. See UAX31-R1b Stable Identifiers.

Implementations that require identifier closure under normalization should ensure that any custom profile preserves identifier closure under the chosen normalization form. See Section 5.1.3, Identifier Closure Under Normalization. The example cited above regarding subscripts and superscripts preserves identifier closure under Normalization Forms C and D, but not under Forms KC and KD. Under NFKC and NFKD, the subscript and superscript parentheses and operators normalize to their ASCII counterparts. If an implementation that uses this profile relies on identifier closure under normalization, it should conform to UAX31-R4 using NFC, not NFKC.
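
The closure behavior described for the subscript and superscript profile can be observed on a single sample with the Python sketch below. The helper names are hypothetical, `str.isidentifier()` serves as a probe for XID_Continue, and checking one string is only a spot check, not a proof of closure for the whole profile.

```python
import unicodedata

# The example profile: XID_Continue plus this explicit set.
EXTRA = set("⁽₍⁾₎⁺₊⁼₌⁻₋⁰₀¹₁²₂³₃⁴₄⁵₅⁶₆⁷₇⁸₈⁹₉")

def allowed(ch: str) -> bool:
    """Continue class of the example profile (hypothetical helper)."""
    return ch in EXTRA or ("a" + ch).isidentifier()

def stays_in_profile(form: str, s: str) -> bool:
    """True if every character of the normalized string is still allowed."""
    return all(allowed(c) for c in unicodedata.normalize(form, s))

sample = "f⁽⁴⁾"
nfc = unicodedata.normalize("NFC", sample)    # unchanged
nfkc = unicodedata.normalize("NFKC", sample)  # superscripts become ASCII
```

Here the NFC form is identical to the input, while the NFKC form contains ASCII parentheses, which are outside the profile.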

Note: While default identifiers are less open-ended than immutable identifiers, they are still subject to spoofing issues arising from invisible characters, visually identical characters, or bidirectional reordering causing distinct sequences to appear in the same order. Where spoofing concerns are relevant, the mechanisms described in Unicode Technical Standard #39, “Unicode Security Mechanisms” [UTS39], should be used. For the specific case of programming languages and programming environments, recommendations are provided in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

UAX31-R1a. Restricted Format Characters: This clause has been removed.

The characters that were added when meeting this requirement are now part of the default; the contextual checks required by this requirement remain as part of the General Security Profile in Unicode Technical Standard #39, “Unicode Security Mechanisms” [UTS39].

UAX31-R1b. Stable Identifiers: To meet this requirement, an implementation shall guarantee that identifiers are stable across versions of the Unicode Standard: that is, once a string qualifies as an identifier, it does so in all future versions of the Unicode Standard.

Note: The UAX31-R1b requirement is relevant when an identifier definition is based on property assignments from an unversioned reference to the Unicode Standard, as property assignments may change in a future version of the standard. It is typically achieved by using a small list of characters that qualified as identifier characters in some previous version of Unicode. See Section 2.5, Backward Compatibility. Where profiles are allowed, management of those profiles may also be required to guarantee backwards compatibility. Typically such management also uses a list of characters that qualified previously. Because of the stability policy [Stability], if an implementation meets either requirement UAX31-R1 or UAX31-R2 without declaring a profile, that implementation also meets requirement UAX31-R1b.

Example: Consider an identifier definition which uses UAX31-R1 default identifiers with a profile that adds digits (characters with General_Category=Nd) to the set Start, and uses an unversioned reference to the Unicode Character Database, with a minimum version of 5.2.0.
With property assignments from Unicode Version 5.2.0, both ᧚ (U+19DA) and A᧚ (U+0041, U+19DA) are valid identifiers under this definition: U+19DA has General_Category=Nd.
In Unicode Version 6.0.0, U+19DA has General_Category=No. The identifier A᧚ (U+0041, U+19DA) remains valid, because XID_Continue includes any characters that used to be XID_Continue. However, ᧚ is not a valid identifier, because U+19DA is no longer in the set [:Nd:].
In order to meet requirement UAX31-R1b, the definition would need to be changed to add to the set Start all characters that have the property General_Category=Nd in any version of Unicode starting from Unicode 5.2.0 and up to the version used by the implementation.
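
The remedy in this example can be sketched in Python as a frozen list that is merged into Start. The set `GRANDFATHERED_ND` and the function names are hypothetical; a real implementation would derive the list by comparing the Nd sets of every supported Unicode version, whereas this sketch lists only U+19DA from the example. `str.isidentifier()` is used as a probe for XID_Start and XID_Continue.

```python
import unicodedata

# Hypothetical frozen list: characters that were General_Category=Nd in an
# earlier supported version but are no longer Nd (U+19DA per the example).
GRANDFATHERED_ND = frozenset({"\u19DA"})

def is_start_unversioned(ch: str) -> bool:
    """Profile Start using only current data: XID_Start plus current Nd."""
    is_xid_start = ch != "_" and ch.isidentifier()
    return is_xid_start or unicodedata.category(ch) == "Nd"

def is_start_stable(ch: str) -> bool:
    """Same profile, extended with the frozen list to satisfy UAX31-R1b."""
    return is_start_unversioned(ch) or ch in GRANDFATHERED_ND

def is_identifier(s: str, stable: bool = True) -> bool:
    if not s:
        return False
    start = is_start_stable if stable else is_start_unversioned
    return start(s[0]) and all(("a" + c).isidentifier() for c in s[1:])
```

With current data, `is_identifier("\u19DA", stable=False)` rejects the bare character, mirroring the breakage described above, while the stable variant continues to accept it; `"A\u19DA"` is accepted either way because U+19DA remains in XID_Continue.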

2.1 Combining Marks

Combining marks are accounted for in identifier syntax: a composed character sequence consisting of a base character followed by any number of combining marks is valid in an identifier. Combining marks are required in the representation of many languages, and the conformance rules in Chapter 3, Conformance, of [Unicode] require the interpretation of canonical-equivalent character sequences. The simplest way to do this is to require identifiers in the NFC format (or transform them into that format); see Section 5, Normalization and Case.

Enclosing combining marks (such as U+20DD..U+20E0) are excluded from the definition of the lexical class ID_Continue, because the composite characters that result from their composition with letters are themselves not normally considered valid constituents of these identifiers.

2.2 Modifier Letters

Modifier letters (General_Category=Lm) are also included in the definition of the syntax classes for identifiers. Modifier letters are often part of natural language orthographies and are useful for making word-like identifiers in formal languages. On the other hand, modifier symbols (General_Category=Sk), which are seldom a part of language orthographies, are excluded from identifiers. For more discussion of modifier letters and how they function, see [Unicode].

Implementations that tailor identifier syntax for special purposes may wish to take special note of modifier letters, as in some cases modifier letters have appearances, such as raised commas, which may be confused with common syntax characters such as quotation marks.
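
The inclusion and exclusion rules of Sections 2.1 and 2.2 can be spot-checked with the Python sketch below. The four sample characters are chosen for illustration, `continue_ok` is a hypothetical helper that uses `str.isidentifier()` as a probe for XID_Continue, and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

def continue_ok(ch: str) -> bool:
    """True if ch may follow a start character (XID_Continue probe)."""
    return ("a" + ch).isidentifier()

samples = {
    "\u0301": "COMBINING ACUTE ACCENT (nonspacing mark)",
    "\u20DD": "COMBINING ENCLOSING CIRCLE (enclosing mark)",
    "\u02B0": "MODIFIER LETTER SMALL H (modifier letter)",
    "\u02C2": "MODIFIER LETTER LEFT ARROWHEAD (modifier symbol)",
}

# Map each sample to (General_Category, allowed as identifier continue).
report = {ch: (unicodedata.category(ch), continue_ok(ch)) for ch in samples}
```

The nonspacing mark (Mn) and the modifier letter (Lm) are accepted, while the enclosing mark (Me) and the modifier symbol (Sk) are rejected.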

2.3 Layout and Format Control Characters

Certain Unicode characters are known as Default_Ignorable_Code_Points. These include variation selectors and characters used to control joining behavior, bidirectional ordering control, and alternative formats for display (having the General_Category value of Cf). The use of default-ignorable characters in identifiers is problematic, first because the effects they represent are stylistic or otherwise out of scope for identifiers, and second because the characters themselves often have no visible display. It is also possible to misapply these characters such that users can create strings that look the same but actually contain different characters, which can create security problems. In environments where spoofing concerns are paramount, such as top-level domain names, identifiers should also be limited to characters that are case-folded and normalized with the NFKC_Casefold operation. For more information, see Section 5, Normalization and Case and UTR #36: Unicode Security Considerations [UTR36].

While not all Default_Ignorable_Code_Points are in XID_Continue, the variation selectors and joining controls are included in XID_Continue. These variation selectors are used in standardized variation sequences, sequences from the Ideographic Variation Database, and emoji variation sequences. The joining controls are used in the orthographies of some languages, as well as in emoji ZWJ sequences. However, these characters are subject to the same considerations as other Default_Ignorable_Code_Points listed above. Because variation selectors and joining controls request a difference in display but do not guarantee it, they do not work well in general-purpose identifiers. A profile should be used to remove them from general-purpose identifiers (along with other Default_Ignorable_Code_Points), unless their use is required in a particular domain, such as in a profile that includes emoji. For such a profile it may be useful to explicitly retain or even add certain Default_Ignorable_Code_Points in the identifier syntax.

For programming language identifiers, spoofing issues are more comprehensively addressed by higher-level diagnostics rather than at the syntactic level. See Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

Comparison. In any environment where the display form for identifiers differs from the form used to compare them, Default_Ignorable_Code_Points should be ignored for comparison. For example, this applies to case-insensitive identifiers. For more information, see Section 1.3, Display Format.

Notes:
An implementation of UAX31-R4 and UAX31-R5 (Equivalent Case and Compatibility-Insensitive Identifiers) that compares identifiers under the identifier caseless match defined by D147 [Unicode], that is, canonical decomposition (NFD) followed by the toNFKC_Casefold operation, ignores Default_Ignorable_Code_Points.
The Default_Ignorable_Code_Point property values are not guaranteed to be stable. However, the derivation of the NFKC_Casefold property will be changed if necessary to ensure that it remains stable for default identifiers. That means that the toNFKC_Casefold operation applied to a string with only characters in XID_Continue in a version of Unicode will have the same results in any future version of Unicode.

In addition, a standard profile is provided to exclude all Default_Ignorable_Code_Points; see Section 7, Standard Profiles. Note however that, even if Default_Ignorable_Code_Points are excluded, spoofing issues remain unless the mechanisms in Unicode Technical Standard #39, “Unicode Security Mechanisms” [UTS39] are utilized.

The General Security Profile defined in Section 3.1, General Security Profile for Identifiers, in UTS #39, Unicode Security Mechanisms [UTS39], excludes all Default_Ignorable_Code_Points by default, including variation selectors.

2.4 Specific Character Adjustments

Specific identifier syntaxes can be treated as tailorings (or profiles) of the generic syntax based on character properties. For example, SQL identifiers allow an underscore as an identifier continue, but not as an identifier start; C identifiers allow an underscore as either an identifier continue or an identifier start. Specific languages may also want to exclude the characters that have a Decomposition_Type other than Canonical or None, or to exclude some subset of those, such as those with a Decomposition_Type equal to Font.

There are circumstances in which identifiers are expected to more fully encompass words or phrases used in natural languages.

For more natural-language identifiers, a profile should allow the characters in Table 3, Table 3a, and Table 3b in identifiers, unless there are compelling reasons not to. Most additions to identifiers are restricted to medial positions. These are listed in Table 3a. A few characters can also occur in final positions, and are listed in Table 3b. The contents of these tables may overlap.

In some environments even spaces and @ are allowed in identifiers, such as in SQL: SELECT * FROM Employee Pension.

Table 3. Optional Characters for Start

Code Point	Character	Name
0024	$	DOLLAR SIGN
005F	_	LOW LINE

Table 3a. Optional Characters for Medial

Code Point	Character	Name
0027	'	APOSTROPHE
002D	-	HYPHEN-MINUS
002E	.	FULL STOP
003A	:	COLON
058A	֊	ARMENIAN HYPHEN
05F4	״	HEBREW PUNCTUATION GERSHAYIM
0F0B	་	TIBETAN MARK INTERSYLLABIC TSHEG
2010	‐	HYPHEN
2019	’	RIGHT SINGLE QUOTATION MARK
2027	‧	HYPHENATION POINT
30A0	゠	KATAKANA-HIRAGANA DOUBLE HYPHEN

Table 3b. Optional Characters for Continue

Code Point	Character	Name
05F3	׳	HEBREW PUNCTUATION GERESH

In UnicodeSet notation, the characters in these tables are:

Table 3: [\$_]
Table 3a: ['\-.\:֊״་‐’‧゠・]
Table 3b: [ ׳]

In identifiers that allow for unnormalized characters, the compatibility equivalents of the characters listed in Table 3, Table 3a, and Table 3b may also be appropriate.

For more information on characters that may occur in words, and thosethat may be used in name validation, see Section 4, Word Boundaries, in [UAX29].

Some scripts are not in customary modern use, and thus implementations may want to exclude them from identifiers. These include historic and obsolete scripts, scripts used mostly liturgically, and regional scripts used only in very small communities or with very limited current usage. Some scripts also have unresolved architectural issues that make them currently unsuitable for identifiers. The scripts in Table 4, Excluded Scripts are recommended for exclusion from identifiers.

Table 4. Excluded Scripts

Property Notation	Description
`\p{script=Aghb}`	Caucasian Albanian
`\p{script=Ahom}`	Ahom
`\p{script=Armi}`	Imperial Aramaic
`\p{script=Avst}`	Avestan
`\p{script=Bass}`	Bassa Vah
`\p{script=Bhks}`	Bhaiksuki
`\p{script=Brah}`	Brahmi
`\p{script=Bugi}`	Buginese
`\p{script=Buhd}`	Buhid
`\p{script=Cari}`	Carian
`\p{script=Chrs}`	Chorasmian
`\p{script=Copt}`	Coptic
`\p{script=Cpmn}`	Cypro-Minoan
`\p{script=Cprt}`	Cypriot
`\p{script=Diak}`	Dives Akuru
`\p{script=Dogr}`	Dogra
`\p{script=Dsrt}`	Deseret
`\p{script=Dupl}`	Duployan
`\p{script=Egyp}`	Egyptian Hieroglyphs
`\p{script=Elba}`	Elbasan
`\p{script=Elym}`	Elymaic
`\p{script=Glag}`	Glagolitic
`\p{script=Gong}`	Gunjala Gondi
`\p{script=Gonm}`	Masaram Gondi
`\p{script=Goth}`	Gothic
`\p{script=Gran}`	Grantha
`\p{script=Hano}`	Hanunoo
`\p{script=Hatr}`	Hatran
`\p{script=Hluw}`	Anatolian Hieroglyphs
`\p{script=Hmng}`	Pahawh Hmong
`\p{script=Hung}`	Old Hungarian
`\p{script=Ital}`	Old Italic
`\p{script=Kawi}`	Kawi
`\p{script=Khar}`	Kharoshthi
`\p{script=Khoj}`	Khojki
`\p{script=Kits}`	Khitan Small Script
`\p{script=Kthi}`	Kaithi
`\p{script=Lina}`	Linear A
`\p{script=Linb}`	Linear B
`\p{script=Lyci}`	Lycian
`\p{script=Lydi}`	Lydian
`\p{script=Maka}`	Makasar
`\p{script=Mahj}`	Mahajani
`\p{script=Mani}`	Manichaean
`\p{script=Marc}`	Marchen
`\p{script=Medf}`	Medefaidrin
`\p{script=Mend}`	Mende Kikakui
`\p{script=Merc}`	Meroitic Cursive
`\p{script=Mero}`	Meroitic Hieroglyphs
`\p{script=Modi}`	Modi
`\p{script=Mong}`	Mongolian
`\p{script=Mroo}`	Mro
`\p{script=Mult}`	Multani
`\p{script=Nagm}`	Nag Mundari
`\p{script=Narb}`	Old North Arabian
`\p{script=Nand}`	Nandinagari
`\p{script=Nbat}`	Nabataean
`\p{script=Nshu}`	Nushu
`\p{script=Ogam}`	Ogham
`\p{script=Orkh}`	Old Turkic
`\p{script=Osma}`	Osmanya
`\p{script=Ougr}`	Old Uyghur
`\p{script=Palm}`	Palmyrene
`\p{script=Pauc}`	Pau Cin Hau
`\p{script=Perm}`	Old Permic
`\p{script=Phag}`	Phags-pa
`\p{script=Phli}`	Inscriptional Pahlavi
`\p{script=Phlp}`	Psalter Pahlavi
`\p{script=Phnx}`	Phoenician
`\p{script=Prti}`	Inscriptional Parthian
`\p{script=Rjng}`	Rejang
`\p{script=Runr}`	Runic
`\p{script=Samr}`	Samaritan
`\p{script=Sarb}`	Old South Arabian
`\p{script=Sgnw}`	SignWriting
`\p{script=Shaw}`	Shavian
`\p{script=Shrd}`	Sharada
`\p{script=Sidd}`	Siddham
`\p{script=Sind}`	Khudawadi
`\p{script=Sora}`	Sora Sompeng
`\p{script=Sogd}`	Sogdian
`\p{script=Sogo}`	Old Sogdian
`\p{script=Soyo}`	Soyombo
`\p{script=Tagb}`	Tagbanwa
`\p{script=Takr}`	Takri
`\p{script=Tang}`	Tangut
`\p{script=Tglg}`	Tagalog
`\p{script=Tirh}`	Tirhuta
`\p{script=Tnsa}`	Tangsa
`\p{script=Toto}`	Toto
`\p{script=Ugar}`	Ugaritic
`\p{script=Vith}`	Vithkuqi
`\p{script=Wara}`	Warang Citi
`\p{script=Xpeo}`	Old Persian
`\p{script=Xsux}`	Cuneiform
`\p{script=Yezi}`	Yezidi
`\p{script=Zanb}`	Zanabazar Square

Some characters used with recommended scripts may still be problematic for identifiers, for example because they are part of extensions that are not in modern customary use, and thus implementations may want to exclude them from identifiers. These include characters for historic and obsolete orthographies, characters used mostly liturgically, and characters in orthographies for languages used only in very small communities or with very limited current or declining usage. Some characters also have architectural issues that may make them unsuitable for identifiers. See UTS #39, Unicode Security Mechanisms [UTS39] for more information.

The scripts listed in Table 5, Recommended Scripts are generally recommended for use in identifiers. These are in widespread modern customary use, or are regional scripts in modern customary use by large communities.

Table 5. Recommended Scripts

Property Notation	Description
`\p{script=Zyyy}`	Common
`\p{script=Zinh}`	Inherited
`\p{script=Arab}`	Arabic
`\p{script=Armn}`	Armenian
`\p{script=Beng}`	Bengali
`\p{script=Bopo}`	Bopomofo
`\p{script=Cyrl}`	Cyrillic
`\p{script=Deva}`	Devanagari
`\p{script=Ethi}`	Ethiopic
`\p{script=Geor}`	Georgian
`\p{script=Grek}`	Greek
`\p{script=Gujr}`	Gujarati
`\p{script=Guru}`	Gurmukhi
`\p{script=Hang}`	Hangul
`\p{script=Hani}`	Han
`\p{script=Hebr}`	Hebrew
`\p{script=Hira}`	Hiragana
`\p{script=Kana}`	Katakana
`\p{script=Knda}`	Kannada
`\p{script=Khmr}`	Khmer
`\p{script=Laoo}`	Lao
`\p{script=Latn}`	Latin
`\p{script=Mlym}`	Malayalam
`\p{script=Mymr}`	Myanmar
`\p{script=Orya}`	Oriya
`\p{script=Sinh}`	Sinhala
`\p{script=Taml}`	Tamil
`\p{script=Telu}`	Telugu
`\p{script=Thaa}`	Thaana
`\p{script=Thai}`	Thai
`\p{script=Tibt}`	Tibetan

As of Unicode 10.0, there is no longer a distinction between aspirational use and limited use scripts, as this has not proven to be productive for the derivation of identifier-related classes used in security profiles. (See UTS #39, Unicode Security Mechanisms [UTS39].) Thus the aspirational use scripts in Table 6, Aspirational Use Scripts have been recategorized as Limited Use and moved to Table 7, Limited Use Scripts.

Table 6. Aspirational Use Scripts (Withdrawn)

Property Notation	Description
intentionally blank

Modern scripts that are in more limited use are listed in Table 7, Limited Use Scripts. To avoid security issues, some implementations may wish to disallow the limited-use scripts in identifiers. For more information on usage, see the Unicode Locale project [CLDR].

Table 7. Limited Use Scripts

Property Notation	Description
`\p{script=Adlm}`	Adlam
`\p{script=Bali}`	Balinese
`\p{script=Bamu}`	Bamum
`\p{script=Batk}`	Batak
`\p{script=Cakm}`	Chakma
`\p{script=Cans}`	Canadian Aboriginal Syllabics
`\p{script=Cham}`	Cham
`\p{script=Cher}`	Cherokee
`\p{script=Hmnp}`	Nyiakeng Puachue Hmong
`\p{script=Java}`	Javanese
`\p{script=Kali}`	Kayah Li
`\p{script=Lana}`	Tai Tham
`\p{script=Lepc}`	Lepcha
`\p{script=Limb}`	Limbu
`\p{script=Lisu}`	Lisu
`\p{script=Mand}`	Mandaic
`\p{script=Mtei}`	Meetei Mayek
`\p{script=Newa}`	Newa
`\p{script=Nkoo}`	Nko
`\p{script=Olck}`	Ol Chiki
`\p{script=Osge}`	Osage
`\p{script=Plrd}`	Miao
`\p{script=Rohg}`	Hanifi Rohingya
`\p{script=Saur}`	Saurashtra
`\p{script=Sund}`	Sundanese
`\p{script=Sylo}`	Syloti Nagri
`\p{script=Syrc}`	Syriac
`\p{script=Tale}`	Tai Le
`\p{script=Talu}`	New Tai Lue
`\p{script=Tavt}`	Tai Viet
`\p{script=Tfng}`	Tifinagh
`\p{script=Vaii}`	Vai
`\p{script=Wcho}`	Wancho
`\p{script=Yiii}`	Yi

This is the recommendation as of the current version of Unicode; as new scripts are added to future versions of Unicode, characters and scripts may be added to Tables 4, 5, and 7. Characters may also be moved from one table to another as more information becomes available.
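
As an illustration, the three tables can be treated as a lookup from script code to identifier status. The following Python sketch hard-codes only a handful of entries copied from Tables 4, 5, and 7; the names SCRIPT_STATUS, script_status, and allowed_in_identifiers are hypothetical, and a real implementation would presumably generate the full mapping from the version of the data it targets.

```python
# Excerpt only: a few script codes copied from Tables 4, 5, and 7.
# The dictionary and function names are illustrative, not part of any API.
SCRIPT_STATUS = {
    # Table 5, Recommended Scripts
    "Zyyy": "recommended", "Zinh": "recommended", "Latn": "recommended",
    "Grek": "recommended", "Cyrl": "recommended", "Hani": "recommended",
    # Table 7, Limited Use Scripts
    "Adlm": "limited_use", "Cher": "limited_use", "Syrc": "limited_use",
    # Table 4, Excluded Scripts
    "Goth": "excluded", "Runr": "excluded", "Mong": "excluded",
}


def script_status(code: str) -> str:
    """Return the table a script code falls in, or 'unlisted' for this excerpt."""
    return SCRIPT_STATUS.get(code, "unlisted")


def allowed_in_identifiers(code: str, allow_limited_use: bool = False) -> bool:
    """Accept recommended scripts, and limited-use scripts only on request."""
    status = script_status(code)
    if status == "recommended":
        return True
    return allow_limited_use and status == "limited_use"
```

Making the limited-use scripts opt-in mirrors the observation above that some implementations may wish to disallow them.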

There are a few special cases:

The Common and Inherited script values [\p{script=Zyyy}\p{script=Zinh}] are used widely with other scripts, rather than being scripts per se. See also the Script_Extensions property in the Unicode Character Database [UAX44].
The Unknown script \p{script=Zzzz} is used for Unassigned characters.
Braille \p{script=Brai} consists only of symbols.
Katakana_Or_Hiragana \p{script=Hrkt} is empty. This value was used in earlier versions, but is no longer used.
With respect to the scripts Balinese, Cham, Ol Chiki, Vai, Kayah Li, and Saurashtra, there may be large communities of people speaking an associated language, but the script itself is not in widespread use. However, there are significant revival efforts.
Bopomofo is used primarily in education.

For programming language identifiers, normalization and case have a number of important implications. For a discussion of these issues, see Section 5, Normalization and Case.

2.5 Backward Compatibility

Unicode General_Category values are kept as stable as possible, but they can change across versions of the Unicode Standard. The bulk of the characters having a given value are determined by other properties, and the coverage expands in the future according to the assignment of those properties. In addition, the Other_ID_Start property provides a small list of characters that qualified as ID_Start characters in some previous version of Unicode solely on the basis of their General_Category properties, but that no longer qualify in the current version.

The Other_ID_Start property includes characters such as the following:

U+2118 ( ℘ ) SCRIPT CAPITAL P
U+212E ( ℮ ) ESTIMATED SYMBOL
U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C ( ゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Similarly, the Other_ID_Continue property adds a small list of characters that qualified as ID_Continue characters in some previous version of Unicode solely on the basis of their General_Category properties, but that no longer qualify in the current version.

The Other_ID_Continue property includes characters such as the following:

U+1369 ETHIOPIC DIGIT ONE...U+1371 ETHIOPIC DIGIT NINE
U+00B7 ( · ) MIDDLE DOT
U+0387 ( · ) GREEK ANO TELEIA
U+19DA ( ᧚ ) NEW TAI LUE THAM DIGIT ONE

The exact list of characters covered by the Other_ID_Start and Other_ID_Continue properties depends on the version of Unicode. For more information, see Unicode Standard Annex #44, “Unicode Character Database” [UAX44].

The Other_ID_Start and Other_ID_Continue properties are thus designed to ensure that the Unicode identifier specification is backward compatible. Any sequence of characters that qualified as an identifier in some version of Unicode will continue to qualify as an identifier in future versions.

If a specification tailors the Unicode recommendations for identifiers, then this technique can also be used to maintain backwards compatibility across versions.
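
The effect of these grandfathering properties can be observed in an existing implementation. The following Python sketch uses str.isidentifier purely as a convenient probe; Python's identifier grammar is based on XID_Start and XID_Continue rather than ID_Start and ID_Continue, so it is an approximation of the definitions discussed here, and the helper name category_and_validity is hypothetical.

```python
import unicodedata


def category_and_validity(text: str) -> tuple[str, bool]:
    """Return the General_Category of the first character of text and
    whether the whole text is accepted as an identifier by Python."""
    return unicodedata.category(text[0]), text.isidentifier()


# U+2118 SCRIPT CAPITAL P and U+212E ESTIMATED SYMBOL are symbols by
# General_Category, yet both remain valid identifier starts.
script_p = category_and_validity("\u2118")
estimated = category_and_validity("\u212E")

# U+00B7 MIDDLE DOT is punctuation, yet it may continue an identifier;
# it may not start one.
continues = "a\u00B7b".isidentifier()
starts = "\u00B7b".isidentifier()
```

The General_Category values alone would exclude all three characters; they stay valid only because they are listed explicitly.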

3 Immutable Identifiers

The disadvantage of working with the lexical classes defined previously is the storage space needed for the detailed definitions, plus the fact that with each new version of the Unicode Standard new characters are added, which an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.

This problem can be addressed by turning the question around. Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML specification by the W3C, Version 1.0 5th Edition or later [XML].

By increasing the set of disallowed characters, a reasonably intuitive recommendation for identifiers can be achieved. This approach uses the full specification of identifier classes, as of a particular version of the Unicode Standard, and permanently disallows any characters not recommended in that version for inclusion in identifiers. All code points unassigned as of that version would be allowed in identifiers, so that any future additions to the standard would already be accounted for. This approach ensures both upwardly compatible identifier stability and a reasonable division of characters into those that do and do not make human sense as part of identifiers.

With or without such fine-tuning, such a compromise approach still incurs the expense of implementing large lists of code points. While they no longer change over time, it is a matter of choice whether the benefit of enforcing somewhat word-like identifiers justifies their cost.

Alternatively, one can use the properties described below and allow all sequences of characters to be identifiers that are neither Pattern_Syntax nor Pattern_White_Space. This has the advantage of simplicity and small tables, but allows many more “unnatural” identifiers.

UAX31-R2. Immutable Identifiers: To meet this requirement, an implementation shall choose either UAX31-R2-1 or UAX31-R2-2.

UAX31-R2-1. Define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

Pattern_White_Space=True
Pattern_Syntax=True
General_Category=Private_Use, Surrogate, or Control
Noncharacter_Code_Point=True
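
The check in UAX31-R2-1 can be sketched in Python as follows. The standard library does not expose Pattern_White_Space or Pattern_Syntax, so the code point ranges below were transcribed by hand and should be verified against PropList.txt before any real use; the function names are hypothetical.

```python
import unicodedata

# Pattern_White_Space: 11 code points (hand-transcribed; verify against PropList.txt).
PATTERN_WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)

# Pattern_Syntax as inclusive ranges (hand-transcribed; verify against PropList.txt).
PATTERN_SYNTAX_RANGES = (
    (0x0021, 0x002F), (0x003A, 0x0040), (0x005B, 0x005E), (0x0060, 0x0060),
    (0x007B, 0x007E), (0x00A1, 0x00A7), (0x00A9, 0x00A9), (0x00AB, 0x00AC),
    (0x00AE, 0x00AE), (0x00B0, 0x00B1), (0x00B6, 0x00B6), (0x00BB, 0x00BB),
    (0x00BF, 0x00BF), (0x00D7, 0x00D7), (0x00F7, 0x00F7), (0x2010, 0x2027),
    (0x2030, 0x203E), (0x2041, 0x2053), (0x2055, 0x205E), (0x2190, 0x245F),
    (0x2500, 0x2775), (0x2794, 0x2BFF), (0x2E00, 0x2E7F), (0x3001, 0x3003),
    (0x3008, 0x3020), (0x3030, 0x3030), (0xFD3E, 0xFD3F), (0xFE45, 0xFE46),
)


def is_pattern_syntax(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in PATTERN_SYNTAX_RANGES)


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_immutable_identifier(text: str) -> bool:
    """Non-empty, and no character has an excluded property value."""
    if not text:
        return False
    for ch in text:
        cp = ord(ch)
        if cp in PATTERN_WHITE_SPACE or is_pattern_syntax(cp) or is_noncharacter(cp):
            return False
        # Private_Use (Co), Surrogate (Cs), and Control (Cc).
        if unicodedata.category(ch) in {"Co", "Cs", "Cc"}:
            return False
    return True
```

Only the General_Category lookup depends on the Unicode version of the runtime, and those three categories are fixed code point ranges, so the result should not vary across versions. Unassigned code points are accepted, as the definition requires.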

UAX31-R2-2. Declare that it uses a profile of UAX31-R2-1 and define that profile with a precise specification of the characters and character sequences that are added to or removed from the sets of code points defined by these properties and/or provide a list of additional constraints on identifiers.

Note: The expectation from an implementation meeting requirement UAX31-R2 Immutable Identifiers is that it will never change its definition of identifiers; in particular, that it will not switch to UAX31-R1 Default Identifiers. However, the downsides of normalization issues and the inapplicability of measures guarding against spoofing attacks may warrant such a change in definition. In such circumstances, a profile should be used to extend XID_Start and XID_Continue to cover likely existing usages. See Section 3.3, Language Evolution, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

In its profile, a specification can define identifiers to be more in accordance with the Unicode identifier definitions at the time the profile is adopted, while still allowing for strict immutability. For example, an implementation adopting a profile after a particular version of Unicode is released (such as Unicode 5.0) could define the profile as follows:

All characters satisfying UAX31-R1 Default Identifiers according to Unicode 5.0
Plus all code points unassigned in Unicode 5.0 that do not have the property values specified in UAX31-R2 Immutable Identifiers.

This technique allows identifiers to have a more natural format (excluding symbols and punctuation already defined) yet also provides absolute code point immutability.

Immutable identifiers are intended for those cases (like XML) that cannot update across versions of Unicode, and do not require information about normalization form, or properties such as General_Category and Script. Immutable identifiers that allow unassigned characters cannot provide for normalization forms or these properties, which means that they:

cannot be compared for NFC, NFKC, or case-insensitive equality
are unsuitable for restrictions such as those in UTS #39

For best practice, a profile disallowing unassigned characters should be provided where possible.

Specifications should also include guidelines and recommendations for those creating new identifiers. Although UAX31-R2 Immutable Identifiers permits a wide range of characters, as a best practice identifiers should be in the format NFKC, without using any unassigned characters. For more information on NFKC, see Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15].

4 Whitespace and Syntax

Most programming languages have a concept of whitespace as part of their lexical structure, as well as some set of characters that are disallowed in identifiers but have syntactic use, such as arithmetic operators. Beyond general programming languages, there are also many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel or ICU number formats, and many others. In the past, regular expressions and other formal languages have been forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.

For forward and backward compatibility, it is advantageous to have a fixed set of whitespace and syntax code points. This follows the recommendations that the Unicode Consortium has made regarding completely stable identifiers, and the practice that is seen in XML 1.0, 5th Edition or later [XML]. (In particular, the Unicode Consortium is committed to not allocating characters suitable for identifiers in the range U+2190..U+2BFF, which is being used by XML 1.0, 5th Edition.)

As of Unicode 4.1, two Unicode character properties are defined to provide for stable syntax: Pattern_White_Space and Pattern_Syntax. Particular languages may, of course, override these recommendations, for example, by adding or removing other characters for compatibility with ASCII usage.

For stability, the values of these properties are absolutely invariant, not changing with successive versions of Unicode. Of course, this does not limit the ability of the Unicode Standard to encode more symbol or whitespace characters, but the default sets of syntax and whitespace code points recommended for use in computer languages will not change.

UAX31-R3. Pattern_White_Space and Pattern_Syntax Characters: To meet this requirement, an implementation shall meet both UAX31-R3a and UAX31-R3b.

Note: When meeting requirement UAX31-R3 with no profile, all characters except those that have the Pattern_White_Space or Pattern_Syntax properties are available for use in the definition of identifiers or literals.

4.1 Whitespace

Many computer languages treat two categories of whitespace differently: horizontal space (such as the ASCII horizontal tabulation and space), and line terminators.

When a syntax supports non-ASCII characters, it is useful to consider a third category: ignorable format controls. Ignorable format controls may be inserted between lexical elements in order to resolve bidirectional ordering issues, as described in Section 4.1.1, Bidirectional Ordering. The insertion of these characters does not change the meaning of the program; in particular, they are not spacing characters. See Section 4.1.2, Required Spaces.

Note: Allowing for the insertion of ignorable format controls does not prevent spoofing based on bidirectional reordering. In order to guard against such spoofing, implementations should make use of the higher-level protocols and conversion to plain text described in Unicode Standard Annex #9, “Unicode Bidirectional Algorithm” [UAX9]. See Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

Note: Since these characters are allowed only where a boundary would, in their absence, exist between lexical elements, an implementation could ignore them when lexing, and then consider as illegal any lexical element that contains them. An exception must be made for comments and strings, which should be able to freely contain these characters.

Implementations should also allow these characters in other contexts where reordering issues could arise. See Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

UAX31-R3a. Pattern_White_Space Characters: To meet this requirement, an implementation shall choose either UAX31-R3a-1 or UAX31-R3a-2.

UAX31-R3a-1. Use Pattern_White_Space characters as the set of characters interpreted as whitespace in parsing, as follows:

1. A sequence of one or more of any of the following characters shall be interpreted as a sequence of one or more end of line:
   1. U+000A (line feed)
   2. U+000B (vertical tabulation)
   3. U+000C (form feed)
   4. U+000D (carriage return)
   5. U+0085 (next line)
   6. U+2028 LINE SEPARATOR
   7. U+2029 PARAGRAPH SEPARATOR
2. The Pattern_White_Space characters with the property Default_Ignorable_Code_Point shall be treated as ignorable format controls; they shall be allowed in the contexts UAX31-I1, UAX31-I2, and UAX31-I3 defined in Section 4.1.3, Contexts for Ignorable Format Controls, where their insertion shall have no effect on the meaning of the program.
3. All other characters in Pattern_White_Space shall be interpreted as horizontal space.

UAX31-R3a-2. Declare that it uses a profile of UAX31-R3a-1 and define that profile with a precise specification of the characters that are added to or removed from the set of code points defined by the Pattern_White_Space property, and of any changes to the criteria under which a character or sequence of characters is interpreted as an end of line, as ignorable format controls, or as horizontal space.

Note: The characters to be treated as ignorable format controls under item 2 of UAX31-R3a-1 are U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK. The characters to be treated as horizontal space under item 3 of UAX31-R3a-1 are U+0020 SPACE and U+0009 (horizontal tabulation, TAB).
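
The three-way split described by UAX31-R3a-1 and the note above can be sketched in Python. The enumeration and function names are hypothetical; the character sets are exactly those listed in this section, with no profile applied.

```python
from enum import Enum
from typing import Optional


class WhitespaceKind(Enum):
    LINE_TERMINATOR = "line terminator"
    IGNORABLE_FORMAT_CONTROL = "ignorable format control"
    HORIZONTAL_SPACE = "horizontal space"


# The seven end-of-line characters listed in UAX31-R3a-1.
LINE_TERMINATORS = frozenset("\u000A\u000B\u000C\u000D\u0085\u2028\u2029")
# LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
IGNORABLE_FORMAT_CONTROLS = frozenset("\u200E\u200F")
# SPACE and horizontal tabulation.
HORIZONTAL_SPACES = frozenset("\u0020\u0009")


def classify_whitespace(ch: str) -> Optional[WhitespaceKind]:
    """Return the whitespace category of ch, or None if ch is not
    a Pattern_White_Space character."""
    if ch in LINE_TERMINATORS:
        return WhitespaceKind.LINE_TERMINATOR
    if ch in IGNORABLE_FORMAT_CONTROLS:
        return WhitespaceKind.IGNORABLE_FORMAT_CONTROL
    if ch in HORIZONTAL_SPACES:
        return WhitespaceKind.HORIZONTAL_SPACE
    return None
```

Characters such as U+00A0 NO-BREAK SPACE fall outside all three sets, because they are not Pattern_White_Space characters.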

Note: The characters LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK are two of the Implicit Directional Marks defined by Section 2.6, Implicit Directional Marks, in Unicode Standard Annex #9, “Unicode Bidirectional Algorithm” [UAX9]. The third one, ARABIC LETTER MARK, is used far less frequently than the others, even in Arabic text; its behavior differs subtly from RIGHT-TO-LEFT MARK in ways that are not usually relevant to the ordering of source code. If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.

Note: Failing to interpret all characters listed in item 1 of UAX31-R3a-1 as line terminators would lead to spoofing issues; see Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

4.1.1 Bidirectional Ordering

Requirement UAX31-R3a is relevant even for languages that do not use immutable identifiers, or that have lexical structure outside of the categories of syntax and whitespace characters. In particular, the set of Pattern_White_Space characters is chosen to make it possible to correct bidirectional ordering issues that can arise in a wide range of programming languages, visually obfuscating the logic of expressions. In the absence of higher-level protocols (see Section 4.3, Higher-Level Protocols, in [UAX9]), tokens may be visually reordered by the Unicode Bidi Algorithm in bidirectional source text, producing a visual result that conveys a different logical intent. To remedy that, two implicit directional marks are among Pattern_White_Space characters; if these can be freely inserted between tokens, implicit directional marks consistent with the paragraph direction can be used to ensure that the visual order of tokens matches their logical order.

Example: Consider the following two lines:
(1) x + tav == 1
(2) x + תו == 1
Internally, they are the same except that the ASCII identifier tav in line (1) is replaced by the Hebrew identifier תו in line (2). However, with a plain text display (with left-to-right paragraph direction) the user will be misled, thinking that line (2) is a comparison between (x + 1) and תו, whereas it is actually a comparison between (x + תו) and 1. The misleading rendering of (2) occurs because the directionality of the identifier תו influences subsequent weakly-directional tokens; inserting a left-to-right mark after the identifier תו stops it from influencing the remainder of the line, and thus yields a better rendering in plain text with left-to-right paragraph direction, as demonstrated in the following table, wherein characters whose ordering is affected by that identifier have been highlighted.
Section 5.2, Conversion to Plain Text, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55], specifies an algorithm for the automatic insertion of LRM characters.

Underlying Representation	Display (LTR paragraph direction)
`x`		`+`		`ת`	`ו`		`=`	`=`		`1`	`x +תו == 1`
`x`		`+`		`ת`	`ו`	⟨LRM⟩		`=`	`=`		`1`	`x +תו‎ == 1`

Note: Left-to-right marks are used for this purpose when the main direction is left-to-right. Correspondingly, right-to-left marks are used when the main direction is right-to-left.
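
The bidirectional classes behind the example can be inspected with Python's unicodedata module. In the sketch below, add_lrm_after is a hypothetical helper that performs a naive textual replacement for illustration only; it is not the insertion algorithm specified in [UTS55], and it would also match the identifier inside longer names, strings, or comments.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK

# Line (2) of the example, written with escapes to keep the source unambiguous.
HEBREW_IDENTIFIER = "\u05EA\u05D5"
LINE_2 = "x + " + HEBREW_IDENTIFIER + " == 1"


def bidi_classes(text: str) -> list[str]:
    """Return the Bidi_Class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def add_lrm_after(source: str, identifier: str) -> str:
    """Naively append a left-to-right mark to every occurrence of identifier."""
    return source.replace(identifier, identifier + LRM)


classes = bidi_classes(LINE_2)
fixed = add_lrm_after(LINE_2, HEBREW_IDENTIFIER)
```

The Hebrew letters are strongly right-to-left (R), while the equals signs (ON) and the digit (EN) have no strong direction of their own, which is why they are drawn into the right-to-left run. The inserted mark has class L and ends that run immediately after the identifier.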

4.1.2 Required Spaces

Since the implicit directional marks are nonspacing, where a syntax requires a sequence of spaces (such as between identifiers), it should require that at least one of those be neither LEFT-TO-RIGHT MARK nor RIGHT-TO-LEFT MARK. The visual appearance would otherwise be too confusing to readers: “else⟨LRM⟩if” would be seen by the user as “elseif” but parsed by the compiler as “else if”, whereas “else⟨LRM⟩ if” would be seen and parsed as “else if” and be harmless.
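
A lexer could enforce this rule with a check along the following lines. This is a Python sketch under the assumption that the lexer hands over the run of Pattern_White_Space characters found between two tokens; the function name is hypothetical.

```python
# LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK: whitespace, but nonspacing.
IMPLICIT_MARKS = frozenset("\u200E\u200F")
# The remaining Pattern_White_Space characters, which are visible as spacing.
SPACING_WHITESPACE = frozenset("\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u2028\u2029")


def satisfies_required_space(run: str) -> bool:
    """Return True if run is made only of whitespace and contains at least
    one character that is neither LRM nor RLM."""
    if not run:
        return False
    if any(ch not in IMPLICIT_MARKS and ch not in SPACING_WHITESPACE for ch in run):
        return False
    return any(ch in SPACING_WHITESPACE for ch in run)
```

With this check, the separator in “else⟨LRM⟩if” is rejected, while the separator in “else⟨LRM⟩ if” is accepted.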

4.1.3 Contexts for Ignorable Format Controls

Implementations should at least allow for the insertion of ignorable format controls in the following contexts, illustrated by examples wherein the ignorable format control is represented by ⟨LRM⟩.

UAX31-I1. Adjacent to lexical horizontal space (within a sequence of lexical horizontal spaces, or at the start or end of such a sequence).

Example: Between the following keywords separated by a space:
else⟨LRM⟩if

Note: The phrase “lexical horizontal space” refers to characters that are not merely in the set of horizontal space characters, but are also in a context where they are lexically spaces. For instance, it does not include horizontal space characters in string literals. Implementations should permit these characters in string literals, but in such a literal, their insertion has an effect on the meaning of the program, as they are then present in the string represented by that literal.

UAX31-I2. As optional space, that is, wherever horizontal space could be inserted without changing the meaning of the program.

Example: Before the plus sign in the following arithmetic expression:
x⟨LRM⟩+1

UAX31-I3. At the start and end of a lexical line.

Example: Before the word import in the following line of Python:
⟨LRM⟩import unicodedata

Note: As is the case for UAX31-I1, the start and end of a “lexical line” in UAX31-I3 does not include the start and end of a line in a multiline string literal, respectively. This context is distinct from UAX31-I2 in languages where leading or trailing spaces are meaningful.

4.2 Syntax

The lexical structure of formal languages involves characters that are not allowed in identifiers and are not whitespace, but that have some special lexical significance other than being literal characters (such as in string literals) or ignored (such as in comments). These are referred to in this document as characters with syntactic use.

Examples of characters with syntactic use include:

decimal marks in numeric literals
arithmetic operators, such as +, -, *, /
parentheses and other brackets
characters in comment delimiters, such as #, /*, --, or ⍝
quotation marks delimiting strings
characters such as \ introducing escape sequences

It is useful to bound the set of characters with syntactic use. This makes it possible to build tools that handle source code, but do not validate it, such as syntax highlighters, in a forward-compatible way; see Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55]. It further provides a stable set of characters that can be used for user-defined operators. In addition, this allows for backward compatibility of literals (including patterns), as described in Section 4.3, Pattern Syntax.

UAX31-R3b. Pattern_Syntax Characters: To meet this requirement, an implementation shall choose either UAX31-R3b-1 or UAX31-R3b-2.

UAX31-R3b-1. Use Pattern_Syntax characters as the set of characters with syntactic use. The following sets shall be disjoint:

characters allowed in identifiers
characters treated as whitespace
characters with syntactic use

UAX31-R3b-2. Declare that it uses a profile of UAX31-R3b-1 and define that profile with a precise specification of the characters that are added to or removed from the set of code points defined by the Pattern_Syntax property.

Note: When meeting requirement UAX31-R3b, characters allowed in identifiers may be given special significance in the syntax even when they are not part of identifiers.
For instance, in a language which uses the C syntax for hexadecimal literals and meets requirement UAX31-R1, the literal 0xDEADBEEF consists entirely of identifier characters, yet the 0x has special significance in the syntax, and the characters after that prefix are subject to special restrictions (only 0 through 9 and A through F are allowed).
However, characters outside of those allowed in identifiers, those treated as whitespace, and the set [:Pattern_Syntax:] cannot be given special significance in the syntax. For instance, if a language meets requirements UAX31-R1 and UAX31-R3 with no profile and allows for user-defined operators, that language cannot allow the user to define an operator 🐈.
Characters outside of those allowed in identifiers, those treated as whitespace, and those with syntactic use can still be allowed in a program, for instance, as part of string literals or comments.

4.2.1 User-Defined Operators

Some programming languages allow for user-defined operators. When meeting requirement UAX31-R3b, the set of characters that can be allowed in operators is limited; however, that leaves open the exact definition of operators. In order to avoid ambiguities in lexical analysis, operators should not be allowed to contain characters that may be found at the beginning of an identifier or literal; for instance, +1 or −x should not be operators.

The following definition avoids such interactions with default identifiers and with numbers.

UAX31-R3c. Operator Identifiers: To meet this requirement, an implementation shall meet requirement UAX31-R3b Pattern_Syntax Characters, and, to determine whether a string is an operator, it shall choose either UAX31-R3c-1 or UAX31-R3c-2.

UAX31-R3c-1. Use definition UAX31-D1, setting Start to be the set of characters with syntactic use, setting Continue to be the union of the set of characters with syntactic use and the set of characters with General_Category Mn, and leaving Medial empty.

UAX31-R3c-2. Declare that it uses a profile of UAX31-R3c-1 and define that profile with a precise specification of the characters and character sequences that are added to or removed from Start, Continue, and Medial and/or provide a list of additional constraints on operators.

Note: The set of Pattern_Syntax characters, which is the default for characters with syntactic use, contains some emoji. Implementations may wish to remove them, either to allow for their use in identifiers, or to reduce potential confusion arising from ⚽ being an operator but 🏉 not being one. This may be done using the standard profile for UAX31-R3b Pattern_Syntax Characters defined in Section 7.2, Emoji Profile.
Nonspacing marks are included in Continue because they are part of the representation for many operators, such as some of the negated operators.
Unassigned code points are not characters; they are therefore excluded by this definition.

When meeting this requirement, a profile is likely to be needed depending on the specifics of the syntax. For instance, a programming language wherein string literals start with " should remove that character from the characters allowed in operators.
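
The default operator definition of UAX31-R3c-1, with no profile, can be sketched in Python as a Start character followed by any number of Continue characters. The Pattern_Syntax ranges were transcribed by hand and should be verified against PropList.txt; the function names are hypothetical, and the exclusion of unassigned code points relies on the Unicode version shipped with the Python runtime.

```python
import unicodedata

# Pattern_Syntax as inclusive ranges (hand-transcribed; verify against PropList.txt).
PATTERN_SYNTAX_RANGES = (
    (0x0021, 0x002F), (0x003A, 0x0040), (0x005B, 0x005E), (0x0060, 0x0060),
    (0x007B, 0x007E), (0x00A1, 0x00A7), (0x00A9, 0x00A9), (0x00AB, 0x00AC),
    (0x00AE, 0x00AE), (0x00B0, 0x00B1), (0x00B6, 0x00B6), (0x00BB, 0x00BB),
    (0x00BF, 0x00BF), (0x00D7, 0x00D7), (0x00F7, 0x00F7), (0x2010, 0x2027),
    (0x2030, 0x203E), (0x2041, 0x2053), (0x2055, 0x205E), (0x2190, 0x245F),
    (0x2500, 0x2775), (0x2794, 0x2BFF), (0x2E00, 0x2E7F), (0x3001, 0x3003),
    (0x3008, 0x3020), (0x3030, 0x3030), (0xFD3E, 0xFD3F), (0xFE45, 0xFE46),
)


def has_syntactic_use(ch: str) -> bool:
    """Assigned character with the Pattern_Syntax property."""
    cp = ord(ch)
    in_pattern_syntax = any(lo <= cp <= hi for lo, hi in PATTERN_SYNTAX_RANGES)
    # Unassigned code points (Cn) are not characters, so they are excluded.
    return in_pattern_syntax and unicodedata.category(ch) != "Cn"


def is_operator_continue(ch: str) -> bool:
    """Continue: characters with syntactic use, plus nonspacing marks (Mn)."""
    return has_syntactic_use(ch) or unicodedata.category(ch) == "Mn"


def is_operator(text: str) -> bool:
    """Start Continue*, with Medial left empty."""
    if not text or not has_syntactic_use(text[0]):
        return False
    return all(is_operator_continue(ch) for ch in text[1:])
```

Under this sketch, an equals sign followed by U+0338 COMBINING LONG SOLIDUS OVERLAY is an operator, whereas +1 is not, because the digit is not in Continue. A language whose string literals start with " would additionally need to remove that character, as noted above.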

4.3 Pattern Syntax

With a fixed set of whitespace and syntax code points, a pattern language can have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. Using this policy preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results. Consider the following scenario, for example.

In version 1.0 of program X, '≈' is a reserved syntax character; that is, it does not perform an operation, and it needs to be quoted. In this example, '\' quotes the next character; that is, it causes it to be treated as a literal instead of a syntax character. In version 2.0 of program X, '≈' is given a real meaning, for example, “uppercase the subsequent characters”.
The pattern abc...\≈...xyz works on both versions 1.0 and 2.0, and refers to the literal character because it is quoted in both cases.
The pattern abc...≈...xyz works on version 2.0 and uppercases the following characters. On version 1.0, the engine (rightfully) has no idea what to do with ≈. Rather than silently fail (by ignoring ≈ or turning it into a literal), it has the opportunity to signal an error.
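
The scenario above can be sketched as a toy Python tokenizer for the hypothetical program X. Everything here is illustrative: RESERVED holds a single reserved character rather than the full Pattern_Syntax set, and the active parameter stands for the syntax characters that a given version of X assigns a meaning to.

```python
# Hypothetical reserved set for "program X"; a real pattern language would
# reserve every Pattern_Syntax and Pattern_White_Space code point.
RESERVED = frozenset("\u2248")  # the character ≈
QUOTE = "\\"


def tokenize(pattern: str, active: frozenset) -> list[tuple[str, str]]:
    """Split pattern into ('literal', ch) and ('syntax', ch) tokens.

    A reserved character that is unquoted and has no meaning in this
    version raises ValueError instead of being silently accepted."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == QUOTE:
            if i + 1 >= len(pattern):
                raise ValueError("dangling quote character at end of pattern")
            tokens.append(("literal", pattern[i + 1]))
            i += 2
            continue
        if ch in active:
            tokens.append(("syntax", ch))
        elif ch in RESERVED:
            raise ValueError(f"unquoted reserved character U+{ord(ch):04X}")
        else:
            tokens.append(("literal", ch))
        i += 1
    return tokens


VERSION_1 = frozenset()          # ≈ is reserved but has no meaning yet
VERSION_2 = frozenset("\u2248")  # ≈ has been given a meaning
```

A quoted ≈ yields the same literal token under both versions, while an unquoted ≈ is a syntax token under VERSION_2 and an error under VERSION_1.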

When generating rules or patterns, all whitespace and syntax code points that are to be literals require quoting, using whatever quoting mechanism is available. For readability, it is recommended practice to quote or escape all literal whitespace and default-ignorable code points as well.

Consider the following example, where the items in angle brackets indicate literal characters:
a<SPACE>b → x<ZERO WIDTH SPACE>y +z;
Because <SPACE> is a Pattern_White_Space character, it requires quoting. Because <ZERO WIDTH SPACE> is a default-ignorable character, it should also be quoted for readability. So in this example, if \uXXXX is used for a code point literal, but is resolved before quoting, and if single quotes are used for quoting, this example might be expressed as:
'a\u0020b' → 'x\u200By' + z;

5Normalizationand Case

This section discusses issues that must be taken into accountwhen considering normalization and case folding of identifiers inprogramming languages or scripting languages. Using normalizationavoids many problems where apparently identical identifiers are nottreated equivalently. Such problems can appear both duringcompilation and during linking—in particular across differentprogramming languages. To avoid such problems, programming languagescan normalize identifiers before storing or comparing them. Generallyif the programming language has case-sensitive identifiers, thenNormalization Form C is appropriate; whereas, if the programminglanguage has case-insensitive identifiers, then Normalization Form KCis more appropriate.

Implementations that take normalization and case into account have two choices: to treat variants as equivalent, or to disallow variants.

UAX31-R4. Equivalent Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any. If the Normalization Form is NFKC, the implementation shall apply the modifications in Section 5.1, NFKC Modifications, given by the properties XID_Start and XID_Continue. Except for identifiers containing excluded characters, any two identifiers that have the same Normalization Form shall be treated as equivalent by the implementation.
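
As an illustration of treating variants as equivalent, the following Python sketch keys a toy symbol table on the NFC form of each identifier. The SymbolTable class is hypothetical, and no characters are excluded from normalization here.

```python
import unicodedata


class SymbolTable:
    """Toy symbol table that treats NFC-equivalent identifiers as one name."""

    def __init__(self) -> None:
        self._entries = {}

    @staticmethod
    def key(identifier: str) -> str:
        # Normalize before storing or comparing.
        return unicodedata.normalize("NFC", identifier)

    def define(self, identifier: str, value) -> None:
        self._entries[self.key(identifier)] = value

    def lookup(self, identifier: str):
        return self._entries[self.key(identifier)]


table = SymbolTable()
table.define("caf\u00E9", 1)       # precomposed e with acute
print(table.lookup("cafe\u0301"))  # e + COMBINING ACUTE ACCENT, prints 1
```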

UAX31-R5. Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.

UAX31-R6. Filtered Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any. If the Normalization Form is NFKC, the implementation shall apply the modifications in Section 5.1, NFKC Modifications, given by the properties XID_Start and XID_Continue. Except for identifiers containing excluded characters, allowed identifiers must be in the specified Normalization Form.

Note: For requirement UAX31-R6, filtering involves disallowing any characters in the set \p{NFKC_QuickCheck=No}, or equivalently, disallowing \P{isNFKC}.
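
A filtering check of this kind might be sketched in Python as follows. The function name is hypothetical. The check is string-level (unicodedata.is_normalized, available from Python 3.8), so it is related to, but not identical with, the per-character set described in the note.

```python
import unicodedata


def is_allowed_nfkc_identifier(identifier: str) -> bool:
    """Accept only identifiers that are already in NFKC."""
    return unicodedata.is_normalized("NFKC", identifier)


print(is_allowed_nfkc_identifier("caf\u00E9"))   # precomposed: accepted
print(is_allowed_nfkc_identifier("cafe\u0301"))  # decomposed: rejected
print(is_allowed_nfkc_identifier("\uFB01le"))    # fi ligature: rejected
```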

UAX31-R7. Filtered Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Except for identifiers containing excluded characters, allowed identifiers must be in the specified case folded form.

Note: For requirement UAX31-R7 with full case folding, filtering involves disallowing any characters in the set \p{Changes_When_Casefolded}.
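
A minimal Python sketch of such a filter compares an identifier with its own case-folded form. The function name is hypothetical; str.casefold() is used here as the full case folding operation.

```python
def is_case_folded(identifier: str) -> bool:
    """Accept only identifiers that full case folding leaves unchanged."""
    return identifier == identifier.casefold()


print(is_case_folded("strasse"))      # already folded
print(is_case_folded("Strasse"))      # uppercase S changes when folded
print(is_case_folded("stra\u00DFe"))  # sharp s folds to "ss"
```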

As of Unicode 5.2, an additional string transform is available for use in matching identifiers: toNFKC_Casefold(S). See R5 in Section 3.13, Default Case Algorithms in [Unicode]. That operation case folds and normalizes a string, and also removes default-ignorable code points. It can be used to support an implementation of Equivalent Case and Compatibility-Insensitive Identifiers (UAX31-R4 and UAX31-R5). In order to implement requirement UAX31-R4, canonical decomposition must be applied prior to the toNFKC_Casefold operation. The resulting equivalence relation between identifiers is an identifier caseless match; see definition D147 of [Unicode]. There is a corresponding boolean property, Changes_When_NFKC_Casefolded, which can be used to support an implementation of Filtered Case and Compatibility-Insensitive Identifiers. The NFKC_Casefold character mapping property and the Changes_When_NFKC_Casefolded property are described in Unicode Standard Annex #44, "Unicode Character Database" [UAX44].
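
The Python standard library does not expose the NFKC_Casefold mapping, so the following sketch substitutes the compatibility caseless match of the Unicode Standard, NFKD(toCasefold(NFKD(toCasefold(NFD(X))))). It is an approximation, not toNFKC_Casefold itself: in particular it does not remove default-ignorable code points. The function name is hypothetical. Note that canonical decomposition (NFD) is applied first.

```python
import unicodedata


def compatibility_caseless_key(identifier: str) -> str:
    """Comparison key: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))."""
    s = unicodedata.normalize("NFD", identifier)
    s = unicodedata.normalize("NFKD", s.casefold())
    s = unicodedata.normalize("NFKD", s.casefold())
    return s


# ANGSTROM SIGN and LATIN SMALL LETTER A WITH RING ABOVE compare equal.
print(compatibility_caseless_key("\u212B") == compatibility_caseless_key("\u00E5"))
# FULLWIDTH LATIN CAPITAL LETTER A and "a" compare equal.
print(compatibility_caseless_key("\uFF21") == compatibility_caseless_key("a"))
```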

Note: In mathematically oriented programming languages that make distinctive use of the Mathematical Alphanumeric Symbols, such as U+1D400 MATHEMATICAL BOLD CAPITAL A, an application of NFKC must filter characters to exclude characters with the property value Decomposition_Type=Font.

5.1 NFKC Modifications

Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties.

5.1.1 Modifications for Characters that Behave Like Combining Marks

Certain characters are not formally combining characters, although they behave in most respects as if they were. In most cases, the mismatch does not cause a problem, but when these characters have compatibility decompositions, they can cause identifiers not to be closed under Normalization Form KC. In particular, the following four characters are included in XID_Continue and not XID_Start:

U+0E33 THAI CHARACTER SARA AM
U+0EB3 LAO VOWEL SIGN AM
U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

5.1.2 Modifications for Irregularly Decomposing Characters

U+037A GREEK YPOGEGRAMMENI and certain Arabic presentation forms have irregular compatibility decompositions and are excluded from both XID_Start and XID_Continue. It is recommended that all Arabic presentation forms be excluded from identifiers in any event, although only a few of them must be excluded for normalization to guarantee identifier closure.

5.1.3 Identifier Closure Under Normalization

With these amendments to the identifier syntax, all identifiers are closed under all four Normalization Forms. This means that for any string S, the implications shown in Figure 5 hold.

Figure 5. Normalization Closure

isIdentifier(S) → isIdentifier(toNFD(S))
isIdentifier(S) → isIdentifier(toNFC(S))
isIdentifier(S) → isIdentifier(toNFKD(S))
isIdentifier(S) → isIdentifier(toNFKC(S))
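
The closure property can be spot-checked in Python. This is a sketch under one assumption: str.isidentifier(), which is based on XID_Start and XID_Continue (plus the underscore), stands in for isIdentifier. The function name and the sample strings are illustrative only.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def closed_under_normalization(identifier: str) -> bool:
    """Check isIdentifier(S) -> isIdentifier(toNFx(S)) for all four forms."""
    if not identifier.isidentifier():
        return True  # the implication holds vacuously
    return all(
        unicodedata.normalize(form, identifier).isidentifier()
        for form in FORMS
    )


samples = [
    "caf\u00E9",     # precomposed letter
    "\uFB01le",      # LATIN SMALL LIGATURE FI
    "\u0E01\u0E33",  # Thai letter followed by SARA AM
    "\uFF76\uFF9E",  # halfwidth KA followed by halfwidth voiced sound mark
]
print(all(closed_under_normalization(s) for s in samples))
```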

Identifiers are also closed under case operations. For any string S (with exceptions involving a single character), the implications shown in Figure 6 hold.

Figure 6. Case Closure

isIdentifier(S) → isIdentifier(toLowercase(S))
isIdentifier(S) → isIdentifier(toUppercase(S))
isIdentifier(S) → isIdentifier(toFoldedcase(S))

The one exception for casing is U+0345 COMBINING GREEK YPOGEGRAMMENI. In the very unusual case that U+0345 is at the start of S, U+0345 is not in XID_Start, but its uppercase and case-folded versions are. In practice, this is not a problem because of the way normalization is used with identifiers.

The reverse implication is true for canonical equivalence but not true in the case of compatibility equivalence:

Figure 7. Reverse Normalization Closure

`isIdentifier(toNFD(S)) isIdentifier(toNFC(S))`	→ `isIdentifier(S)`
`isIdentifier(toNFKD(S)) isIdentifier(toNFKC(S))`	↛ `isIdentifier(S)`

There are many characters for which the reverse implication is not true for compatibility equivalence, because there are many characters counting as symbols or non-decimal numbers (and thus outside of identifiers) whose compatibility equivalents are letters or decimal numbers and thus in identifiers. Some examples are shown in Table 8.

Table 8. Compatibility Equivalents to Letters or Decimal Numbers

Code Points	GC	Samples	Names
2070	No	⁰	SUPERSCRIPT ZERO
20A8	Sc	₨	RUPEE SIGN
2116	So	№	NUMERO SIGN
2120..2122	So	℠..™	SERVICE MARK..TRADE MARK SIGN
2460..2473	No	①..⑳	CIRCLED DIGIT ONE..CIRCLED NUMBER TWENTY
3300..33A6	So	㌀..㎦	SQUARE APAATO..SQUARE KM CUBED

If an implementation needs to ensure both directions for compatibility equivalence of identifiers, then the identifier definition needs to be tailored to add these characters.
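
The failure of the reverse implication for compatibility equivalence can be observed with a few of the characters from Table 8. In this Python sketch, str.isidentifier() is again assumed as a stand-in for isIdentifier, and the function name is hypothetical.

```python
import unicodedata


def gains_identifier_status(s: str) -> bool:
    """True if S is not an identifier but toNFKC(S) is."""
    normalized = unicodedata.normalize("NFKC", s)
    return (not s.isidentifier()) and normalized.isidentifier()


for sample in ["\u2116", "\u2122", "\u20A8", "x\u2070"]:
    normalized = unicodedata.normalize("NFKC", sample)
    print(repr(sample), "->", repr(normalized), gains_identifier_status(sample))
```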

For canonical equivalence the implication is true in both directions: isIdentifier(toNFC(S)) if and only if isIdentifier(S).

There were two exceptions before Unicode 5.1, as shown in Table 9. If an implementation needs to ensure full canonical equivalence of identifiers, then the identifier definition must be tailored so that these characters have the same value, so that either both isIdentifier(S) and isIdentifier(toNFC(S)) are true, or so that both values are false.

Table 9. Canonical Equivalence Exceptions Prior to Unicode 5.1

isIdentifier(toNFC(S))=True	isIdentifier(S)=False	Different in
02B9 ( ʹ ) MODIFIER LETTER PRIME	0374 ( ʹ ) GREEK NUMERAL SIGN	XID and ID
00B7 ( · ) MIDDLE DOT	0387 ( · ) GREEK ANO TELEIA	XID alone

Those programming languages with case-insensitive identifiers should use the case foldings described in Section 3.13, Default Case Algorithms, of [Unicode] to produce a case-insensitive normalized form.

When source text is parsed for identifiers, the folding of distinctions (using case mapping or NFKC) must be delayed until after parsing has located the identifiers. Thus such folding of distinctions should not be applied to string literals or to comments in program source text.

The Unicode Standard supports case folding with normalization, with the function toNFKC_Casefold(X). See definition R5 in Section 3.13, Default Case Algorithms in [Unicode] for the specification of this function and further explanation of its use.

5.2 Case and Stability

The alphabetic case of the initial character of an identifier is used as a mechanism to distinguish syntactic classes in some languages like Prolog, Erlang, Haskell, Clean, and Go. For example, in Prolog and Erlang, variables must begin with capital letters (or underscores) and atoms must not. There are some complications in the use of this mechanism.

For such a casing distinction in a programming language to work with unicameral writing systems (such as Kanji or Devanagari), another mechanism (such as underscores) needs to substitute for the casing distinction.

Casing stability is also an issue for bicameral writing systems. The assignment of General_Category property values, such as gc=Lu, is not guaranteed to be stable, nor is the assignment of characters to the broader properties such as Uppercase. So these property values cannot be used by themselves, without incorporating a mechanism that preserves backward compatibility, such as is done for Unicode identifiers in Section 2.5, Backward Compatibility. That is, the implementation would maintain its own list of special inclusions and exclusions that require updating for each new version of Unicode.

Alternatively, a programming language specification can use the operation specified in Case Folding Stability as the basis for its casing distinction. That operation is guaranteed to be stable. That is, one can use a casing distinction such as the following:

S is a variable if S begins with an underscore.
Otherwise, produce S' = toCasefold(toNFKC(S))
1. S is a variable if firstCodePoint(S) ≠ firstCodePoint(S'),
2. otherwise S is an atom.

This test can clearly be optimized for the normal cases, such as initial ASCII. It is also recommended that identifiers be in NFKC format, which makes the detection even simpler.
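
The casing distinction above might be sketched in Python as follows. The function name classify is hypothetical, str.casefold() is assumed to stand in for toCasefold, and the identifier is assumed to be non-empty.

```python
import unicodedata


def classify(identifier: str) -> str:
    """Classify a non-empty identifier as "variable" or "atom"."""
    if identifier.startswith("_"):
        return "variable"
    # S' = toCasefold(toNFKC(S))
    folded = unicodedata.normalize("NFKC", identifier).casefold()
    # Compare the first code point of S with the first code point of S'.
    if identifier[0] != folded[0]:
        return "variable"
    return "atom"


print(classify("Foo"))           # variable
print(classify("foo"))           # atom
print(classify("_x"))            # variable
print(classify("\u65E5\u672C"))  # atom (unicameral script, no case change)
```

As the last line shows, an identifier in a unicameral script is always an atom under this test unless it begins with an underscore.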

5.2.1 Edge Cases for Folding

In Unicode 8.0, the Cherokee script letters have been changed from gc=Lo to gc=Lu, and corresponding lowercase letters (gc=Ll) have been added. This is an unusual pattern; typically when case pairs are added, existing letters are changed from gc=Lo to gc=Ll, and new corresponding uppercase letters (gc=Lu) are added. In the case of Cherokee, it was felt that this solution provided the most compatibility for existing implementations in terms of font treatment.

The downside of this approach is that the Cherokee characters, when case-folded, will convert as necessary to the pre-8.0 characters, namely to the uppercase versions. This folding is unlike that of any other case-mapped characters in Unicode. Thus the case-folded version of a Cherokee string will contain uppercase letters instead of lowercase letters. Compatibility with fonts for the current user community was felt to be more important than the confusion introduced by this edge case of case folding, because Cherokee programmatic identifiers would be rare.

The upshot is that when it comes to identifiers, implementations should never use the General_Category or Lowercase or Uppercase properties to test for casing conditions, nor use toUppercase(), toLowercase(), or toTitlecase() to fold or test identifiers. Instead, they should use Case_Folding or NFKC_Casefold.
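
The Cherokee edge case can be seen directly in Python, assuming an interpreter whose character database is at Unicode 8.0 or later. The helper name is hypothetical.

```python
CHEROKEE_A = "\u13A0"        # CHEROKEE LETTER A (gc=Lu since Unicode 8.0)
CHEROKEE_SMALL_A = "\uAB70"  # CHEROKEE SMALL LETTER A (gc=Ll)


def fold_equal(a: str, b: str) -> bool:
    """Compare identifiers by case folding rather than by lowercasing."""
    return a.casefold() == b.casefold()


# Case folding maps the small letter to the uppercase letter.
print(CHEROKEE_SMALL_A.casefold() == CHEROKEE_A)
# Lowercasing goes the other way, so the two operations disagree.
print(CHEROKEE_A.lower() == CHEROKEE_SMALL_A)
print(fold_equal(CHEROKEE_A, CHEROKEE_SMALL_A))
```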

6 Hashtag Identifiers

Hashtag identifiers have become very popular in social media. They consist of a number sign in front of some string of characters, such as #emoji. The actual composition of allowable Unicode hashtag identifiers varies between vendors. It has also become common for hashtags to include emoji characters, without a clear notion of exactly which characters are included.

This section presents a syntax that can be used for parsing Unicode hashtag identifiers for increased interoperability.

UAX31-D2. Default Hashtag Identifier Syntax:

<Hashtag-Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

When parsing hashtags in flowing text, it is recommended that an extended Hashtag only be recognized when there is no Continue character before a Start character. For example, in “abc#def” there would be no hashtag, while there would be in “abc #def” or “abc.#def”.

UAX31-R8. Extended Hashtag Identifiers: To meet this requirement, to determine whether a string is a hashtag identifier an implementation shall choose either UAX31-R8-1 or UAX31-R8-2.

UAX31-R8-1. Use definition UAX31-D2, setting:

Start := [#﹟＃]
- U+0023 NUMBER SIGN
- U+FE5F SMALL NUMBER SIGN
- U+FF03 FULLWIDTH NUMBER SIGN
- (These are # and its compatibility equivalents.)
Medial is currently empty, but can be used for customization.
Continue := XID_Continue, plus Extended_Pictographic, Emoji_Component, and “_”, “-”, “+”, minus Start characters.
- Note the subtraction of # characters.
- This is expressed in UnicodeSet notation as:
  [\p{XID_Continue}\p{Extended_Pictographic}\p{Emoji_Component}[-+_]-[#﹟＃]]

UAX31-R8-2. Declare that it uses a profile of UAX31-R8-1 as in UAX31-R1.
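
The recommendation for flowing text can be sketched in Python. This is not a conformant implementation of UAX31-R8-1: the emoji properties are omitted because the standard library does not expose them, XID_Continue is approximated through str.isidentifier(), Medial is empty, and a bare number sign is skipped although the grammar of UAX31-D2 permits it. All names are hypothetical.

```python
START = frozenset("#\uFE5F\uFF03")


def is_continue(ch: str) -> bool:
    """Approximate Continue: XID_Continue plus "_", "-", "+", minus Start."""
    if ch in START:
        return False
    # Assumption: ("a" + ch).isidentifier() approximates XID_Continue.
    return ch in "-+" or ("a" + ch).isidentifier()


def find_hashtags(text: str) -> list:
    """Return hashtags, ignoring a Start that directly follows a Continue."""
    tags = []
    i = 0
    while i < len(text):
        preceded = i > 0 and is_continue(text[i - 1])
        if text[i] in START and not preceded:
            j = i + 1
            while j < len(text) and is_continue(text[j]):
                j += 1
            if j > i + 1:  # sketch-specific: require one Continue character
                tags.append(text[i:j])
            i = j
        else:
            i += 1
    return tags


print(find_hashtags("abc#def"))   # []
print(find_hashtags("abc #def"))  # ['#def']
print(find_hashtags("abc.#def"))  # ['#def']
```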

The emoji properties are from the corresponding version of [UTS51]. The version of the emoji properties is tied to the version of the Unicode Standard, starting with Version 11.0.

The techniques mentioned in Section 2.5, Backward Compatibility may be used where stability between successive versions is required.

Comparison and matching should be done after converting to NFKC_CF format. Thus #MötleyCrüe should match #MÖTLEYCRÜE and other variants.

Implementations may choose to add characters in Table 3a, Optional Characters for Medial to Medial and Table 3b, Optional Characters for Continue to Continue for better identifiers for natural languages.

7 Standard Profiles

Two standard profiles for default identifiers are provided to cater to common patterns of use observed in programming languages with less restrictive identifier syntaxes, including those that use UAX31-R2 default identifiers: the inclusion of characters suitable for mathematical usage in identifiers, and the inclusion of emoji in identifiers.

These profiles are associated with profiles for requirement UAX31-R3b.

Further, a standard profile is provided to exclude default-ignorable code points from identifiers. Having no visible effect in most contexts, these characters can lead to spoofing issues; see Section 2.3, Layout and Format Control Characters.

For guidance on the applicability of these profiles to programming languages, see Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

7.1 Mathematical Compatibility Notation Profile

The Mathematical Compatibility Notation Profile for default identifiers consists of the addition of the set [:ID_Compat_Math_Start:] to the set Start, and the set [:ID_Compat_Math_Continue:] to the set Continue, in definition UAX31-D1.

Note: The set [:ID_Compat_Math_Start:] comprises ∂, ∇, and their mathematical style variants, as well as ∞. The set [:ID_Compat_Math_Continue:] comprises [:ID_Compat_Math_Start:], as well as subscript and superscript digits and signs with mathematical use.

It is associated with a profile for UAX31-R3b, which consists of removing the characters in [[:Pattern_Syntax:]&[:ID_Compat_Math_Continue:]] from the set of characters with syntactic use (these are the characters ∂, ∇, and ∞).

Note: While supporting these characters is recommended for some computer languages because they can be beneficial in some applications, these characters, like many other characters that are allowed in default identifiers, are discouraged in general use, as they are confusing to most readers. See Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

7.2 Emoji Profile

The Emoji Profile for default identifiers provides for the inclusion of emoji characters and sequences in identifiers. A large subset of emoji are already supported in some programming languages, but this profile provides a mechanism for treating them consistently as part of the lexical structure of a language.

The Emoji Profile for default identifiers consists of:

The addition of the RGI emoji set defined by ED-27 in Unicode Technical Standard #51, “Unicode Emoji” [UTS51] for a given version of Unicode to the sets Start and Continue in definition UAX31-D1.
The removal of the code point U+FE0E VARIATION SELECTOR-15 (the Text Presentation Selector) from the set Continue.

Note: The Emoji Profile requires the use of character sequences, rather than individual code points, in the sets Start and Continue defined by UAX31-D1. When using this profile, U+002A asterisk (*), U+203C double exclamation mark (‼), or U+263A white smiling face (☺) are not legal identifiers, but the sequences (U+002A, U+FE0F, U+20E3) *️⃣, (U+203C, U+FE0F) ‼️, and (U+263A, U+FE0F) ☺️ are allowed in identifiers. This would require some changes to lexers: when they hit a character that starts an emoji sequence they will (logically) switch to a different mechanism for parsing.

The Emoji Profile includes characters that are in Pattern_Syntax; it is therefore associated with a profile for UAX31-R3b, which consists of replacing each emoji character of a certain subset of [:Pattern_Syntax:] by its text presentation sequence (ED-8a):

Remove the characters in the set [[:Pattern_Syntax:]&[:Emoji_Presentation:]] from the set of characters with syntactic use.
For all C in [[:Pattern_Syntax:]&[:Emoji_Presentation:]], add the sequence consisting of C followed by U+FE0E VARIATION SELECTOR-15 (the Text Presentation Selector) to the set of characters with syntactic use.

In addition, in order to avoid lexical ambiguities between identifiers and operators, the Emoji Profile includes a profile for UAX31-R3c, which consists of the removal of the character U+FE0F VARIATION SELECTOR-16 (the Emoji Presentation Selector) from the set Continue.

Example: Consider a language that meets requirements UAX31-R3b and UAX31-R3c with no profile. U+2615 HOT BEVERAGE (☕) is a character with syntactic use, and therefore it is an operator. When meeting these requirements with the Emoji Profile, U+2615 HOT BEVERAGE (☕) is not a character with syntactic use (which allows it to be an identifier character) and ☕ is not a valid operator. However, the sequence U+2615 U+FE0E (☕︎) is added to the set of characters with syntactic use, and therefore ☕︎ is a valid operator.

This change means that if some of the Pattern_Syntax characters with the Emoji_Presentation property were in syntactic use (e.g., in operators) prior to adopting the Emoji Profile, they become identifiers once the profile is adopted, but can be turned back into operators by adding U+FE0E VARIATION SELECTOR-15, allowing for a migration path.

Of course, if a programming language only uses a subset of the Pattern_Syntax characters that does not include these characters, no action needs to be taken.

Some other characters in Pattern_Syntax (such as ↔) are used in emoji (such as ↔️), but they are not emoji on their own, so that they do not need to be removed from the set of characters with syntactic use as long as lexical analysis properly takes sequences into account.

The emoji sequences require 98 default-ignorable characters:

U+200D ZERO WIDTH JOINER (also known as ZWJ)
U+FE0F VARIATION SELECTOR-16 (also known as Emoji Presentation Selector)
U+E0020..U+E007F 96 TAG characters

Thus, if this profile is combined with any profile that removes default-ignorable characters, such as the Default-Ignorable Exclusion Profile, those characters need to be retained in the context of emoji sequences.

Consider the following examples, in a language that meets requirement UAX31-R1 with both the Emoji Profile and the Default-Ignorable Exclusion Profile:

Sequence	Appearance	Legal Identifier?	Reason
A+ZWJ+B	A‍B	No	ZWJ is not part of an emoji sequence
U+1F408 + ZWJ + U+2B1B	🐈‍⬛	Yes	ZWJ is part of an emoji sequence (for black cat)
BIG + U+1F408 + ZWJ + U+2B1B	BIG🐈‍⬛	Yes	ZWJ is part of an emoji sequence (for black cat)
7.3 Default-Ignorable Exclusion Profile

The default-ignorable exclusion profile for default identifiers consists of the exclusion of the code points with property Default_Ignorable_Code_Point from the sets Start and Continue in definition UAX31-D1.

Note: While it reduces the attack surface, excluding default-ignorable code points does not prevent spoofing issues. More comprehensive mechanisms are described in Unicode Technical Standard #39, “Unicode Security Mechanisms” [UTS39]; in particular, the exclusion of default-ignorable code points is part of the General Security Profile for Identifiers.

Note: Where higher level diagnostics are available, such as in programming environments, more targeted measures can be taken in order to still allow for the legitimate use of these characters. See Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

Acknowledgments

Mark Davis is the author of the initial version and has added to and maintained the text of this annex. Robin Leroy has assisted in updating it starting with Version 15.0.

The attendees of the Source Code Working Group meetings assisted with the substantial changes made in Versions 15.0 and 15.1: Peter Constable, Elnar Dakeshov, Mark Davis, Barry Dorrans, Steve Dower, Michael Fanning, Asmus Freytag, Dante Gagne, Rich Gillam, Manish Goregaokar, Tom Honermann, Jan Lahoda, Nathan Lawrence, Robin Leroy, Chris Ries, Markus Scherer, Richard Smith.

Thanks to Eric Muller, Asmus Freytag, Lisa Moore, Julie Allen, Jonathan Warden, Kenneth Whistler, David Corbett, Klaus Hartke, Martin Dürst, Deborah Anderson, Steve Downey, Ned Holbrook, Corentin Jabot, 梁海 Liang Hai, Jens Maurer, and Hubert Tong for feedback on this annex.

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Migration

Version 15.1

Requirement UAX31-R1a Restricted Format Characters has been withdrawn.

If implementations that claimed conformance to UAX31-R1a wish to retain the contextual checks for ZWJ and ZWNJ, they should refer to the General Security Profile in Unicode Technical Standard #39, “Unicode Security Mechanisms” [UTS39].

In previous versions, requirement UAX31-R3 Pattern_White_Space and Pattern_Syntax Characters did not require any particular interpretation of whitespace characters. It now specifies which characters are to be treated as line terminators, horizontal space, and ignorable format controls. The meaning of syntactic use has also been clarified.

Implementations that claim conformance to UAX31-R3 should check that they interpret the characters in Pattern_White_Space as described in UAX31-R3a Pattern_White_Space Characters, and that their use of Pattern_Syntax characters is consistent with UAX31-R3b Pattern_Syntax Characters.

Version 15.0

In previous versions, the note explaining how to implement requirement UAX31-R7 Filtered Case-Insensitive Identifiers with full case folding referred to the wrong property, and the requirement itself incorrectly referred to Normalization Form rather than case folded form.

Implementations that claim conformance to UAX31-R7 should check that they use the correct property.

Version 13.0

Version 13.0 changed the structure of Table 4, Excluded Scripts significantly, dropping conditions that were not based on script. Implementations that were based on Table 4 should refer to UTS #39, Unicode Security Mechanisms [UTS39] for additional restrictions.

Version 11.0

Version 11.0 refines the use of ZWJ in identifiers (adding some restrictions and relaxing others slightly), and broadens the definition of hashtag identifiers somewhat. For details, see the Modifications.

Version 9.0

In previous versions, the text favored the use of XID_Start and XID_Continue, as in the following paragraph. However, the formal definition used ID_Start and ID_Continue.

The XID_Start and XID_Continue properties are improved lexical classes that incorporate the changes described in Section 5.1, NFKC Modifications. They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties.

In version 9.0, that is swapped and the X versions are stated explicitly in the formal definition. This affects just the following characters.

037A ; GREEK YPOGEGRAMMENI
0E33 ; THAI CHARACTER SARA AM
0EB3 ; LAO VOWEL SIGN AM
309B ; KATAKANA-HIRAGANA VOICED SOUND MARK
309C ; KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E..FC63 ; ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA ; ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB ; ARABIC LIGATURE JALLAJALALOUHOU
FE70 ; ARABIC FATHATAN ISOLATED FORM
FE72 ; ARABIC DAMMATAN ISOLATED FORM
FE74 ; ARABIC KASRATAN ISOLATED FORM
FE76 ; ARABIC FATHA ISOLATED FORM
FE78 ; ARABIC DAMMA ISOLATED FORM
FE7A ; ARABIC KASRA ISOLATED FORM
FE7C ; ARABIC SHADDA ISOLATED FORM
FE7E ; ARABIC SUKUN ISOLATED FORM
FF9E ; HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F ; HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

Implementations that wish to maintain conformance to the older recommendation need only declare a profile that uses ID_Start and ID_Continue instead of XID_Start and XID_Continue.

Version 9.0 splits the older Table 3 from Version 8.0 into 3 parts.

Current Tables	Unicode 8.0
Table 3, Optional Characters for Start	Table 3, Candidate Characters for Inclusion in ID_Continue
Table 3a, Optional Characters for Medial
Table 3b, Optional Characters for Continue	only outlined in text

Version 6.1

Between Unicode Versions 5.2, 6.0 and 6.1, Table 5 was split in three. In Version 6.1, the resulting tables were renumbered for easier reference. The titles and links remain the same, for stability.

The following shows the correspondences:

Current Tables	Unicode 6.0	Unicode 5.2
Table 5, Recommended Scripts	5a	5
Table 6, Aspirational Use Scripts	5a
Table 7, Limited Use Scripts	5b
Table 8, Compatibility Equivalents to Letters or Decimal Numbers	6	6
Table 9, Canonical Equivalence Exceptions Prior to Unicode 5.1	7	7

Modifications

The following summarizes modifications from the previously published version of this annex.

Revision 41

Reissued for Unicode 16.0.
Section 2.3, Layout and Format Control Characters, and Section 5, Normalization and Case: clarified that NFD must be applied before toNFKC_Casefold in order to correctly meet requirements UAX31-R4 and UAX31-R5 with NFKC and full case folding, and added a reference to definition D147 of the Unicode Standard.
Section 2.4, Specific Character Adjustments: removed the suggestion to add MIDDLE DOT as part of a profile, because it is already part of default identifiers with no profile since Unicode Version 5.1.
Section 2.5, Backward Compatibility: corrected the inappropriately-normalized occurrence of U+0387 in the list of Other_ID_Continue characters.

Modifications for previous versions are listed in those respective versions.

© 2005–2024 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.

Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.
