Unicode® Standard Annex #31

Unicode Identifier and Pattern Syntax

Version	Unicode 8.0.0
Editors	Mark Davis (markdavis@google.com)
Date	2015-06-01
This Version	http://www.unicode.org/reports/tr31/tr31-23.html
Previous Version	http://www.unicode.org/reports/tr31/tr31-21.html
Latest Version	http://www.unicode.org/reports/tr31/
Latest Proposed Update	http://www.unicode.org/reports/tr31/proposed.html
Revision	23

Summary

This annex describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax. It also supplies guidelines for use of normalization with identifiers.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].

1 Introduction

A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers, such as programming language variables or domain names. To assist in the standard treatment of identifiers in Unicode character-based parsers and lexical analyzers, a set of specifications is provided here as a recommended default for the definition of identifier syntax.

These guidelines follow the typical pattern of identifier syntax rules in common programming languages, by defining an ID_Start class and an ID_Continue class and using a simple BNF rule for identifiers based on those classes; however, the composition of those classes is more complex and contains additional types of characters, due to the universal scope of the Unicode Standard.

This annex also provides guidelines for the use of normalization and case insensitivity with identifiers, expanding on a section that was originally in Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15].

The specification in this annex provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also allows any appropriate new Unicode characters to become available in identifiers. In addition, Unicode character properties for stable pattern syntax are provided. The resulting pattern syntax is backward compatible and forward compatible over future versions of the Unicode Standard. These properties can either be used alone or in conjunction with the identifier characters.

Figure 1 shows the disjoint categories of code points defined in this annex. (The sizes of the boxes are not to scale.)

Figure 1. Code Point Categories for Identifier Parsing

[Figure: boxes for the disjoint categories Pattern_Syntax Characters, Pattern_White_Space Characters, ID_Start Characters, ID_Nonstart Characters, Other Assigned Code Points, and Unassigned Code Points]

The set consisting of the union of ID_Start and ID_Nonstart characters is known as Identifier Characters and has the property ID_Continue. The ID_Nonstart set is defined as the set difference ID_Continue minus ID_Start: it is not a formal Unicode property. While lexical rules are traditionally expressed in terms of the latter, the discussion here is simplified by referring to disjoint categories.

1.1 Stability

There are certain features that developers can depend on for stability:

Identifier characters, Pattern_Syntax characters, and Pattern_White_Space are disjoint: they will never overlap.
The Identifier characters are always a superset of the ID_Start characters.
The Pattern_Syntax characters and Pattern_White_Space characters are immutable and will not change over successive versions of Unicode.
The ID_Start and ID_Nonstart characters may grow over time, either by the addition of new characters provided in a future version of Unicode or (in rare cases) by the addition of characters that were in Other.

In successive versions of Unicode, the only allowed changes of characters from one of the above classes to another are those listed with a plus sign (+) in Table 1.

Table 1. Permitted Changes in Future Versions

	ID_Start	ID_Nonstart	Other Assigned
Unassigned
Other Assigned
ID_Nonstart

The Unicode Consortium has formally adopted a stability policy on identifiers. For more information, see [Stability].

1.2 Customization

Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters such as $, @, #, and _ in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, implementers may combine those specific rules with the syntax and properties provided here.

Each programming language can define its identifier syntax as relative to the Unicode identifier syntax, such as saying that identifiers are defined by the Unicode properties, with the addition of “$”. By addition or subtraction of a small set of language-specific characters, a programming language standard can easily track a growing repertoire of Unicode characters in a compatible way. See also Section 2.5, Backward Compatibility.

Similarly, each programming language can define its own whitespace characters or syntax characters relative to the Unicode Pattern_White_Space or Pattern_Syntax characters, with some specified set of additions or subtractions.

Systems that want to extend identifiers to encompass words used in natural languages, or narrow identifiers for security, may do so as described in Section 2.3, Layout and Format Control Characters, Section 2.4, Specific Character Adjustments, and Section 5, Normalization and Case.

To preserve the disjoint nature of the categories illustrated in Figure 1, any character added to one of the categories must be subtracted from the others.
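As a rough illustration of defining an identifier syntax relative to the Unicode properties, the following Python sketch builds a profile by adding characters to, or removing characters from, the default classes. It leans on Python's own `str.isidentifier()`, which the Python language reference describes as using XID_Start and XID_Continue with `_` added as a start character. The `IdentifierProfile` class, its field names, and the two helper functions are hypothetical names introduced only for this example, and the results depend on the Unicode version bundled with the interpreter.

```python
from dataclasses import dataclass


def is_xid_start(ch: str) -> bool:
    # str.isidentifier() also accepts "_" as a start character; undo that tailoring.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    # Prefix a known start character so that only the "continue" check applies to ch.
    return ("a" + ch).isidentifier()


@dataclass(frozen=True)
class IdentifierProfile:
    """Hypothetical profile: the Unicode defaults plus or minus a few characters."""

    added_start: frozenset = frozenset()
    added_continue: frozenset = frozenset()
    removed: frozenset = frozenset()

    def _start(self, ch: str) -> bool:
        if ch in self.removed:
            return False
        return ch in self.added_start or is_xid_start(ch)

    def _continue(self, ch: str) -> bool:
        if ch in self.removed:
            return False
        # Anything allowed at the start is also allowed later in the identifier.
        return (
            ch in self.added_start
            or ch in self.added_continue
            or is_xid_continue(ch)
        )

    def is_identifier(self, text: str) -> bool:
        return (
            bool(text)
            and self._start(text[0])
            and all(self._continue(ch) for ch in text[1:])
        )


# "Identifiers are defined by the Unicode properties, with the addition of '$'."
dollar_profile = IdentifierProfile(added_start=frozenset("$"))
unicode_default = IdentifierProfile()
```

A complete profile along these lines would presumably also subtract "$" from whatever set of syntax characters the language uses, so that the categories stay disjoint.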

Note: In many cases there are important security implications that may require additional constraints on identifiers. For more information, see [UTR36].

1.3 Display Format

Implementations may use a format for displaying identifiers that differs from the internal form used to compare identifiers. For example, an implementation might display what the user has entered, but use a normalized format for comparison. Examples of this include:

Case. The display format retains case differences, but the comparison format erases them by using Case_Folding. Thus “A” and its lowercase variant “a” would be treated as the same identifier internally, even though they may have been input differently and may display differently.
Variants. The display format retains variant distinctions, such as halfwidth versus fullwidth forms, or between variation sequences and their base characters, but the comparison format erases them by using NFKC_Case_Folding. Thus “A” and its full-width variant “Ａ” would be treated as the same identifier internally, even though they may have been input differently and may display differently.

For an example of the use of display versus comparison formats see UTS #46: Unicode IDNA Compatibility Processing [UTS46]. For more information about normalization and case in identifiers see Section 5, Normalization and Case.
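The split between a display format and a comparison format can be sketched in Python as follows. This is only an approximation under stated assumptions: `comparison_key` combines NFKC normalization with `str.casefold()`, which is close to, but not the same as, the NFKC_Casefold mapping (in particular it does not remove default-ignorable code points). The `SymbolTable` class and both function names are hypothetical.

```python
import unicodedata


def comparison_key(identifier: str) -> str:
    """Approximate comparison form: normalize, case-fold, then normalize again.

    Assumption: unlike the real NFKC_Casefold mapping, this does not remove
    default-ignorable code points.
    """
    folded = unicodedata.normalize("NFKC", identifier).casefold()
    return unicodedata.normalize("NFKC", folded)


class SymbolTable:
    """Hypothetical table that compares by key but remembers the display form."""

    def __init__(self) -> None:
        self._display = {}

    def declare(self, identifier: str) -> str:
        # The first spelling seen is kept for display; later variants map to it.
        return self._display.setdefault(comparison_key(identifier), identifier)


table = SymbolTable()
# "\uFF21" is U+FF21 FULLWIDTH LATIN CAPITAL LETTER A.
first = table.declare("\uFF21ccount")
second = table.declare("account")
```

Here both spellings resolve to the same entry, and the table hands back the spelling that was entered first as the form to display.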

1.4 Conformance

The following describes the possible ways that an implementation can claim conformance to this specification.

UAX31-C1. An implementation claiming conformance to this specification shall identify the version of this specification.

UAX31-C2. An implementation claiming conformance to this specification shall describe which of the following requirements it observes:

2 Default Identifier Syntax

The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters.

UAX31-D1. Default Identifier Syntax:

<identifier> := <ID_Start><ID_Continue>*

Identifiers are defined by the sets of lexical classes defined as properties in the Unicode Character Database. These properties are shown in Table 2.

Table 2. Lexical Classes for Identifiers

Properties	Alternates	General Description of Coverage
`ID_Start`	`XID_Start`	Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), plus Other_ID_Start (stability extensions), minus Pattern_Syntax and Pattern_White_Space code points. Note that “other letters” includes ideographs. For more about the stability extensions, see Section 2.5, Backward Compatibility. In set notation: [[:L:][:Nl:][:Other_ID_Start:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]
`ID_Continue`	`XID_Continue`	All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), plus Other_ID_Continue (stability extensions), minus Pattern_Syntax and Pattern_White_Space code points. For more about the stability extensions, see Section 2.5, Backward Compatibility. In set notation: [[:ID_Start:][:Mn:][:Mc:][:Nd:][:Pc:][:Other_ID_Continue:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]] These are also known simply as Identifier Characters, because they are a superset of the `ID_Start` characters.
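To make definition D1 and the property descriptions in Table 2 more concrete, here is a minimal Python sketch that approximates ID_Start and ID_Continue from General_Category values. It is not a conformant implementation: the Other_ID_Start and Other_ID_Continue sets contain only the example characters listed in Section 2.5, the subtraction of Pattern_Syntax and Pattern_White_Space is omitted because the standard library does not expose those properties, and all function and constant names are hypothetical.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

# Assumption: only the example characters named in Section 2.5,
# not the full Other_ID_Start / Other_ID_Continue properties.
OTHER_ID_START = {"\u2118", "\u212E", "\u309B", "\u309C"}
OTHER_ID_CONTINUE = {"\u00B7", "\u0387", "\u19DA"} | {
    chr(cp) for cp in range(0x1369, 0x1372)  # ETHIOPIC DIGIT ONE..NINE
}


def is_id_start(ch: str) -> bool:
    # Approximation: Pattern_Syntax / Pattern_White_Space are not subtracted.
    return unicodedata.category(ch) in START_CATEGORIES or ch in OTHER_ID_START


def is_id_continue(ch: str) -> bool:
    return (
        is_id_start(ch)
        or unicodedata.category(ch) in CONTINUE_CATEGORIES
        or ch in OTHER_ID_CONTINUE
    )


def is_default_identifier(text: str) -> bool:
    """<identifier> := <ID_Start> <ID_Continue>*"""
    return (
        bool(text)
        and is_id_start(text[0])
        and all(is_id_continue(ch) for ch in text[1:])
    )
```

Note that under these defaults the underscore (General_Category Pc) can continue an identifier but cannot start one; languages that allow a leading underscore do so through a profile.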

The innovations in the identifier syntax to cover the Unicode Standard include the following:

Incorporation of proper handling of combining marks.
Allowance for layout and format control characters, which should be ignored when parsing identifiers.

The XID_Start and XID_Continue properties are improved lexical classes that incorporate the changes described in Section 5.1, NFKC Modifications. They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties.

UAX31-R1. Default Identifiers: To meet this requirement, an implementation shall use definition D1 and the properties ID_Start and ID_Continue (or XID_Start and XID_Continue) to determine whether a string is an identifier. Alternatively, it shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the above properties and/or provide a list of additional constraints on identifiers.

UAX31-R1a. Restricted Format Characters: To meet this requirement, an implementation shall define a profile for R1 which allows format characters as described in Section 2.3, Layout and Format Control Characters. An implementation may further restrict the context for ZWJ or ZWNJ, such as by limiting the scripts, if a clear specification for such a further restriction is supplied.

UAX31-R1b. Stable Identifiers: To meet this requirement, an implementation shall guarantee that identifiers are stable across versions of the Unicode Standard: that is, once a string qualifies as an identifier, it does so in all future versions.

Note: The R1b requirement is typically achieved by using grandfathered characters. See Section 2.5, Backward Compatibility.

2.1 Combining Marks

Combining marks are accounted for in identifier syntax: a composed character sequence consisting of a base character followed by any number of combining marks is valid in an identifier. Combining marks are required in the representation of many languages, and the conformance rules in Chapter 3, Conformance, of [Unicode] require the interpretation of canonical-equivalent character sequences. The simplest way to do this is to require identifiers in the NFC format (or transform them into that format); see Section 5, Normalization and Case.

Enclosing combining marks (such as U+20DD..U+20E0) are excluded from the definition of the lexical class ID_Continue, because the composite characters that result from their composition with letters are themselves not normally considered valid constituents of these identifiers.

2.2 Modifier Letters

Modifier letters (General_Category=Lm) are also included in the definition of the syntax classes for identifiers. Modifier letters are often part of natural language orthographies and are useful for making word-like identifiers in formal languages. On the other hand, modifier symbols (General_Category=Sk), which are seldom a part of language orthographies, are excluded from identifiers. For more discussion of modifier letters and how they function, see [Unicode].

Implementations that tailor identifier syntax for special purposes may wish to take special note of modifier letters, as in some cases modifier letters have appearances, such as raised commas, which may be confused with common syntax characters such as quotation marks.

2.3 Layout and Format Control Characters

Certain Unicode characters are known as Default_Ignorable_Code_Points. These include variation selectors and characters used to control joining behavior, bidirectional ordering control, and alternative formats for display (having the General_Category value of Cf). The recommendation is to permit them in identifiers only in special cases, listed below. The use of default-ignorable characters in identifiers is problematical, first because the effects they represent are stylistic or otherwise out of scope for identifiers, and second because the characters themselves often have no visible display. It is also possible to misapply these characters such that users can create strings that look the same but actually contain different characters, which can create security problems. In such environments, identifiers should also be limited to characters that are case-folded and normalized with the NFKC_Casefold operation. For more information, see Section 5, Normalization and Case and UTR #36: Unicode Security Considerations [UTR36].

Variation selectors, in particular, including standardized variants and sequences from the Ideographic Variation Database, are not included in the default identifier syntax. These are subject to the same considerations as for other Default_Ignorable_Code_Points listed above. Because variation selectors request a difference in display but do not guarantee it, they do not work well in general-purpose identifiers. The NFKC_Casefold operation can be used to remove them, along with other Default_Ignorable_Code_Points. However, in some environments it may be useful to retain variation sequences in the display form for identifiers. For more information, see Section 1.3, Display Format.

For the above reasons, default-ignorable characters are normally excluded from Unicode identifiers. However, visible distinctions created by certain format characters (particularly the Join_Control characters) are necessary in certain languages. A blanket exclusion of these characters makes it impossible to create identifiers with the correct visual appearance for common words or phrases in those languages.

Identifier systems that attempt to provide more natural representations of terms in "modern, customary usage" should allow these characters in input and display, but limit them to contexts in which they are necessary. The term modern customary usage includes characters that are in common use in newspapers, journals, lay publications; on street signs; in commercial signage; and as part of common geographic names and company names, and so on. It does not include technical or academic usage such as in mathematical expressions, using archaic scripts or words, or pedagogical use (such as illustration of half-forms or joining forms in isolation), or liturgical use.

The goals for such a restriction of format characters to particular contexts are to:

Allow the use of these characters where required in normal text
Exclude as many cases as possible where no visible distinction results
Be simple enough to be easily implemented with standard mechanisms such as regular expressions

Thus in such circumstances, an implementation should allow the following Join_Control characters in the limited contexts specified in A1, A2, and B below.

U+200C ZERO WIDTH NON-JOINER (ZWNJ)
U+200D ZERO WIDTH JOINER (ZWJ)

There are also two global conditions incorporated in each of A1, A2, and B:

Script Restriction. In each of the following cases, the specified sequence must only consist of characters from a single script (after ignoring Common and Inherited script characters).
Normalization. In each of the following cases, the specified sequence must be in NFC format. (To test an identifier that is not required to be in NFC, first transform into NFC format and then test the condition.)

A1. Allow ZWNJ in the following context:

Breaking a cursive connection. That is, in the context based on the Joining_Type property, consisting of:

A Left-Joining or Dual-Joining character, followed by zero or more Transparent characters, followed by a ZWNJ, followed by zero or more Transparent characters, followed by a Right-Joining or Dual-Joining character

This corresponds to the following regular expression (in Perl-style syntax): /$LJ $T* ZWNJ $T* $RJ/
where:

$T = [:Joining_Type=Transparent:]
$RJ = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
$LJ = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]

For example, consider Farsi <Noon, Alef, Meem, Heh, Alef, Farsi Yeh>. Without a ZWNJ, it translates to "names", as shown in the first row; with a ZWNJ between Heh and Alef, it means "a letter", as shown in the second row of Figure 2.

Figure 2. Persian Example with ZWNJ

Appearance	Code Points	Abbreviated Names
	0646 + 0627 + 0645 + 0647 + 0627 + 06CC	NOON + ALEF + MEEM + HEH + ALEF + FARSI YEH
	0646 + 0627 + 0645 + 0647 + 200C + 0627 + 06CC	NOON + ALEF + MEEM + HEH + ZWNJ + ALEF + FARSI YEH
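The A1 context (a Left-Joining or Dual-Joining character, optional Transparent characters, ZWNJ, optional Transparent characters, then a Right-Joining or Dual-Joining character) can be checked without a regular expression engine that knows Joining_Type. The Python sketch below is only illustrative: the standard library does not expose Joining_Type, so `JOINING_TYPE` is a hand-picked, hypothetical excerpt covering just the letters of the Persian example, Transparent is approximated by the General_Category values Mn and Me, and the two global conditions (single script, NFC) are not checked.

```python
import unicodedata

ZWNJ = "\u200C"

# Hypothetical excerpt for the example word only; a real implementation
# would load Joining_Type from the Unicode Character Database.
JOINING_TYPE = {
    "\u0646": "D",  # ARABIC LETTER NOON (Dual_Joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (Right_Joining)
    "\u0645": "D",  # ARABIC LETTER MEEM
    "\u0647": "D",  # ARABIC LETTER HEH
    "\u06CC": "D",  # ARABIC LETTER FARSI YEH
}


def joining_type(ch: str) -> str:
    if ch in JOINING_TYPE:
        return JOINING_TYPE[ch]
    # Assumption: treat nonspacing and enclosing marks as Transparent.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "T"
    return "U"  # Non_Joining


def zwnj_allowed_a1(text: str, index: int) -> bool:
    """Return True if the ZWNJ at text[index] sits in the A1 context."""
    if text[index] != ZWNJ:
        raise ValueError("index must point at a ZWNJ")
    left = index - 1
    while left >= 0 and joining_type(text[left]) == "T":
        left -= 1  # skip Transparent characters before the ZWNJ
    if left < 0 or joining_type(text[left]) not in ("D", "L"):
        return False
    right = index + 1
    while right < len(text) and joining_type(text[right]) == "T":
        right += 1  # skip Transparent characters after the ZWNJ
    return right < len(text) and joining_type(text[right]) in ("D", "R")


# Second row of Figure 2: NOON ALEF MEEM HEH ZWNJ ALEF FARSI YEH
word = "\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"
```

With this table, the ZWNJ in `word` is accepted because it separates HEH (Dual_Joining) from ALEF (Right_Joining), whereas a ZWNJ placed after ALEF would be rejected because ALEF does not join to the following letter anyway.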

A2. Allow ZWNJ in the following context:

In a conjunct context. That is, a sequence of the form:

A Letter, followed by a Virama, followed by a ZWNJ

This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWNJ/
where:

$L = [:General_Category=Letter:]
$V = [:Canonical_Combining_Class=Virama:]

For example, the Malayalam word for eyewitness is shown in Figure 3. The form without the ZWNJ in the second row is incorrect in this case.

Figure 3. Malayalam Example with ZWNJ

Appearance	Code Points	Abbreviated Names
	0D26 + 0D43 + 0D15 + 0D4D + 200C + 0D38 + 0D3E + 0D15 + 0D4D + 0D37 + 0D3F	DA + VOWEL SIGN VOCALIC R + KA + VIRAMA + ZWNJ + SA + VOWEL SIGN AA + KA + VIRAMA + SSA + VOWEL SIGN I
	0D26 + 0D43 + 0D15 + 0D4D + 0D38 + 0D3E + 0D15 + 0D4D + 0D37 + 0D3F	DA + VOWEL SIGN VOCALIC R + KA + VIRAMA + SA + VOWEL SIGN AA + KA + VIRAMA + SSA + VOWEL SIGN I

B. Allow ZWJ in the following context:

In a conjunct context. That is, a sequence of the form:

A Letter, followed by a Virama, followed by a ZWJ

This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWJ/
where:

$L = [:General_Category=Letter:]
$V = [:Canonical_Combining_Class=Virama:]

For example, the Sinhala word for the country 'Sri Lanka' is shown in the first row of Figure 4, which uses both a space character and a ZWJ. Removing the space results in the text shown in the second row of Figure 4, which is still legible, but removing the ZWJ completely modifies the appearance of the 'Sri' cluster and results in the unacceptable text appearance shown in the third row of Figure 4.

Figure 4. Sinhala Example with ZWJ

Appearance	Code Points	Abbreviated Names
	0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0020 + 0DBD + 0D82 + 0D9A + 0DCF	SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + SPACE + LA + ANUSVARA + KA + VOWEL SIGN AA
	0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF	SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA
	0DC1 + 0DCA + 0DBB + 0DD3 + 0020 + 0DBD + 0D82 + 0D9A + 0DCF	SHA + VIRAMA + RA + VOWEL SIGN II + SPACE + LA + ANUSVARA + KA + VOWEL SIGN AA
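The conjunct contexts in A2 and B share the same shape (a Letter, then a Virama, then the join control), so one small check can cover both. The following Python sketch uses `unicodedata.category` for General_Category=Letter and `unicodedata.combining`, which returns the Canonical_Combining_Class, where the value 9 is the class named Virama. The function name is hypothetical, and the script-restriction and NFC conditions are assumed to have been handled by the caller.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named Virama


def join_control_allowed_in_conjunct(text: str, index: int) -> bool:
    """Check the A2/B context: a Letter, a Virama, then the ZWNJ or ZWJ at index."""
    if text[index] not in (ZWNJ, ZWJ):
        raise ValueError("index must point at a ZWNJ or ZWJ")
    if index < 2:
        return False  # not enough preceding characters for Letter + Virama
    letter, virama = text[index - 2], text[index - 1]
    return (
        unicodedata.category(letter).startswith("L")
        and unicodedata.combining(virama) == VIRAMA_CCC
    )


# First row of Figure 3 (Malayalam, ZWNJ at index 4).
malayalam = "\u0D26\u0D43\u0D15\u0D4D\u200C\u0D38\u0D3E\u0D15\u0D4D\u0D37\u0D3F"
# Second row of Figure 4 (Sinhala, ZWJ at index 2).
sinhala = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3\u0DBD\u0D82\u0D9A\u0DCF"
```

Because the check only looks at the two characters before the join control, it costs very little even when applied to every ZWNJ or ZWJ in an identifier.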

2.3.1 Limitations

While the restrictions in A1, A2, and B greatly limit visual confusability, they do not prevent it. For example, because Tamil only uses a Join_Control character in one specific case, most of the sequences these rules allow in Tamil are, in fact, visually confusable. Therefore, based on their knowledge of the script concerned, implementations may choose to have tighter restrictions than specified above. There are also cases where a joiner preceding a virama makes a visual distinction in some scripts. It is currently unclear whether this distinction is important enough in identifiers to warrant retention of a joiner. For more information, see UTR #36: Unicode Security Considerations [UTR36].

Performance. Parsing identifiers can be a performance-sensitive task. However, these characters are quite rare in practice, thus the regular expressions (or equivalent processing) only rarely would need to be invoked. Thus these tests should not add any significant performance cost overall.

Comparison. Typically the identifiers with and without these characters should compare as equivalent, to prevent security issues. See Section 2.4, Specific Character Adjustments.

2.4 Specific Character Adjustments

Specific identifier syntaxes can be treated as tailorings (or profiles) of the generic syntax based on character properties. For example, SQL identifiers allow an underscore as an identifier continue, but not as an identifier start; C identifiers allow an underscore as either an identifier continue or an identifier start. Specific languages may also want to exclude the characters that have a Decomposition_Type other than Canonical or None, or to exclude some subset of those, such as those with a Decomposition_Type equal to Font.

There are circumstances in which identifiers are expected to more fully encompass words or phrases used in natural languages. For example, it is recommended that U+00B7 (·) MIDDLE DOT be allowed in medial positions in natural-language identifiers such as hashtags or search terms, because it is required for grammatical Catalan. For more natural-language identifiers, a profile should allow the characters in Table 3 in identifiers, unless there are compelling reasons not to.

For related issues about MIDDLE DOT, see Section 5, Normalization and Case. Note in particular that U+0387 ( · ) GREEK ANO TELEIA has a singleton decomposition to U+00B7 ( · ) MIDDLE DOT. Because MIDDLE DOT is not needed as a trailing character in Catalan, some implementations may restrict the use of U+00B7 ( · ) MIDDLE DOT to only the interior of identifiers to prevent a trailing U+0387 / U+00B7 from being included in identifiers. Thus the syntax would look like the following (where <ID_Internal> includes MIDDLE DOT):

<identifier> := <ID_Start><ID_Continue>* (<ID_Internal> <ID_Continue>+)*
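A minimal Python sketch of this medial-only rule follows, assuming a profile in which MIDDLE DOT is the sole member of ID_Internal and is not treated as an ID_Continue character. The start and continue classes are approximated from General_Category alone (no Other_ID_Start, Other_ID_Continue, or Pattern_Syntax handling), and the function names are hypothetical. The input is first converted to NFC so that U+0387 GREEK ANO TELEIA is handled identically to U+00B7.

```python
import unicodedata

MIDDLE_DOT = "\u00B7"
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_start(ch: str) -> bool:
    return unicodedata.category(ch) in START_CATEGORIES


def is_continue(ch: str) -> bool:
    # MIDDLE DOT has General_Category Po, so it is not matched here.
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_identifier_with_internal_dot(text: str) -> bool:
    """<ID_Start> <ID_Continue>* (<ID_Internal> <ID_Continue>+)*"""
    # NFC maps U+0387 to U+00B7 (singleton decomposition).
    text = unicodedata.normalize("NFC", text)
    head, *rest = text.split(MIDDLE_DOT)
    if not head or not is_start(head[0]):
        return False
    if not all(is_continue(ch) for ch in head[1:]):
        return False
    # Each dot must be followed by at least one ID_Continue character.
    return all(seg and all(is_continue(ch) for ch in seg) for seg in rest)
```

Splitting on the dot makes the "at least one continue character after each internal character" requirement explicit: an empty segment means a leading, trailing, or doubled dot.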

In some environments even spaces are allowed in identifiers, such as in SQL: SELECT * FROM Employee Pension.

Table 3. Candidate Characters for Inclusion in ID_Continue

Code Point	Glyph	Name
0027	'	APOSTROPHE
002D	-	HYPHEN-MINUS
002E	.	FULL STOP
003A	:	COLON
00B7	·	MIDDLE DOT
058A	֊	ARMENIAN HYPHEN
05F3	׳	HEBREW PUNCTUATION GERESH
05F4	״	HEBREW PUNCTUATION GERSHAYIM
0F0B	་	TIBETAN MARK INTERSYLLABIC TSHEG
200C		ZERO WIDTH NON-JOINER*
200D		ZERO WIDTH JOINER*
2010	‐	HYPHEN
2019	’	RIGHT SINGLE QUOTATION MARK
2027	‧	HYPHENATION POINT
30A0	゠	KATAKANA-HIRAGANA DOUBLE HYPHEN
30FB	・	KATAKANA MIDDLE DOT

The characters marked with an asterisk in Table 3 are Join_Control characters, discussed in Section 2.3, Layout and Format Control Characters.

In UnicodeSet syntax, the characters in Table 3 are:

[\u0027\u002D\u002E\u003A\u00B7\u058A\u05F3\u05F4\u0F0B\u200C\u200D\u2010\u2019\u2027\u30A0\u30FB]

In identifiers that allow for unnormalized characters, the compatibility equivalents of the characters listed in Table 3 may also be appropriate.

For more information on characters that may occur in words, and those that may be used in name validation, see Section 4, Word Boundaries, in [UAX29].

Some characters are not in modern customary use, and thus implementations may want to exclude them from identifiers. These include characters in historic and obsolete scripts, scripts used mostly liturgically, and regional scripts used only in very small communities or with very limited current usage. The set of characters in Table 4, Candidate Characters for Exclusion from Identifiers provides candidates of these, plus some inappropriate technical blocks.

Table 4. Candidate Characters for Exclusion from Identifiers

Property Notation	Description
`[:script=Aghb:]`	Caucasian Albanian
`[:script=Ahom:]`	Ahom
`[:script=Armi:]`	Imperial Aramaic
`[:script=Avst:]`	Avestan
`[:script=Bass:]`	Bassa Vah
`[:script=Brah:]`	Brahmi
`[:script=Bugi:]`	Buginese
`[:script=Buhd:]`	Buhid
`[:script=Cari:]`	Carian
`[:script=Copt:]`	Coptic
`[:script=Cprt:]`	Cypriot
`[:script=Dsrt:]`	Deseret
`[:script=Dupl:]`	Duployan
`[:script=Egyp:]`	Egyptian Hieroglyphs
`[:script=Elba:]`	Elbasan
`[:script=Glag:]`	Glagolitic
`[:script=Goth:]`	Gothic
`[:script=Gran:]`	Grantha
`[:script=Hano:]`	Hanunoo
`[:script=Hatr:]`	Hatran
`[:script=Hluw:]`	Anatolian Hieroglyphs
`[:script=Hmng:]`	Pahawh Hmong
`[:script=Hung:]`	Old Hungarian
`[:script=Ital:]`	Old Italic
`[:script=Khar:]`	Kharoshthi
`[:script=Khoj:]`	Khojki
`[:script=Kthi:]`	Kaithi
`[:script=Lina:]`	Linear A
`[:script=Linb:]`	Linear B
`[:script=Lyci:]`	Lycian
`[:script=Lydi:]`	Lydian
`[:script=Mahj:]`	Mahajani
`[:script=Mani:]`	Manichaean
`[:script=Mend:]`	Mende Kikakui
`[:script=Mero:]`	Meroitic Hieroglyphs
`[:script=Merc:]`	Meroitic Cursive
`[:script=Modi:]`	Modi
`[:script=Mroo:]`	Mro
`[:script=Mult:]`	Multani
`[:script=Narb:]`	Old North Arabian
`[:script=Nbat:]`	Nabataean
`[:script=Ogam:]`	Ogham
`[:script=Orkh:]`	Old Turkic
`[:script=Osma:]`	Osmanya
`[:script=Palm:]`	Palmyrene
`[:script=Pauc:]`	Pau Cin Hau
`[:script=Perm:]`	Old Permic
`[:script=Phag:]`	Phags Pa
`[:script=Phlp:]`	Psalter Pahlavi
`[:script=Phli:]`	Inscriptional Pahlavi
`[:script=Phnx:]`	Phoenician
`[:script=Prti:]`	Inscriptional Parthian
`[:script=Rjng:]`	Rejang
`[:script=Runr:]`	Runic
`[:script=Samr:]`	Samaritan
`[:script=Sarb:]`	Old South Arabian
`[:script=Sgnw:]`	SignWriting
`[:script=Shrd:]`	Sharada
`[:script=Shaw:]`	Shavian
`[:script=Sidd:]`	Siddham
`[:script=Sind:]`	Khudawadi
`[:script=Sora:]`	Sora Sompeng
`[:script=Tagb:]`	Tagbanwa
`[:script=Tglg:]`	Tagalog
`[:script=Takr:]`	Takri
`[:script=Tirh:]`	Tirhuta
`[:script=Ugar:]`	Ugaritic
`[:script=Wara:]`	Warang Citi
`[:script=Xpeo:]`	Old Persian
`[:script=Xsux:]`	Cuneiform
`[[:Extender=True:] & [:Joining_Type=Join_Causing:]]`	0640 ( ‎ـ‎ ) ARABIC TATWEEL 07FA ( ‎ߺ‎ ) NKO LAJANYALAN
`[:Default_Ignorable_Code_Point:]`	Default Ignorable Code Points. See Section 2.3, Layout and Format Control Characters
`[:block=Combining_Diacritical_Marks_for_Symbols:] [:block=Musical_Symbols:] [:block=Ancient_Greek_Musical_Notation:] [:block=Phaistos_Disc:]`

The scripts listed in Table 5, Recommended Scripts are generally recommended for use in identifiers. These are in widespread modern customary use, or are regional scripts in modern customary use by large communities.

Table 5. Recommended Scripts

Property Notation	Description
`[:script=Zyyy:]`	Common
`[:script=Zinh:]`	Inherited
`[:script=Arab:]`	Arabic
`[:script=Armn:]`	Armenian
`[:script=Beng:]`	Bengali
`[:script=Bopo:]`	Bopomofo
`[:script=Cyrl:]`	Cyrillic
`[:script=Deva:]`	Devanagari
`[:script=Ethi:]`	Ethiopic
`[:script=Geor:]`	Georgian
`[:script=Grek:]`	Greek
`[:script=Gujr:]`	Gujarati
`[:script=Guru:]`	Gurmukhi
`[:script=Hani:]`	Han
`[:script=Hang:]`	Hangul
`[:script=Hebr:]`	Hebrew
`[:script=Hira:]`	Hiragana
`[:script=Knda:]`	Kannada
`[:script=Kana:]`	Katakana
`[:script=Khmr:]`	Khmer
`[:script=Laoo:]`	Lao
`[:script=Latn:]`	Latin
`[:script=Mlym:]`	Malayalam
`[:script=Mymr:]`	Myanmar
`[:script=Orya:]`	Oriya
`[:script=Sinh:]`	Sinhala
`[:script=Taml:]`	Tamil
`[:script=Telu:]`	Telugu
`[:script=Thaa:]`	Thaana
`[:script=Thai:]`	Thai
`[:script=Tibt:]`	Tibetan

The scripts listed in Table 6, Aspirational Use Scripts are scripts that would otherwise qualify as Limited Use (see Table 7, Limited Use Scripts), but have strong current efforts to increase their usage.

Table 6. Aspirational Use Scripts

Property Notation	Description
`[:script=Cans:]`	Canadian Aboriginal
`[:script=Plrd:]`	Miao
`[:script=Mong:]`	Mongolian
`[:script=Tfng:]`	Tifinagh
`[:script=Yiii:]`	Yi

Modern scripts that are in more limited use are listed in Table 7, Limited Use Scripts. To avoid security issues, some implementations may wish to disallow the limited-use scripts in identifiers. For more information on usage, see the Unicode Locale project [CLDR].

Table 7. Limited Use Scripts

Property Notation	Description
`[:script=Bali:]`	Balinese
`[:script=Bamu:]`	Bamum
`[:script=Batk:]`	Batak
`[:script=Cakm:]`	Chakma
`[:script=Cham:]`	Cham
`[:script=Cher:]`	Cherokee
`[:script=Java:]`	Javanese
`[:script=Kali:]`	Kayah Li
`[:script=Lana:]`	Tai Tham
`[:script=Lepc:]`	Lepcha
`[:script=Limb:]`	Limbu
`[:script=Lisu:]`	Lisu
`[:script=Mand:]`	Mandaic
`[:script=Mtei:]`	Meetei Mayek
`[:script=Nkoo:]`	Nko
`[:script=Olck:]`	Ol Chiki
`[:script=Saur:]`	Saurashtra
`[:script=Sund:]`	Sundanese
`[:script=Sylo:]`	Syloti Nagri
`[:script=Syrc:]`	Syriac
`[:script=Tale:]`	Tai Le
`[:script=Talu:]`	New Tai Lue
`[:script=Tavt:]`	Tai Viet
`[:script=Vaii:]`	Vai

This is the recommendation as of the current version of Unicode; as new scripts are added to future versions of Unicode, characters may be added to Tables 4, 5, 6, and 7. Characters may also be moved from one table to another as more information becomes available.

There are a few special cases:

The Common and Inherited script values [[:script=Zyyy:][:script=Zinh:]] are used widely with other scripts, rather than being scripts per se. See also the Script_Extensions property in the Unicode Character Database [UAX44].
The Unknown script [:script=Zzzz:] is used for Unassigned characters.
Braille [:script=Brai:] consists only of symbols.
Katakana_Or_Hiragana [:script=Hrkt:] is empty (used historically in Unicode, but no longer).
With respect to the scripts Balinese, Cham, Ol Chiki, Vai, Kayah Li, and Saurashtra, there may be large communities of people speaking an associated language, but the script itself is not in widespread use. However, there are significant revival efforts.
Bopomofo is used primarily in education.

For programming language identifiers, normalization and case have a number of important implications. For a discussion of these issues, see Section 5, Normalization and Case.

2.5 Backward Compatibility

Unicode General_Category values are kept as stable as possible, but they can change across versions of the Unicode Standard. The bulk of the characters having a given value are determined by other properties, and the coverage expands in the future according to the assignment of those properties. In addition, the Other_ID_Start property provides a small list of characters that qualified as ID_Start characters in some previous version of Unicode solely on the basis of their General_Category properties, but that no longer qualify in the current version. These are called grandfathered characters.

The Other_ID_Start property includes characters such as the following:

U+2118 ( ℘ ) SCRIPT CAPITAL P
U+212E ( ℮ ) ESTIMATED SYMBOL
U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Similarly, the Other_ID_Continue property adds a small list of characters that qualified as ID_Continue characters in some previous version of Unicode solely on the basis of their General_Category properties, but that no longer qualify in the current version.

The Other_ID_Continue property includes characters such as the following:

U+1369 ETHIOPIC DIGIT ONE..U+1371 ETHIOPIC DIGIT NINE
U+00B7 ( · ) MIDDLE DOT
U+0387 ( · ) GREEK ANO TELEIA
U+19DA ( ᧚ ) NEW TAI LUE THAM DIGIT ONE

The exact list of characters covered by the Other_ID_Start and Other_ID_Continue properties depends on the version of Unicode. For more information, see Unicode Standard Annex #44, “Unicode Character Database” [UAX44].
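As a rough illustration of the grandfathered characters listed above, the following Python sketch compares a character's General_Category with the result of str.isidentifier(). It assumes that the Python runtime bases str.isidentifier() on XID_Start and XID_Continue, and the results reflect whichever Unicode version the interpreter ships rather than Unicode 8.0 specifically. The helper name qualifies_by_category is hypothetical.

```python
import unicodedata

# General_Category values that make a character an ID_Start candidate by default.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def qualifies_by_category(ch: str) -> bool:
    """True if ch would qualify for ID_Start from its General_Category alone."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


# U+2118 and U+212E are symbols by category, yet can still start an identifier.
for ch in ("\u2118", "\u212E"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          qualifies_by_category(ch), ch.isidentifier())

# U+00B7 MIDDLE DOT is punctuation, yet is still allowed after the first character.
print("a\u00B7b".isidentifier(), "\u00B7".isidentifier())
```

The point of the comparison is that a lexer relying on General_Category alone would reject identifiers that were valid under an earlier version of Unicode.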

The Other_ID_Start and Other_ID_Continue properties are thus designed to ensure that the Unicode identifier specification is backward compatible. Any sequence of characters that qualified as an identifier in some version of Unicode will continue to qualify as an identifier in future versions.

If a specification tailors the Unicode recommendations for identifiers, then this technique can also be used to maintain backwards compatibility across versions.

3 Alternative Identifier Syntax

The disadvantage of working with the lexical classes defined previously is the storage space needed for the detailed definitions, plus the fact that with each new version of the Unicode Standard new characters are added, which an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.

This problem can be addressed by turning the question around. Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML specification by the W3C, Version 1.0 5th Edition or later [XML].

By increasing the set of disallowed characters, a reasonably intuitive recommendation for identifiers can be achieved. This approach uses the full specification of identifier classes, as of a particular version of the Unicode Standard, and permanently disallows any characters not recommended in that version for inclusion in identifiers. All code points unassigned as of that version would be allowed in identifiers, so that any future additions to the standard would already be accounted for. This approach ensures both upwardly compatible identifier stability and a reasonable division of characters into those that do and do not make human sense as part of identifiers.

With or without such fine-tuning, such a compromise approach still incurs the expense of implementing large lists of code points. While they no longer change over time, it is a matter of choice whether the benefit of enforcing somewhat word-like identifiers justifies their cost.

Alternatively, one can use the properties described below and allow all sequences of characters to be identifiers that are neither Pattern_Syntax nor Pattern_White_Space. This has the advantage of simplicity and small tables, but allows many more “unnatural” identifiers.

UAX31-R2. Alternative Identifiers: To meet this requirement, an implementation shall define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

Pattern_White_Space=True
Pattern_Syntax=True
General_Category=Private_Use, Surrogate, or Control
Noncharacter_Code_Point=True

Alternatively, it shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the sets of code points defined by these properties.
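The property test in R2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the standard unicodedata module does not expose Pattern_Syntax or Pattern_White_Space, so the whitespace set is written out by hand and the syntax set shown is only the ASCII portion of Pattern_Syntax. A real implementation would load the full Pattern_Syntax set from PropList.txt. The names is_alternative_identifier and ASCII_PATTERN_SYNTAX are hypothetical.

```python
import unicodedata

# Pattern_White_Space code points, written out by hand.
PATTERN_WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)

# ASSUMPTION: only the ASCII part of Pattern_Syntax; the full set is much larger.
ASCII_PATTERN_SYNTAX = frozenset(
    [*range(0x21, 0x30), *range(0x3A, 0x41), *range(0x5B, 0x5F),
     0x60, *range(0x7B, 0x7F)]
)

EXCLUDED_CATEGORIES = {"Co", "Cs", "Cc"}  # Private_Use, Surrogate, Control


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_alternative_identifier(s: str, pattern_syntax=ASCII_PATTERN_SYNTAX) -> bool:
    """Check s against the four exclusions listed in R2."""
    if not s:
        return False  # identifiers must be non-empty
    for ch in s:
        cp = ord(ch)
        if cp in PATTERN_WHITE_SPACE or cp in pattern_syntax:
            return False
        if unicodedata.category(ch) in EXCLUDED_CATEGORIES:
            return False
        if is_noncharacter(cp):
            return False
    return True
```

Because every check is against a fixed set, unassigned code points pass the test, which is what gives this approach its stability across Unicode versions.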

In its profile, a specification can define identifiers to be more in accordance with the Unicode identifier definitions at the time the profile is adopted, while still allowing for strict immutability. For example, an implementation adopting a profile after a particular version of Unicode is released (such as Unicode 5.0) could define the profile as follows:

All characters satisfying R1 Default Identifiers according to Unicode 5.0
Plus all code points unassigned in Unicode 5.0 that do not have the property values specified in R2 Alternative Identifiers.

This technique allows identifiers to have a more natural format (excluding symbols and punctuation already defined) yet also provides absolute code point immutability.

Specifications should also include guidelines and recommendations for those creating new identifiers. Although R2 Alternative Identifiers permits a wide range of characters, as a best practice identifiers should be in the format NFKC, without using any unassigned characters. For more information on NFKC, see Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15].

4 Pattern Syntax

There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel or ICU number formats, and many others. In the past, regular expressions and other formal languages have been forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.

For forward and backward compatibility, it is advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium has made regarding completely stable identifiers, and the practice that is seen in XML 1.0, 5th Edition or later [XML]. (In particular, the Unicode Consortium is committed to not allocating characters suitable for identifiers in the range U+2190..U+2BFF, which is being used by XML 1.0, 5th Edition.)

With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. Using this policy preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results. Consider the following scenario, for example.

In version 1.0 of program X, '≈' is a reserved syntax character; that is, it does not perform an operation, and it needs to be quoted. In this example, '\' quotes the next character; that is, it causes it to be treated as a literal instead of a syntax character. In version 2.0 of program X, '≈' is given a real meaning, for example, “uppercase the subsequent characters”.
The pattern abc...\≈...xyz works on both versions 1.0 and 2.0, and refers to the literal character because it is quoted in both cases.
The pattern abc...≈...xyz works on version 2.0 and uppercases the following characters. On version 1.0, the engine (rightfully) has no idea what to do with ≈. Rather than silently fail (by ignoring ≈ or turning it into a literal), it has the opportunity to signal an error.
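The scenario above can be sketched as a toy interpreter. Everything here is hypothetical and stands in for "program X": the function name expand, the version parameter, and the choice to leave quoted characters untouched by the uppercase operation are all assumptions made for illustration.

```python
RESERVED = frozenset("\u2248")  # '≈', reserved in 1.0 and given a meaning in 2.0
QUOTE = "\\"                    # quotes the next character


def expand(pattern: str, version: int) -> str:
    """Interpret a pattern the way the hypothetical program X might."""
    out = []
    uppercase = False
    chars = iter(pattern)
    for ch in chars:
        if ch == QUOTE:
            literal = next(chars, None)
            if literal is None:
                raise ValueError("quote character at end of pattern")
            out.append(literal)  # quoted: always a literal, in every version
        elif ch in RESERVED:
            if version < 2:
                # Reserved but meaningless in 1.0: signal an error, never guess.
                raise ValueError(f"unquoted reserved character {ch!r}")
            uppercase = True     # 2.0 meaning: uppercase what follows
        else:
            out.append(ch.upper() if uppercase else ch)
    return "".join(out)
```

With this policy, expand("abc\\≈xyz", 1) and expand("abc\\≈xyz", 2) both yield the literal character, while the unquoted form is rejected by version 1 instead of being silently misread.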

As of Unicode 4.1, two Unicode character properties are defined to provide for stable syntax: Pattern_White_Space and Pattern_Syntax. Particular pattern languages may, of course, override these recommendations, for example, by adding or removing other characters for compatibility with ASCII usage.

For stability, the values of these properties are absolutely invariant, not changing with successive versions of Unicode. Of course, this does not limit the ability of the Unicode Standard to encode more symbol or whitespace characters, but the syntax and whitespace code points recommended for use in patterns will not change.

When generating rules or patterns, all whitespace and syntax code points that are to be literals require quoting, using whatever quoting mechanism is available. For readability, it is recommended practice to quote or escape all literal whitespace and default ignorable code points as well.

Consider the following example, where the items in angle brackets indicate literal characters:
a<SPACE>b → x<ZERO WIDTH SPACE>y + z;
Because <SPACE> is a Pattern_White_Space character, it requires quoting. Because <ZERO WIDTH SPACE> is a default ignorable character, it should also be quoted for readability. So in this example, if \uXXXX is used for a code point literal, but is resolved before quoting, and if single quotes are used for quoting, this example might be expressed as:
'a\u0020b' → 'x\u200By' + z;
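A pattern generator following this practice might look like the sketch below. It rests on two loud assumptions: default ignorable code points are approximated by General_Category=Cf (the real Default_Ignorable_Code_Point property covers a different set and is not exposed by Python's unicodedata), and quoting of the quote character itself is not handled. The name quote_literal is hypothetical.

```python
import unicodedata

# Pattern_White_Space code points, written out by hand.
PATTERN_WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)


def needs_escape(ch: str) -> bool:
    # ASSUMPTION: gc=Cf is used as a stand-in for "default ignorable".
    return ord(ch) in PATTERN_WHITE_SPACE or unicodedata.category(ch) == "Cf"


def quote_literal(text: str) -> str:
    """Wrap text in single quotes, escaping whitespace and format characters."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if needs_escape(ch):
            parts.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            parts.append(ch)
    return "'" + "".join(parts) + "'"


print(quote_literal("a b"), "→", quote_literal("x\u200By"), "+ z;")
```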

UAX31-R3. Pattern_White_Space and Pattern_Syntax Characters: To meet this requirement, an implementation shall use Pattern_White_Space characters as all and only those characters interpreted as whitespace in parsing, and shall use Pattern_Syntax characters as all and only those characters with syntactic use. Alternatively, it shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the sets of code points defined by these properties.

Note: When meeting this requirement, all characters except those that have the Pattern_White_Space or Pattern_Syntax properties are available for use as identifiers or literals.

5 Normalization and Case

This section discusses issues that must be taken into account when considering normalization and case folding of identifiers in programming languages or scripting languages. Using normalization avoids many problems where apparently identical identifiers are not treated equivalently. Such problems can appear both during compilation and during linking, in particular across different programming languages. To avoid such problems, programming languages can normalize identifiers before storing or comparing them. Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate; whereas, if the programming language has case-insensitive identifiers, then Normalization Form KC is more appropriate.

Implementations that take normalization and case into account have two choices: to treat variants as equivalent, or to disallow variants.

UAX31-R4. Equivalent Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any. If the Normalization Form is NFKC, the implementation shall apply the modifications in Section 5.1, NFKC Modifications, given by the properties XID_Start and XID_Continue. Except for identifiers containing excluded characters, any two identifiers that have the same Normalization Form shall be treated as equivalent by the implementation.

UAX31-R5. Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.
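The equivalence tests in R4 and R5 can be sketched with Python's standard library. The function names are hypothetical, no characters are excluded from normalization, and str.casefold() is taken as the full case folding of whichever Unicode version the interpreter ships.

```python
import unicodedata


def equivalent_normalized(a: str, b: str, form: str = "NFC") -> bool:
    """R4 sketch: identifiers are equivalent if their normalized forms match."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def equivalent_caseless(a: str, b: str) -> bool:
    """R5 sketch with full case folding."""
    return a.casefold() == b.casefold()


# Precomposed and decomposed e-acute are the same identifier under NFC.
print(equivalent_normalized("\u00E9", "e\u0301"))
# The fi ligature matches "fi" only under a compatibility form.
print(equivalent_normalized("\uFB01", "fi", "NFC"),
      equivalent_normalized("\uFB01", "fi", "NFKC"))
# Full case folding maps sharp s to "ss".
print(equivalent_caseless("Stra\u00DFe", "STRASSE"))
```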

UAX31-R6. Filtered Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any. If the Normalization Form is NFKC, the implementation shall apply the modifications in Section 5.1, NFKC Modifications, given by the properties XID_Start and XID_Continue. Except for identifiers containing excluded characters, allowed identifiers must be in the specified Normalization Form.

Note: For requirement R6, filtering involves disallowing any characters in the set [:NFKC_QuickCheck=No:], or equivalently, disallowing [:^isNFKC:].

UAX31-R7. Filtered Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Except for identifiers containing excluded characters, allowed identifiers must be in the specified case-folded form.

Note: For requirement R7 with full case folding, filtering involves disallowing any characters in the set [:^isCasefolded:].
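Filtering, as opposed to treating variants as equivalent, rejects any identifier that is not already in the required form. A minimal Python sketch, checking whole strings rather than individual characters and using hypothetical function names:

```python
import unicodedata


def is_filtered_normalized(s: str, form: str = "NFC") -> bool:
    """R6 sketch: accept s only if normalizing it changes nothing."""
    return unicodedata.normalize(form, s) == s


def is_filtered_casefolded(s: str) -> bool:
    """R7 sketch with full case folding: accept s only if already case folded."""
    return s.casefold() == s


print(is_filtered_normalized("e\u0301"))         # decomposed: rejected under NFC
print(is_filtered_normalized("\uFB01", "NFKC"))  # ligature: rejected under NFKC
print(is_filtered_casefolded("Abc"), is_filtered_casefolded("abc"))
```

A compiler taking this route would report an error for the rejected spellings instead of silently mapping them to the accepted one.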

As of Unicode 5.2, an additional string transform is available for use in matching identifiers: toNFKC_Casefold(S). See R5 in Section 3.13, Default Case Algorithms in [Unicode]. That operation case folds and normalizes a string, and also removes default ignorable code points. It can be used to support an implementation of Equivalent Case and Compatibility-Insensitive Identifiers. There is a corresponding boolean property, Changes_When_NFKC_Casefolded, which can be used to support an implementation of Filtered Case and Compatibility-Insensitive Identifiers. The NFKC_Casefold character mapping property and the Changes_When_NFKC_Casefolded property are described in Unicode Standard Annex #44, "Unicode Character Database" [UAX44].

Note: In mathematically oriented programming languages that make distinctive use of the Mathematical Alphanumeric Symbols, such as U+1D400 MATHEMATICAL BOLD CAPITAL A, an application of NFKC must filter characters to exclude characters with the property value Decomposition_Type=Font.
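One way to picture the filtering described in the note is the Python sketch below. It assumes that a decomposition tagged `<font>` in unicodedata.decomposition() corresponds to Decomposition_Type=Font, and it normalizes the runs between such characters separately, which is a simplification: interactions across run boundaries (for example, a combining mark following a protected character) are not treated specially. Both function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def has_font_decomposition(ch: str) -> bool:
    """True if the character's compatibility decomposition is tagged <font>."""
    return unicodedata.decomposition(ch).startswith("<font>")


def nfkc_keep_font(s: str) -> str:
    """Apply NFKC, but leave characters with a <font> decomposition untouched."""
    out = []
    for is_font, run in groupby(s, key=has_font_decomposition):
        text = "".join(run)
        out.append(text if is_font else unicodedata.normalize("NFKC", text))
    return "".join(out)


bold_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A
print(unicodedata.normalize("NFKC", bold_a) == "A")   # plain NFKC loses the distinction
print(nfkc_keep_font(bold_a + "\uFB01") == bold_a + "fi")
```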

5.1 NFKC Modifications

Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties.

5.1.1 Modifications for Characters that Behave Like Combining Marks

Certain characters are not formally combining characters, although they behave in most respects as if they were. In most cases, the mismatch does not cause a problem, but when these characters have compatibility decompositions, they can cause identifiers not to be closed under Normalization Form KC. In particular, the following four characters are included in XID_Continue and not XID_Start:

U+0E33 THAI CHARACTER SARA AM
U+0EB3 LAO VOWEL SIGN AM
U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
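The reason for this modification can be seen with U+0E33. The Python sketch below assumes that str.isidentifier() follows XID_Start and XID_Continue, as documented for Python 3. If U+0E33 were allowed to start an identifier, its NFKC form would begin with a combining mark, and the result would no longer be an identifier.

```python
import unicodedata

sara_am = "\u0E33"  # THAI CHARACTER SARA AM, General_Category Lo
decomposed = unicodedata.normalize("NFKC", sara_am)

# NFKC turns SARA AM into NIKHAHIT (a nonspacing mark) followed by SARA AA.
print([f"U+{ord(c):04X} {unicodedata.category(c)}" for c in decomposed])

# A letter by category, but not an identifier start ...
print(unicodedata.category(sara_am), sara_am.isidentifier())

# ... while it is accepted in a continue position, before and after NFKC.
word = "\u0E01" + sara_am  # THAI CHARACTER KO KAI + SARA AM
print(word.isidentifier(), unicodedata.normalize("NFKC", word).isidentifier())
```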

5.1.2 Modifications for Irregularly Decomposing Characters

U+037A GREEK YPOGEGRAMMENI and certain Arabic presentation forms have irregular compatibility decompositions and are excluded from both XID_Start and XID_Continue. It is recommended that all Arabic presentation forms be excluded from identifiers in any event, although only a few of them must be excluded for normalization to guarantee identifier closure.

5.1.3 Identifier Closure Under Normalization

With these amendments to the identifier syntax, all identifiers are closed under all four Normalization Forms. This means that for any string S, the implications shown in Figure 5 hold.

Figure 5. Normalization Closure

isIdentifier(S) → isIdentifier(toNFD(S))
isIdentifier(S) → isIdentifier(toNFC(S))
isIdentifier(S) → isIdentifier(toNFKD(S))
isIdentifier(S) → isIdentifier(toNFKC(S))

Identifiers are also closed under case operations. For any string S (with exceptions involving a single character), the implications shown in Figure 6 hold.

Figure 6. Case Closure

isIdentifier(S) → isIdentifier(toLowercase(S))
isIdentifier(S) → isIdentifier(toUppercase(S))
isIdentifier(S) → isIdentifier(toFoldedcase(S))

The one exception for casing is U+0345 COMBINING GREEK YPOGEGRAMMENI. In the very unusual case that U+0345 is at the start of S, U+0345 is not in XID_Start, but its uppercase and case-folded versions are. In practice, this is not a problem because of the way normalization is used with identifiers.
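The closure properties can be spot-checked with a short Python sketch. It assumes that str.isidentifier() follows XID_Start and XID_Continue and that str.lower(), str.upper(), and str.casefold() stand in for toLowercase, toUppercase, and toFoldedcase. The sample strings are arbitrary and prove nothing in general; they only illustrate the implications.

```python
import unicodedata

NORMALIZATION_FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalization_closed(s: str) -> bool:
    """Every normalization form of s is still an identifier."""
    return all(unicodedata.normalize(f, s).isidentifier()
               for f in NORMALIZATION_FORMS)


def case_closed(s: str) -> bool:
    """Every case-mapped form of s is still an identifier."""
    return all(t.isidentifier() for t in (s.lower(), s.upper(), s.casefold()))


samples = ["stra\u00DFe", "\u01C4emal", "\uFF76\uFF9E", "\u0E01\u0E33", "e\u0301cole"]
for s in samples:
    print(s, s.isidentifier(), normalization_closed(s), case_closed(s))

# The exception: U+0345 alone is not an identifier, but its case mappings are.
ypogegrammeni = "\u0345"
print(ypogegrammeni.isidentifier(),
      ypogegrammeni.upper().isidentifier(),
      ypogegrammeni.casefold().isidentifier())
```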

The reverse implication is true for canonical equivalence but not true in the case of compatibility equivalence:

Figure 7. Reverse Normalization Closure

`isIdentifier(toNFD(S)) isIdentifier(toNFC(S))`	→ `isIdentifier(S)`
`isIdentifier(toNFKD(S)) isIdentifier(toNFKC(S))`	↛ `isIdentifier(S)`

There are many characters for which the reverse implication is not true for compatibility equivalence, because there are many characters counting as symbols or non-decimal numbers (and thus outside of identifiers) whose compatibility equivalents are letters or decimal numbers and thus in identifiers. Some examples are shown in Table 8.

Table 8. Compatibility Equivalents to Letters or Decimal Numbers

Code Points	GC	Samples	Names
2070	No	⁰	SUPERSCRIPT ZERO
20A8	Sc	₨	RUPEE SIGN
2116	So	№	NUMERO SIGN
2120..2122	So	℠..™	SERVICE MARK..TRADE MARK SIGN
2460..2473	No	①..⑳	CIRCLED DIGIT ONE..CIRCLED NUMBER TWENTY
3300..33A6	So	㌀..㎦	SQUARE APAATO..SQUARE KM CUBED

If an implementation needs to ensure both directions for compatibility equivalence of identifiers, then the identifier definition needs to be tailored to add these characters.
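A few of the characters from Table 8 make the failed reverse implication concrete. The Python sketch below again assumes that str.isidentifier() follows XID_Start and XID_Continue; the helper name reverse_holds is hypothetical.

```python
import unicodedata


def reverse_holds(s: str, form: str) -> bool:
    """Check isIdentifier(normalize(S)) → isIdentifier(S) for one string."""
    return (not unicodedata.normalize(form, s).isidentifier()) or s.isidentifier()


# Compatibility equivalence: the normalized string is an identifier, S is not.
for s in ("\u2122", "\u2116", "x\u2460", "x\u2070"):
    folded = unicodedata.normalize("NFKC", s)
    print(repr(s), "→", repr(folded), s.isidentifier(), folded.isidentifier())

# Canonical equivalence: the implication holds for this sample.
print(reverse_holds("e\u0301", "NFC"), reverse_holds("\u2122", "NFKC"))
```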

For canonical equivalence the implication is true in both directions. isIdentifier(toNFC(S)) if and only if isIdentifier(S).

There were two exceptions before Unicode 5.1, as shown in Table 9. If an implementation needs to ensure full canonical equivalence of identifiers, then the identifier definition must be tailored so that these characters have the same value, so that either both isIdentifier(S) and isIdentifier(toNFC(S)) are true, or so that both values are false.

Table 9. Canonical Equivalence Exceptions Prior to Unicode 5.1

isIdentifier(toNFC(S))=True	isIdentifier(S)=False	Different in
02B9 ( ʹ ) MODIFIER LETTER PRIME	0374 ( ʹ ) GREEK NUMERAL SIGN	XID and ID
00B7 ( · ) MIDDLE DOT	0387 ( · ) GREEK ANO TELEIA	XID alone

Those programming languages with case-insensitive identifiers should use the case foldings described in Section 3.13, Default Case Algorithms, of [Unicode] to produce a case-insensitive normalized form.

When source text is parsed for identifiers, the folding of distinctions (using case mapping or NFKC) must be delayed until after parsing has located the identifiers. Thus such folding of distinctions should not be applied to string literals or to comments in program source text.

The Unicode Standard supports case folding with normalization, with the function toNFKC_Casefold(X). See definition R5 in Section 3.13, Default Case Algorithms in [Unicode] for the specification of this function and further explanation of its use.

5.2 Case and Stability

The alphabetic case of the initial character of an identifier is used as a mechanism to distinguish syntactic classes in some languages like Prolog, Erlang, Haskell, Clean, and Go. For example, in Prolog and Erlang, variables must begin with capital letters (or underscores) and atoms must not. There are some complications in the use of this mechanism.

For such a casing distinction in a programming language to work with unicameral writing systems (such as Kanji or Devanagari), another mechanism (such as underscores) needs to substitute for the casing distinction.

Casing stability is also an issue for bicameral writing systems. The assignment of General_Category property values, such as gc=Lu, is not guaranteed to be stable, nor is the assignment of characters to the broader properties such as Uppercase. So these property values cannot be used by themselves, without incorporating a grandfathering mechanism, such as is done for Unicode identifiers in Section 2.5 Backward Compatibility. That is, the implementation would maintain its own list of special inclusions and exclusions that require updating for each new version of Unicode.

Alternatively, a programming language specification can use the operation specified in Case Folding Stability as the basis for its casing distinction. That operation is guaranteed to be stable. That is, one can use a casing distinction such as the following:

S is a variable if S begins with an underscore.
Otherwise, produce S' = toCasefold(toNFKC(S))
1. S is a variable if firstCodePoint(S) ≠ firstCodePoint(S'),
2. otherwise S is an atom.
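The test above translates almost directly into Python. In this sketch str.casefold() is used for toCasefold and unicodedata.normalize("NFKC", ...) for toNFKC; the function name classify and the string labels are hypothetical, and a non-empty identifier is assumed.

```python
import unicodedata


def classify(s: str) -> str:
    """Return "variable" or "atom" using the case-folding test."""
    if not s:
        raise ValueError("identifier must be non-empty")
    if s.startswith("_"):
        return "variable"
    folded = unicodedata.normalize("NFKC", s).casefold()  # S'
    # Python strings are sequences of code points, so [0] is firstCodePoint.
    return "variable" if s[0] != folded[0] else "atom"


for name in ("Foo", "foo", "_tmp", "\u00DCber", "\u5909\u6570"):
    print(name, classify(name))
```

Identifiers in unicameral scripts always come out as atoms here, which is why another mechanism such as the underscore is needed for them.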

This test can clearly be optimized for the normal cases, such as initial ASCII. It is also recommended that identifiers be in NFKC format, which makes the detection even simpler.

5.2.1 Edge Cases for Folding

In Unicode 8.0, the Cherokee script letters have been changed from gc=Lo to gc=Lu, and corresponding lowercase letters (gc=Ll) have been added. This is an unusual pattern; typically when case pairs are added, existing letters are changed from gc=Lo to gc=Ll, and new corresponding uppercase letters (gc=Lu) are added. In the case of Cherokee, it was felt that this solution provided the most compatibility for existing implementations in terms of font treatment.

The downside of this approach is that the Cherokee characters, when case-folded, will convert as necessary to the pre-8.0 characters, namely to the uppercase versions. This folding is unlike that of any other case-mapped characters in Unicode. Thus the case-folded version of a Cherokee string will contain uppercase letters instead of lowercase letters. Compatibility with fonts for the current user community was felt to be more important than the confusion introduced by this edge case of case folding, because Cherokee programmatic identifiers would be rare.
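The Cherokee behavior can be observed directly in Python, assuming an interpreter whose Unicode data is version 8.0 or later. The variable names are illustrative only.

```python
import unicodedata

upper_a = "\u13A0"  # CHEROKEE LETTER A
lower_a = "\uAB70"  # CHEROKEE SMALL LETTER A

print(unicodedata.category(upper_a), unicodedata.category(lower_a))

# Case folding maps the small letter to the uppercase (pre-8.0) letter ...
print(lower_a.casefold() == upper_a, upper_a.casefold() == upper_a)

# ... so folding and lowercasing disagree for Cherokee.
print(upper_a.lower() == lower_a, upper_a.lower() == upper_a.casefold())
```

A comparison built on lowercasing and one built on case folding therefore produce different keys for the same Cherokee identifier.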

The upshot is that when it comes to identifiers, implementations should never use the General_Category or Lowercase or Uppercase properties to test for casing conditions, nor use toUppercase(), toLowercase(), or toTitlecase() to fold or test identifiers. Instead, they should use Case_Folding or NFKC_Casefold.

Acknowledgments

Mark Davis is the author of the initial version and has added to and maintained the text of this annex.

Thanks to Eric Muller, Asmus Freytag, Julie Allen, Kenneth Whistler, and Martin Duerst for feedback on this annex.

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Migration

Between Unicode Versions 5.2, 6.0 and 6.1, Table 5 was split in three. In Version 6.1, the resulting tables were renumbered for easier reference. The titles and links remain the same, for stability.

The following shows the correspondences:

Current Tables	Unicode 6.0	Unicode 5.2
Table 5, Recommended Scripts	5a	5
Table 6, Aspirational Use Scripts	5a	5
Table 7, Limited Use Scripts	5b	5
Table 8, Compatibility Equivalents to Letters or Decimal Numbers	6	6
Table 9, Canonical Equivalence Exceptions Prior to Unicode 5.1	7	7

Modifications

The following summarizes modifications from previous revisionsof this annex.

Revision 23 [MD, KW]

Reissued for Unicode 8.0.
Revision and correction of table styles and requirement styles.
Removed references to nonexistent levels in the conformance clauses.
Moved R1, R1a and R1b into more appropriate context ahead of Section 2.1, instead of at the bottom of Section 2.5, Backward Compatibility.
Added links to figures and tables in the Table of Contents.
Section 2.4 Specific Character Adjustments
- Added text about the decomposition of Middle Dot.
- Added 8.0 scripts to Table 4, Candidate Characters for Exclusion from Identifiers
Section 5.2 Case and Stability added to discuss case distinctions used in Prolog, Erlang, Haskell, Clean, and Go programming identifiers.

Revision 22 being a proposed update, only changes between revisions 23 and 21 are noted here.

Revision 21

Reissued for Unicode 7.0.
Aligned the text of Table 2. Lexical Classes for Identifiers with the derivation in DerivedCoreProperties.txt.
Section 2.4 Specific Character Adjustments
- Changed the text on natural-language identifiers to have a stronger recommendation for including the exception characters, and include the Catalan MIDDLE DOT
- Added the new 7.0 scripts to Table 4.
- Added pointer to Section 5, Normalization and Case in the discussion of MIDDLE DOT.

Revision 20 being a proposed update, only changes between revisions 21 and 19 are noted here.

Revision 19

Reissued for Unicode 6.3.0.
Added missing text for U+0D3F VOWEL SIGN I in Figure 3.
Added Figure 7 to clarify closure under normalization in Section 5.1 NFKC Modifications.
Clarified text around conditions A1, A2, and B in Section 2.3 Layout and Format Control Characters, and moved some text to Limitations.
Minor edits.

Revision 18 being a proposed update, only changes between revisions 19 and 17 are noted here.

Revision 17

Reissued for Unicode 6.2.0.
Removed two outdated links.
Typographic corrections.

Revision 16 being a proposed update, only changes between revisions 17 and 15 are noted here.

Revision 15

Reissued for Unicode 6.1.0.
Added new scripts to the appropriate tables.
Added text in 1.3 Display Format and 2.3 Layout and Format Control Characters with more guidance on differences between display and internal use of identifiers, and the use of variation selectors.
Split Table 5a into two for easier reference. Renumbered Tables 5a1, 5a2, 5b, 6, and 7 to Tables 5, 6, 7, 8, and 9. Added a Migration section to show the relationship.
Made it clear that the lists of grandfathered characters in section 2.5 are illustrative.

Revision 14 being a proposed update, only changes between revisions 15 and 13 are noted here.

Revision 13

Reissued for Unicode 6.0.0.
Corrected two code points in figure 2. Corrected 2 code points in figure 3. Updated format of figures 2, 3, 4 to use small images in tables instead of large images.
Clarified text in Section 2.3 Layout and Format Control Characters.
Split Table 5 into two Tables, Table 5a, Recommended Scripts and Table 5b, Limited Use Scripts.
Added Brahmi, Mandaic, and Batak to Table 4, Candidate Characters for Exclusion from Identifiers and Table 5b, Limited Use Scripts.
Replaced discussion of case folding with normalization by paragraph pointing to toNFKC_Casefold(X).
Moved Syriac to Limited Use; qualified Tifinagh, Yi, UCAS, Mongolian as aspirational.
Deleted misleading sentence about stability, and added underbar to ID_Nonstart.
Added U+19DA to list of Other_ID_Continue characters in Section 2.5, Backward Compatibility.

Revision 12 being a proposed update, only changes between revisions 13 and 11 are noted here.

Revision 11

Reissued for Unicode 5.2.0.
Added pointer to name validation in #29.
Changed Qaai to Zinh.
Tightened the first two paragraphs in section 2.3 Layout and Format Control Characters, and the subsection on Comparison. Added clarification that the rules limit but do not prevent visual confusability.
Moved scripts between Tables 4 and 5, and marked some as limited use. Added pointer to CLDR.
Added note that there are also cases where a joiner is used in front of a virama.
In A1. Allow ZWNJ in the following context, changed $L and $R to $LJ and $RJ respectively to disambiguate from $L meaning Letter; fixed property name to Joining_Type, and fixed the text to correspond correctly to the regex.
Added HTML anchors to Figures and Tables.
Added to Candidate Characters for Inclusion in Identifiers: U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG and U+30FB ( ・ ) KATAKANA MIDDLE DOT.
Added to Candidate Characters for Exclusion from Identifiers or Recommended Scripts: Default Ignorable Code Points, Tatweel (-like) characters, and scripts Old Turkic, Old South Arabian, Imperial Aramaic, Inscriptional Parthian, Inscriptional Pahlavi, Avestan, Egyptian Hieroglyphs, Samaritan, Kaithi, Lisu, Meetei Mayek, Tai Tham, Tai Viet, Javanese, Bamum.
Updated caption style for figures and tables.
Added table captions and centered Tables 6 and 7 in Section 5.1.
Split the unnumbered identifier closure table in Section 5.1 into two Figures and adjusted the surrounding text for clarity.
Removed borders around images, and redrew Figures 2, 3, 4 for clarity.
Updated explanatory text for Figure 4.
Minor editorial cleanup.

Revision 10 being a proposed update, only changes between revisions 11 and 9 are noted here.

Revision 9

Updated for Unicode 5.1.0.
Fixed Table 2 to exclude Pattern_Syntax and Pattern_White_Space explicitly.
Added note under R2 Alternative Identifiers
Removed surrogates, private-use, and control from R2, added notes.
Noted restrictions on ZWJ/ZWNJ are as applied to NFC.
Added Section 2.2 Modifier Letters and renumbered sections.
Added Table 5, to show other scripts.
Noted that both Tables will require updating with successive versions of Unicode, as new scripts are added.
Broadened the discussion of Layout Controls to include other Default Ignorables in 2.3 Layout and Format Control Characters.
Minor reformatting of tables and figures, and addition of captions to tables.
Added descriptions of scripts in Table 4, Candidate Characters for Exclusion from Identifiers.
Added sentence about further restrictions to R1a.
Added line pointing to UTR36 for information about further restrictions.
Added to discussion of canonical equivalence of identifiers.
Added filtered identifiers and rules.
Added format character discussion and rules.

Revision 8 being a proposed update, only changes between revisions 9 and 7 are noted here.

Revision 7

Introduced the term profile.
Added note on profiles of identifiers for natural language in Section 2.3 Specific Character Adjustments
Minor editing for clarity in 2 Default Identifier Syntax
Added note on spaces in identifiers (e.g., in SQL)

Revision 6 being a proposed update, only changes between revisions 7 and 5 are noted here.

Revision 5

Removed section 4.1, because the two properties have been accepted for Unicode 4.1.
Expanded introduction
Adding information about stability, and tailoring for identifiers.
Added the list of characters in Other_ID_Continue.
Changed <identifier_continue> and <identifier_start> to just use the property names, to avoid confusion.
Included XID_Start and XID_Continue in R1 and elsewhere.
Added reference to UTR #36, and the phrase “or a list of additional constraints on identifiers” to R1.
Changed “Coverage” to “General Description of Coverage,” because the UCD values are definitive.
Added clarifications in 2.4
Revamped 2.2 Layout and Format Control Characters
Minor editing

Revision 3

Made draft UAX
Incorporated Annex 7 from UAX #15
Added Other_ID_Continue for Unicode 4.1
Added conformance clauses
Changed <identifier_extend> to <identifier_continue> to better match the property name.
Some additional edits.

Revision 2

Modified Pattern_White_Space to remove compatibility characters
Added example explaining use of Pattern_White_Space

Revision 1

First version: incorporated section from Unicode 4.0 on Identifiers plus new section on patterns.

Copyright © 2000–2015 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
