Doc. no.:	WG14/N1518 WG21/N3146 PL22.16/10-0136
Date:	2010-10-04
Reply to:	Clark Nelson
Phone:	+1-503-712-8433
Email:	clark.nelson@intel.com

Recommendations for extended identifier characters for C and C++

Introduction

In response to their 2010 FCD ballot, WG21 received the following comment (designated as CA 24) from the Canadian national body:

A list of issues related to TR 10176:2003
"Combining characters should not appear as the first character of an identifier." Reference: ISO/IEC TR 10176:2003 (Annex A). This is not reflected in FCD.
Restrictions on the first character of an identifier are not observed as recommended in TR 10176:2003. The inclusion of digits (outside of those in the basic character set) under identifier-nondigit is implied by FCD.
It is implied that only the "main listing" from Annex A is included for C++. That is, the list ends with the Special Characters section. This is not made explicit in FCD. Existing practice in C++03 as well as WG 14 (C, as of N1425) and WG 4 (COBOL, as of N4315) is to include a list in a normative Annex.
Specify width sensitivity as implied by C++03: \uFF21 is not the same as A. Case sensitivity is already stated in [lex.name].

It is reasonable to expect that WG14 would receive a very similar comment in response to an upcoming ballot.

Background

In investigating what various standards say about extended characters in identifiers, the following facts came to light.

General

The C++ WD now cites TR 10176:2003 for the specification of extended identifier characters.
The C standard (and WD) incorporates the lists from TR 10176:1998 for the specification of valid extended identifier characters.
There are differences between the lists in the C standard and in TR 10176:2003.
The lists in TR 10176:2003 are recommended as the minimum set of characters that an implementation should allow in identifiers.
The C standard explicitly allows implementations to accept other characters (implementation-defined) outside the basic source character set in identifiers.
The C++ standard (and WD) gives no such permission.
In C, a UCN not in the set of those specified as allowed results in undefined behavior.
In C++, a UCN not in the set of those specified as allowed requires a diagnostic.
TR 10176 is based on the same principles as the "Default Identifier Syntax" defined in Unicode UAX#31 (see [DefId]): very roughly, characters defined to be letters are allowed initially, characters defined to be digits and combining marks are allowed non-initially.
The identifier character set in XML 1.0 was originally defined using the same principles.
For the sake of stability in the presence of an expanding character set, UAX#31 also defines an "Alternative Identifier Syntax" (see [AltId]), in which anything is allowed but: white space, "syntax" characters, private use characters, surrogates, control characters, and non-characters.
Unicode defines "syntax" characters to include most punctuation and symbols (including mathematical operators).
XML 1.1 also defines the identifier character set generously, excluding only certain characters and character ranges.
The latest edition of XML 1.0 has adopted the identifier specification from XML 1.1.

Combining characters

The C standard (and presumably TR 10176:1998) does not list combining characters, compatibility presentation forms, or fullwidth or halfwidth forms as valid in identifiers.
TR 10176:2003 has a separate list, also in Annex A, for combining characters, compatibility presentation forms, fullwidth & halfwidth forms, etc.
The C++ standard references Annex A, without even acknowledging that it has two parts.
TR 10176:2003 recommends that a combining character should not appear as the first character of an identifier.

Digits

TR 10176:2003 gives digits as an example of a kind of character often not allowed initially, and calls them out as a separate category in Annex A, but makes no actual recommendation.
In C, a UCN representing a digit is not allowed to start an identifier.
The C++ standard has no such restriction.

Halfwidth and fullwidth variants

TR 10176:2003 makes no recommendation with respect to halfwidth or fullwidth variants, but notes that COBOL (in particular) considers halfwidth and fullwidth variants to be equivalent to the original character (at least in some contexts).
The C standard does not explicitly include halfwidth or fullwidth variants as identifier characters.
The C++ standard is silent on this topic.
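
To make the width question concrete: the fullwidth variants of the printable ASCII characters occupy FF01 through FF5E, at a fixed offset of FEE0 from their ordinary counterparts. The following is a minimal sketch, not taken from any of the standards discussed, of the two possible behaviors. The function names are hypothetical, and halfwidth forms (such as halfwidth katakana) are deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Map a fullwidth ASCII variant (FF01-FF5E) to its ordinary counterpart
   (0021-007E); every other code point is returned unchanged. */
static unsigned long fold_fullwidth_ascii(unsigned long cp)
{
    assert(cp <= 0x10FFFFul);   /* must lie within the ISO/IEC 10646 code space */
    if (cp >= 0xFF01ul && cp <= 0xFF5Eul)
        return cp - 0xFEE0ul;
    return cp;
}

/* Width-sensitive comparison: \uFF21 and A are different characters. */
static bool same_char_width_sensitive(unsigned long a, unsigned long b)
{
    return a == b;
}

/* Width-insensitive comparison: variants are folded before comparing. */
static bool same_char_width_insensitive(unsigned long a, unsigned long b)
{
    return fold_fullwidth_ascii(a) == fold_fullwidth_ascii(b);
}
```

Under the width-sensitive reading, FF21 and 0041 name different identifiers; under the width-insensitive reading they name the same one.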

Discussion

General

WG14 and WG21 have found that trying to keep a language standard in sync with an expanding character set definition can be problematic. The Unicode Consortium and the World Wide Web Consortium have both acknowledged this, and provided (normative) guidance to avoid the problem. Although TR 10176 is based on problematic principles, the recommendation that at least the specified characters should be accepted in identifiers can be completely satisfied using the [AltId] approach. Therefore, it seems reasonable to abandon the [DefId] and TR 10176 approach, in favor of something simpler and more stable.

In a sane world, C and C++ would use the same definition of valid extended identifier characters. In an ideal world, the definition would appear in the C standard, and be referenced by the C++ standard. The publication schedules of WG14 and WG21 would appear to put an ideal world out of reach. But putting textually identical specifications in annexes of both C and C++, based on the [AltId] principles, would seem to be feasible.

Specifics

Establishing the principle by which the set of valid extended identifier characters will be defined is clearly not enough; it is also necessary to select the exact definition.

In an ideal world, it would be possible to cite a definition from a different standard (as C++ was going to attempt to do by referencing [TR2003]). Citing the identifier syntax from [XML2008] would be problematic, for several reasons:

It's not an ISO standard.
It covers so much more than just the identifier syntax.
Its identifier syntax allows some inappropriate characters (including the basic source characters "-", "." and ":", and several punctuation marks used in CJK text), and disallows some appropriate characters (including "ª", "µ" and "º", which were allowed in C99).

The content and organization of UAX#31 would make it easier to cite. It's still not an ISO standard, but that may not be an insuperable problem. In general, the current definition of [AltId] seems to be better than [XML2008] at tracking recent assignments of blocks in Unicode. But the exact definition of [AltId] has what appears to be a very serious flaw.

[AltId] defines a property called "Pattern_White_Space"; characters having that property are disallowed as identifier characters. Several other categories of characters are also disallowed, including "Pattern_Syntax" characters and control characters. The "Pattern_White_Space" category includes the ASCII SPACE character, several control characters (including HT, LF and CR), and a few others (including the Unicode LINE SEPARATOR and PARAGRAPH SEPARATOR characters), but it does not include several other characters defined as spaces, including the Latin-1 NO-BREAK SPACE (NBSP).
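
The gap can be checked mechanically. The following is a minimal sketch, assuming the Pattern_White_Space membership as published in the Unicode property data (eleven code points); the function name is hypothetical and not part of any standard discussed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Code points carrying the Unicode Pattern_White_Space property
   (membership as assumed in the lead-in above). */
static bool is_pattern_white_space(unsigned long cp)
{
    assert(cp <= 0x10FFFFul);   /* must lie within the code space */
    return (cp >= 0x0009ul && cp <= 0x000Dul)   /* HT, LF, VT, FF, CR */
        || cp == 0x0020ul                       /* SPACE */
        || cp == 0x0085ul                       /* NEXT LINE */
        || cp == 0x200Eul || cp == 0x200Ful     /* LRM, RLM */
        || cp == 0x2028ul                       /* LINE SEPARATOR */
        || cp == 0x2029ul;                      /* PARAGRAPH SEPARATOR */
}
```

With this predicate, `is_pattern_white_space(0x00A0)` is false: NBSP is not excluded by the Pattern_White_Space rule, which is the flaw described above.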

In addition, it should be noted that [AltId] seems to be unique among standards and recommendations for identifiers, in having no restriction on characters allowed initially.

Combining characters

There is a fairly serious technical reason why an identifier should not start with a combining character: a combining character combines, semantically and visually, with the character preceding it. So disallowing this would prevent potentially serious, gratuitous confusion. Thus TR 10176 (reasonably) recommends that it be disallowed.

But [AltId] imposes no restrictions on the first character of an identifier. On the other hand, [XML2008] disallows (some) general combining characters initially, but not any script-specific combining character.

It should be noted that the C and C++ standards published to date appear to scrupulously disallow all combining characters in identifiers, both general and script-specific. Therefore, it has not previously been necessary to place restrictions on whether to allow them initially. It is difficult to imagine how to continue to avoid this problem using an extended identifier character definition based on [AltId].

Digits

While an identifier starting with a script-specific digit might be confused with a number by a human (who recognizes the digit as such), it will not be so confused by a C or C++ compiler, unless the compiler has been specifically extended to recognize script-specific numerical literals. Thus the problem of initial script-specific digits is less severe than the problem of initial combining characters (which is inherent in the structure of Unicode). TR 10176 does not go so far as to actually recommend that they be disallowed; it only mentions that they might be disallowed.

To date, C++ has never allowed script-specific digits as identifier characters, initially or otherwise. C allows them. While C disallows them initially, the "shall" imposing the requirement is not in a "Constraint" section, which of course means that implementations have not, strictly speaking, been required to diagnose them; instead, an identifier starting with a script-specific digit yields undefined behavior.

Halfwidth and fullwidth variants

This is the tip of a very large iceberg; see UAX#15 (if you dare) for much, much more information. Basically, width variations are just one of a dozen or so ways in which one character, or sequence of characters, can be confused for another. Unicode defines four different normalization forms, which can be used to resolve these confusions. The fact that there are multiple normalization forms can reasonably be taken as an indication of the complexity of the situation.

Any standard (including COBOL and TR 10176) that talks about width variations, and no other form of canonical or compatibility equivalence, is very likely demonstrating a potent combination of hubris and ignorance.

I think the C and C++ standards should be silent on this whole topic. An implementer should be able to decide whether his implementation should normalize or not, and if so which normalization form should be used, based on his understanding of the needs of his customers. The implication of that would be that users should never name different things using identifiers that would normalize to the same string, nor attempt to reference something using anything but its exact name (for example, by using a name that would normalize to the same string as the original name).
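
As an illustration of what "exact name" means in practice, here is a minimal sketch of an implementation that does not normalize: two identifiers match only if their code point sequences are identical. The function and array names are hypothetical. The two arrays spell the same visible word, once with the precomposed character 00E9 and once with the letter 0065 followed by the combining acute accent 0301, which are canonically equivalent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Exact-name comparison with no normalization: identifiers match only
   if they have the same length and the same code points in order. */
static bool same_identifier_exact(const unsigned long *a, size_t na,
                                  const unsigned long *b, size_t nb)
{
    assert(a != NULL && b != NULL);
    return na == nb && memcmp(a, b, na * sizeof *a) == 0;
}

/* "caf" followed by precomposed e-acute (00E9). */
static const unsigned long cafe_precomposed[] = { 0x63, 0x61, 0x66, 0x00E9 };

/* "cafe" followed by COMBINING ACUTE ACCENT (0301). */
static const unsigned long cafe_decomposed[] = { 0x63, 0x61, 0x66, 0x65, 0x0301 };
```

A non-normalizing implementation treats these as two different names; a normalizing one may treat them as the same name. A program that relies on either outcome is exactly what the paragraph above advises users to avoid.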

Recommendations

The definition of the ranges of UCNs allowed in an identifier should appear in an annex in each standard; the text of these two annexes should be identical. (The wording of the citations of these annexes will be different between the two standards.)

The set of UCNs disallowed in identifiers in C and C++ should exactly match the specification in [AltId], with the following additions: all characters in the Basic Latin (i.e. ASCII, basic source character) block, and all characters in the Unicode General Category "Separator, space".

General combining characters, appearing in blocks dedicated to that purpose, should be disallowed as the initial character of an identifier. (Script-specific combining characters would be allowed initially, only because of the additional complexity and instability of specifying them.)

There should be no restriction on script-specific digits initially in an identifier.
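
The initial-character rules recommended above can be sketched as a check over a sequence of code points. This is a minimal illustration under stated assumptions, not proposed wording: all names are hypothetical, the general combining blocks are taken from the ranges listed in subclause X.2 of the annex proposed in this paper, and the question of which extended characters are allowed at all is delegated to a caller-supplied predicate (a permissive stand-in is provided for demonstration only).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: is this extended (non-ASCII) code point allowed at all? */
typedef bool (*extended_char_pred)(unsigned long cp);

/* Stand-in predicate for illustration only: accepts everything above ASCII. */
static bool allow_any_extended(unsigned long cp)
{
    return cp > 0x7Ful && cp <= 0x10FFFFul;
}

/* Blocks dedicated to general combining characters (subclause X.2). */
static bool in_general_combining_block(unsigned long cp)
{
    return (cp >= 0x0300ul && cp <= 0x036Ful)
        || (cp >= 0x1DC0ul && cp <= 0x1DFFul)
        || (cp >= 0x20D0ul && cp <= 0x20FFul)
        || (cp >= 0xFE20ul && cp <= 0xFE2Ful);
}

/* Basic source character set: letters and underscore. */
static bool is_basic_nondigit(unsigned long cp)
{
    return (cp >= 0x41ul && cp <= 0x5Aul) || (cp >= 0x61ul && cp <= 0x7Aul)
        || cp == 0x5Ful;
}

/* Basic source character set: digits 0-9. */
static bool is_basic_digit(unsigned long cp)
{
    return cp >= 0x30ul && cp <= 0x39ul;
}

/* Returns true if the code point sequence has the shape of an identifier:
   basic digits and general combining characters are rejected initially,
   while script-specific digits are not treated specially. */
static bool identifier_shape_ok(const unsigned long *cps, size_t n,
                                extended_char_pred allowed)
{
    assert(cps != NULL && allowed != NULL);
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        unsigned long cp = cps[i];
        if (is_basic_nondigit(cp))
            continue;
        if (is_basic_digit(cp)) {
            if (i == 0)
                return false;
            continue;
        }
        if (cp <= 0x7Ful || !allowed(cp))
            return false;
        if (i == 0 && in_general_combining_block(cp))
            return false;
    }
    return true;
}
```

For example, a sequence beginning with 0301 (COMBINING ACUTE ACCENT) is rejected, while one beginning with 0661 (ARABIC-INDIC DIGIT ONE) is accepted, matching the two recommendations.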

Proposed wording

The annex specifying the ranges of allowed identifier characters should be Annex D in the C standard, and Annex E in the C++ standard. It should have two sub-clauses: one for allowed identifier characters, and one for characters disallowed initially.

Annex

X. Universal character names for identifier characters (normative)
X.1 Ranges of characters allowed
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD, 60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD, B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
X.2 Ranges of characters disallowed initially
0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F
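
For readers who want to experiment with these lists, the two subclauses can be transcribed into lookup tables. The following is a minimal sketch, assuming the ranges exactly as printed above; the type and function names are hypothetical, and the fourteen supplementary-plane ranges (10000-1FFFD through E0000-EFFFD) are expressed as a computation rather than listed one by one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One inclusive range of ISO/IEC 10646 code points. */
struct ucn_range { unsigned long lo, hi; };

/* X.1, BMP part: ranges of characters allowed. */
static const struct ucn_range allowed_bmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* X.2: ranges of characters disallowed initially. */
static const struct ucn_range disallowed_initially[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

/* Linear scan over a table of inclusive ranges. */
static bool in_ranges(unsigned long cp, const struct ucn_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(r[i].lo <= r[i].hi);   /* sanity check on the transcription */
        if (cp >= r[i].lo && cp <= r[i].hi)
            return true;
    }
    return false;
}

/* True if a UCN designating cp may appear in an identifier (X.1). */
static bool ucn_allowed(unsigned long cp)
{
    if (cp >= 0x10000ul)   /* planes 1 to 14, minus the last two positions of each */
        return cp <= 0xEFFFDul && (cp & 0xFFFFul) <= 0xFFFDul;
    return in_ranges(cp, allowed_bmp, sizeof allowed_bmp / sizeof allowed_bmp[0]);
}

/* True if a UCN designating cp may not start an identifier (X.2). */
static bool ucn_disallowed_initially(unsigned long cp)
{
    return in_ranges(cp, disallowed_initially,
                     sizeof disallowed_initially / sizeof disallowed_initially[0]);
}
```

Note that every range in X.2 lies inside a range of X.1, so a combining character such as 0301 is allowed in an identifier, just not as its first character. The gaps at 1680 and 180E correspond to the script-specific spaces in the Ogham and Mongolian blocks.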

Citing the annex from C

Change 6.4.2.1p3:

Each universal character name in an identifier shall designate a character whose encoding in ISO/IEC 10646 falls into one of the ranges specified in annex D, subclause D.1.⁷¹⁾ The initial character shall not be a universal character name designating a ~~digit~~ character whose encoding falls into one of the ranges specified in subclause D.2. An implementation may allow multibyte characters that are not part of the basic source character set to appear in identifiers; which characters and their correspondence to universal character names is implementation-defined.

Citing the annex from C++

Change 2.11p1:

An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in ~~Annex A of TR 10176:2003~~ annex E, subclause E.1. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in subclause E.2. Upper- and lower-case letters are different. All characters are significant.¹⁹

It may also be appropriate to delete the normative reference to TR 10176:2003. Even though the standard will (substantially) follow its recommendations for extended identifier characters, there will remain no actual reference to it.

References

[AltId]: Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax, "Alternative Identifier Syntax", http://www.unicode.org/reports/tr31/tr31-11.html#Alternative_Identifier_Syntax
[DefId]: Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax, "Default Identifier Syntax", http://www.unicode.org/reports/tr31/tr31-11.html#Default_Identifier_Syntax
[TR2003]: ISO/IEC TR 10176:2003 (WG 20 DTR ballot draft), http://www.open-std.org/JTC1/sc22/WG20/docs/n970-tr10176-2002.pdf
[TR2003a]: Online text of identifier character repertoire recommended by [TR2003], http://www.iso.org/ittf/ISOIEC_TR_10176_2003_Table.txt
[XML2006]: Extensible Markup Language (XML) 1.0 (Fourth Edition), "Common Syntactic Constructs", http://www.w3.org/TR/2006/REC-xml-20060816/#sec-common-syn
[XML2008]: Extensible Markup Language (XML) 1.0 (Fifth Edition), "Common Syntactic Constructs", http://www.w3.org/TR/2008/REC-xml-20081126/#sec-common-syn

Appendix: [AltId] and [XML2008] illustrated

The Basic Multilingual Plane

Legend: Characters disallowed by [AltId] are indicated by a box. Characters disallowed by [XML2008] are indicated by a gray background.

Every block is presented, most only by name. Certain blocks (especially including those corresponding to the ASCII and Latin-1 "legacy encodings") have significant numbers of punctuation and symbol characters; these are presented character by character. Each assigned, non-control character appears in the HTML source as the appropriate character entity; if it doesn't display correctly in your browser, the fault is almost certainly in your browser setup. The formal name of each character is also present as an HTML title, which will hopefully pop up if you hover your mouse pointer over the character.

It should be noted that [AltId] is unique in having no distinction between characters allowed initially and non-initially in an identifier. Where [XML2008] makes such a distinction, it is indicated below as a note. There are also notes for a few isolated (plausible) non-identifier characters.

0000	Basic Latin
0000
0010
0020		!	"	#	$	%	&	'	(	)	*	+	,	-	.	/	XML disallows - and . initially
0030	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?	XML allows : initially, disallows digits initially
0040	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
0050	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
0060	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
0070	p	q	r	s	t	u	v	w	x	y	z	{	\|	}	~
0080	Latin-1 Supplement
0080
0090
00A0		¡	¢	£	¤	¥	¦	§	¨	©	ª	«	¬		®	¯
00B0	°	±	²	³	´	µ	¶	·	¸	¹	º	»	¼	½	¾	¿	XML disallows · initially
00C0	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
00D0	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
00E0	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
00F0	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ

0100	Latin Extended-A
0180 0200	Latin Extended-B
0250	IPA Extensions, Spacing Modifier Letters
0300	Combining Diacritical Marks	XML disallows these initially
0370	Greek and Coptic	XML specifically disallows ; (037E, GREEK QUESTION MARK)
0400	Cyrillic
0500	Cyrillic Supplement, Armenian, Hebrew
0600	Arabic
0700	Syriac, Arabic Supplement, Thaana, NKo
0800	Samaritan
0900	Devanagari, Bengali
0A00	Gurmukhi, Gujarati
0B00	Oriya, Tamil
0C00	Telugu, Kannada
0D00	Malayalam, Sinhala
0E00	Thai, Lao
0F00	Tibetan
1000	Myanmar, Georgian
1100	Hangul Jamo
1200 1300	Ethiopic
1380	Ethiopic Supplement, Cherokee
1400 1600	Unified Canadian Aboriginal Syllabics
1680	Ogham, Runic	The Ogham block contains a script-specific space: 1680 (OGHAM SPACE MARK)
1700	Tagalog, Hanunoo, Buhid, Tagbanwa, Khmer
1800	Mongolian, Unified Canadian Aboriginal Syllabics Extended	The Mongolian block contains a script-specific space: 180E (MONGOLIAN VOWEL SEPARATOR)
1900	Limbu, Tai Le, New Tai Lue, Khmer Symbols
1A00	Buginese, Tai Tham
1B00	Balinese, Sundanese
1C00	Lepcha, Ol Chiki, Vedic Extensions
1D00	Phonetic Extensions, Phonetic Extensions Supplement
1DC0	Combining Diacritical Marks Supplement	XML does not disallow these initially
1E00	Latin Extended Additional
1F00	Greek Extended

2000	General Punctuation
2000													‌	‍	‎	‏
2010	‐	‑	‒	–	—	―	‖	‗	‘	’	‚	‛	“	”	„	‟
2020	†	‡	•	‣	․	‥	…	‧			‪	‫	‬	‭	‮
2030	‰	‱	′	″	‴	‵	‶	‷	‸	‹	›	※	‼	‽	‾	‿	XML disallows ‿ initially
2040	⁀	⁁	⁂	⁃	⁄	⁅	⁆	⁇	⁈	⁉	⁊	⁋	⁌	⁍	⁎	⁏	XML disallows ⁀ initially
2050	⁐	⁑	⁒	⁓	⁔	⁕	⁖	⁗	⁘	⁙	⁚	⁛	⁜	⁝	⁞
2060	⁠	⁡	⁢	⁣	⁤						⁪	⁫	⁬	⁭	⁮	⁯

2070	Superscripts and Subscripts
20A0	Currency Symbols
20D0	Combining Diacritical Marks for Symbols	XML does not disallow these initially
2100	Letterlike Symbols
2150	Number Forms
2190	Arrows
2200	Mathematical Operators
2300	Miscellaneous Technical
2400	Control Pictures
2440	Optical Character Recognition
2460	Enclosed Alphanumerics
2500	Box Drawing
2580	Block Elements
25A0	Geometric Shapes
2600	Miscellaneous Symbols
2700	Dingbats
2776	Dingbats (circled digits)
2794	Dingbats
27C0	Miscellaneous Mathematical Symbols-A
27F0	Supplemental Arrows-A
2800	Braille Patterns
2900	Supplemental Arrows-B
2980	Miscellaneous Mathematical Symbols-B
2A00	Supplemental Mathematical Operators
2B00	Miscellaneous Symbols and Arrows

2C00	Glagolitic, Latin Extended-C, Coptic
2D00	Georgian Supplement, Tifinagh, Ethiopic Extended, Cyrillic Extended-A
2E00	Supplemental Punctuation
2E80	CJK Radicals Supplement, Kangxi Radicals
2FF0	Ideographic Description

3000	CJK Symbols and Punctuation
3000		、	。	〃	〄	々	〆	〇	〈	〉	《	》	「	」	『	』
3010	【	】	〒	〓	〔	〕	〖	〗	〘	〙	〚	〛	〜	〝	〞	〟
3020	〠	〡	〢	〣	〤	〥	〦	〧	〨	〩	〪	〫	〬	〭	〮	〯
3030	〰	〱	〲	〳	〴	〵	〶	〷	〸	〹	〺	〻	〼	〽	〾	〿

3040	Hiragana, Katakana
3100	Bopomofo, Hangul Compatibility Jamo, Kanbun, Bopomofo Extended, CJK Strokes, Katakana Phonetic Extensions
3200	Enclosed CJK Letters and Months
3300	CJK Compatibility
3400 4D00	CJK Unified Ideographs Extension A
4DC0	Yijing Hexagram Symbols
4E00 9F00	CJK Unified Ideographs
A000 A400	Yi Syllables
A490	Yi Radicals, Lisu
A500 A600	Vai
A640	Cyrillic Extended-B, Bamum
A700	Modifier Tone Letters, Latin Extended-D
A800	Syloti Nagri, Common Indic Number Forms, Phags-pa, Saurashtra, Devanagari Extended
A900	Kayah Li, Rejang, Hangul Jamo Extended-A, Javanese
AA00	Cham, Myanmar Extended-A, Tai Viet
AB00	Meetei Mayek
AC00 D700	Hangul Syllables
D7B0	Hangul Jamo Extended-B
D800 DB00	High Surrogates
DB80	High Private Use Surrogates
DC00 DF00	Low Surrogates
E000 F800	Private Use Area
F900 FA00	CJK Compatibility Ideographs
FB00	Alphabetic Presentation Forms
FB50 FD00	Arabic Presentation Forms-A
FDD0	(non-characters)
FDF0	Arabic Presentation Forms-A	This block contains Pattern_Syntax characters ﴾ and ﴿
FE00	Variation Selectors, Vertical Forms
FE20	Combining Half Marks	XML does not disallow these initially
FE30	CJK Compatibility Forms	This block also contains Pattern_Syntax characters ﹅ and ﹆
FE50	Small Form Variants, Arabic Presentation Forms-B
FF00	Halfwidth and Fullwidth Forms

	Specials
FFF0														�

Beyond the BMP

The Supplementary Private Use Area extends from F0000 through 10FFFF; both [AltId] and [XML2008] disallow characters in that range.

In addition, [AltId] disallows, as non-characters, the last two code positions of each plane, i.e. every position of the form PFFFE or PFFFF, for any value of P.

Otherwise, no character outside the BMP is disallowed as an identifier character by either specification.
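
The three statements above reduce to simple bit tests. The following is a minimal sketch of them, with hypothetical function names; the two `..._beyond_bmp` functions are meant only for code points above FFFF and say nothing about the BMP.

```c
#include <assert.h>
#include <stdbool.h>

/* True for the last two code positions of any plane: PFFFE or PFFFF. */
static bool is_plane_final_noncharacter(unsigned long cp)
{
    return cp <= 0x10FFFFul && (cp & 0xFFFEul) == 0xFFFEul;
}

/* True for the Supplementary Private Use Area, F0000 through 10FFFF. */
static bool is_supplementary_private_use(unsigned long cp)
{
    return cp >= 0xF0000ul && cp <= 0x10FFFFul;
}

/* [AltId] outside the BMP: private use and non-characters are disallowed. */
static bool altid_disallows_beyond_bmp(unsigned long cp)
{
    assert(cp > 0xFFFFul && cp <= 0x10FFFFul);   /* supplementary planes only */
    return is_supplementary_private_use(cp) || is_plane_final_noncharacter(cp);
}

/* [XML2008] outside the BMP: only the private use range is disallowed. */
static bool xml2008_disallows_beyond_bmp(unsigned long cp)
{
    assert(cp > 0xFFFFul && cp <= 0x10FFFFul);   /* supplementary planes only */
    return is_supplementary_private_use(cp);
}
```

The only code points on which the two specifications differ outside the BMP are therefore the plane-final non-characters of planes 1 through 14, such as 1FFFE and 2FFFF.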
