PEP 3131 – Supporting Non-ASCII Identifiers

Author: Martin von Löwis <martin at v.loewis.de>
Status:

Abstract

This PEP proposes supporting non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Python code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. Using identifiers in their native language improves the clarity and maintainability of the code among speakers of that language.

For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have greater difficulty using Latin to write their native words.

Common Objections

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so they have to use characters they cannot type on their keyboards. However, it is the choice of the designer of the library to decide on various constraints for using the library: people may not be able to use the library because they cannot get physical access to the source code (because it is not published), or because licensing prohibits usage, or because the documentation is in a language they cannot understand. A developer wishing to make a library widely available needs to make a number of explicit choices (such as publication, licensing, language of documentation, and language of identifiers). It should always be the choice of the author to make these decisions - not the choice of the language designers.

In particular, projects that want wide usage may want to establish a policy that all identifiers, comments, and documentation are written in English (see the GNU coding style guide for an example of such a policy). Restricting the language to ASCII-only identifiers does not force comments and documentation to be in English, or the identifiers to be English words, so an additional policy is necessary anyway.

Specification of Language Changes

The syntax of identifiers in Python will be based on the Unicode standard annex UAX-31, with elaboration and changes as defined below.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.5. This specification only introduces additional characters from outside the ASCII range. For other characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module.

The identifier syntax is <XID_Start> <XID_Continue>*.
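In Python 3, where non-ASCII identifiers are supported, the built-in `str.isidentifier()` offers a quick way to test a candidate string against the identifier grammar. The following is a minimal illustration; the sample strings are arbitrary choices made for this example and do not come from the PEP. Note that `isidentifier()` also returns True for reserved words such as `class`.

```python
# Sample strings are illustrative assumptions, not taken from the PEP.
samples = [
    "L\u00f6wis",      # Latin letters with a diacritic
    "\u03b1\u03b2",    # Greek letters alpha and beta
    "\u5909\u6570",    # two Kanji characters (general category Lo)
    "_x1",             # the ASCII rules are unchanged
    "1x",              # a digit may continue an identifier, not start one
    "a b",             # whitespace is never part of an identifier
]

# Map each sample to whether Python accepts it as an identifier.
results = {text: text.isidentifier() for text in samples}

for text, ok in results.items():
    print(repr(text), ok)
```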

The exact specification of what characters have the XID_Start or XID_Continue properties can be found in the DerivedCoreProperties file of the Unicode data in use by Python (4.1 at the time this PEP was written). For reference, the construction rules for these sets are given below. The XID_* properties are derived from ID_Start/ID_Continue, which are derived themselves.

ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue* anymore.

ID_Continue is defined as all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carrying the Other_ID_Continue property. Again, XID_Continue closes this set under NFKC-normalization; it also adds U+00B7 to support Catalan.
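The general-category part of these construction rules can be approximated with `unicodedata.category()`. The sketch below is deliberately incomplete: the helper names `approx_id_start` and `approx_id_continue` are hypothetical, and the code ignores the Other_ID_Start and Other_ID_Continue properties as well as the NFKC closure step, so it is not a substitute for the DerivedCoreProperties data.

```python
import unicodedata

# General categories named in the ID_Start rule.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional general categories named in the ID_Continue rule.
CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start; ignores Other_ID_Start and the NFKC closure."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue; ignores Other_ID_Continue and the NFKC closure."""
    return approx_id_start(ch) or unicodedata.category(ch) in CONTINUE_EXTRA


print(approx_id_start("\u00e9"))     # LATIN SMALL LETTER E WITH ACUTE (Ll)
print(approx_id_continue("\u0301"))  # COMBINING ACUTE ACCENT (Mn)
print(approx_id_continue("\u00b7"))  # MIDDLE DOT (Po): missed by this approximation
```

Because U+00B7 has general category Po, the approximation rejects it, which shows why the property-based additions described above are needed.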

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
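The effect of NFKC normalization can be observed with `unicodedata.normalize()`. In the sketch below, the helper `normalize_identifier` and the sample names are assumptions made for illustration. The `exec` call relies on a Python 3 interpreter, which applies this normalization to identifiers while parsing.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFKC form of a name (hypothetical helper for illustration)."""
    return unicodedata.normalize("NFKC", name)


decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE
ligature = "\ufb01le"      # LATIN SMALL LIGATURE FI followed by "le"

# Both spellings of the accented name normalize to the same string.
same_name = normalize_identifier(decomposed) == normalize_identifier(composed)

# Assign through the ligature spelling; the key stored in the namespace
# is the normalized name.
namespace = {}
exec(ligature + " = 1", namespace)

print(same_name, normalize_identifier(ligature), "file" in namespace)
```

One consequence is that two visually different spellings in source code can refer to the same variable.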

A non-normative HTML file listing all valid identifier characters for Unicode 4.1 can be found at https://web.archive.org/web/20081016132748/http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.

Policy Specification

As an addition to the Python Coding style, the following policy is prescribed: All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren’t English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the Latin alphabet MUST provide a Latin transliteration of their names.
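A project that adopts such a policy could check it mechanically. The following is a minimal sketch using the standard `tokenize` module; the function name `non_ascii_names` is hypothetical, and the check covers identifiers only, not the string literals and comments that the policy also restricts.

```python
import io
import tokenize


def non_ascii_names(source: str) -> list[tuple[int, str]]:
    """Return (line number, name) for every NAME token that is not pure ASCII."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


# Illustrative input: the first assignment violates the policy, the second does not.
example = "gr\u00f6\u00dfe = 1\nsize = 2\n"
print(non_ascii_names(example))
```

Extending the same loop to `tokenize.STRING` and `tokenize.COMMENT` tokens would be one way to cover the rest of the policy.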

As an option, this specification can be applied to Python 2.x. In that case, ASCII-only identifiers would continue to be represented as byte string objects in namespace dictionaries; identifiers with non-ASCII characters would be represented as Unicode strings.

Implementation

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the source code, a forward scan is made to find the first ASCII non-identifier character (e.g. a space or punctuation character).
2. The entire UTF-8 string is passed to a function to normalize the string to NFKC, and then verify that it follows the identifier syntax. No such callout is made for pure-ASCII identifiers, which continue to be parsed the way they are today. The Unicode database must start including the Other_ID_{Start|Continue} property.
3. If this specification is implemented for 2.x, reflective libraries (such as pydoc) must be verified to continue to work when Unicode strings appear in __dict__ slots as keys.
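The scan-then-validate steps can be sketched in pure Python. This is an illustration under stated assumptions, not the parser's actual code: the function name `scan_identifier` is hypothetical, it operates on a `str` rather than on UTF-8 bytes, and it uses `str.isidentifier()` as a stand-in for the identifier-syntax check.

```python
import string
import unicodedata

# ASCII characters that may appear inside an identifier.
ASCII_IDENT_CHARS = set(string.ascii_letters + string.digits + "_")


def scan_identifier(source: str, start: int) -> tuple[str, int]:
    """Scan one identifier candidate beginning at `start`.

    Returns the (possibly normalized) name and the index just past it.
    """
    end = start
    # Forward scan: stop at the first ASCII non-identifier character.
    while end < len(source):
        ch = source[end]
        if ord(ch) < 128 and ch not in ASCII_IDENT_CHARS:
            break
        end += 1
    raw = source[start:end]
    if raw.isascii():
        # Pure-ASCII names take the existing code path; no callout is made.
        return raw, end
    # Non-ASCII names are normalized to NFKC, then validated.
    normalized = unicodedata.normalize("NFKC", raw)
    if not normalized.isidentifier():
        raise SyntaxError(f"invalid identifier: {raw!r}")
    return normalized, end


print(scan_identifier("gr\u00f6\u00dfe = 1", 0))
print(scan_identifier("\ufb01le(x)", 0))
```

Since the scan stops only at ASCII delimiters, a non-ASCII symbol such as U+00D7 (multiplication sign) is swept into the candidate and rejected by the validation step rather than treated as a separator.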

Open Issues

John Nagle suggested consideration of Unicode Technical Standard #39, which discusses security mechanisms for Unicode identifiers. It’s not clear how that can precisely apply to this PEP; possible consequences are

warn about characters listed as “restricted” in xidmodifications.txt
warn about identifiers using mixed scripts
somehow perform Confusable Detection

In the latter two approaches, it’s not clear how precisely the algorithm should work. For mixed scripts, certain kinds of mixing should probably be allowed - are these the “Common” and “Inherited” scripts mentioned in section 5? For Confusable Detection, it seems one needs two identifiers to compare them for confusion - is it possible to somehow apply it to a single identifier only, and warn?
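To make the mixed-script question concrete, here is a rough heuristic sketch. The standard `unicodedata` module does not expose the Unicode Script property, so this code assumes that the first word of a letter's character name (for example "LATIN" or "CYRILLIC") is a usable proxy for its script. That assumption is crude and is not the UTS #39 algorithm; the function names are hypothetical.

```python
import unicodedata


def name_prefix_scripts(identifier: str) -> set[str]:
    """Collect the first word of each letter's Unicode name as a script proxy.

    Only letters (general categories L*) are considered, so digits and
    underscores do not count as a separate "script".
    """
    scripts = set()
    for ch in identifier:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(identifier: str) -> bool:
    """Heuristic: more than one name prefix among the letters."""
    return len(name_prefix_scripts(identifier)) > 1


print(looks_mixed("paypal"))        # all Latin letters
print(looks_mixed("p\u0430ypal"))   # second letter is CYRILLIC SMALL LETTER A
```

A heuristic like this would also flag legitimate combinations, such as Japanese names that mix Kanji and Hiragana, which illustrates why the text above asks which kinds of mixing should be allowed.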

In follow-up discussion, it turns out that John Nagle actually meant to suggest UTR#36, level “Highly Restrictive”.

Several people suggested allowing and ignoring formatting control characters (general category Cf), as is done in Java, JavaScript, and C#. It’s not clear whether this would improve things (it might for RTL languages); if there is a need, these can be added later.
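As a small illustration of what "allow and ignore" could mean, the sketch below removes characters of general category Cf before checking a name. The helper `strip_format_controls` is hypothetical and does not describe any behaviour of Python itself; U+200F (RIGHT-TO-LEFT MARK) is used only as an example of a Cf character.

```python
import unicodedata


def strip_format_controls(text: str) -> str:
    """Drop every character whose general category is Cf (format control)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


with_rlm = "a\u200fb"  # 'a', RIGHT-TO-LEFT MARK, 'b'

print(unicodedata.category("\u200f"))                     # general category of the mark
print(with_rlm.isidentifier())                            # as written
print(strip_format_controls(with_rlm).isidentifier())     # after ignoring Cf characters
```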

Some people would like to see an option for selecting support for this PEP at run-time; opinions vary on what precisely that option should be, and what precisely its default value should be. Guido van Rossum commented that a global flag passed to the interpreter is not acceptable, as it would apply to all modules.

Discussion

Ka-Ping Yee summarizes the discussion and further objections as follows:

Should identifiers be allowed to contain any Unicode letter?
Drawbacks of allowing non-ASCII identifiers wholesale:
1. Python will lose the ability to make a reliable round trip to a human-readable display on screen or on paper.
2. Python will become vulnerable to a new class of security exploits; code and submitted patches will be much harder to inspect.
3. Humans will no longer be able to validate Python syntax.
4. Unicode is young; its problems are not yet well understood and solved; tool support is weak.
5. Languages with non-ASCII identifiers use different character sets and normalization schemes; PEP 3131’s choices are non-obvious.
6. The Unicode bidi algorithm yields an extremely confusing display order for RTL text when digits or operators are nearby.

Should the default behaviour accept only ASCII identifiers, or should it accept identifiers containing non-ASCII characters?
Arguments for ASCII only by default:
1. Non-ASCII identifiers by default makes common practice/assumptions subtly/unknowingly wrong; rarely wrong is worse than obviously wrong.
2. Better to raise a warning than to fail silently when encountering a probably unexpected situation.
3. All of current usage is ASCII-only; the vast majority of future usage will be ASCII-only.
4. It is the pockets of Unicode adoption that are parochial, not the ASCII advocates.
5. Python should audit for ASCII-only identifiers for the same reasons that it audits for tab-space consistency.
6. Incremental change is safer.
7. An ASCII-only default favors open-source development and sharing of source code.
8. Existing projects won’t have to waste any brainpower worrying about the implications of Unicode identifiers.

Should non-ASCII identifiers be optional?
Various voices in support of a flag (although there’s been debate over which should be the default, no one seems to be saying that there shouldn’t be an off switch).

Should the identifier character set be configurable?
Various voices proposing and supporting a selectable character set, so that users can get all the benefits of using their own language without the drawbacks of confusable/unfamiliar characters.

Which identifier characters should be allowed?
1. What to do about bidi format control characters?
2. What about other ID_Continue characters? What about characters that look like punctuation? What about other recommendations in UTS #39? What about mixed-script identifiers?

Which normalization form should be used, NFC or NFKC?

Should source code be required to be in normalized form?

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-3131.rst

Last modified: 2025-02-11 05:10:05 GMT