This PEP proposes supporting non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.
Python code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. Using identifiers in their native language improves the clarity and maintainability of the code among speakers of that language.
For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have greater difficulty using Latin to write their native words.
Some objections are often raised against proposals similar to this one.
People claim that they will not be able to use a library if to do so they have to use characters they cannot type on their keyboards. However, it is the choice of the designer of the library to decide on various constraints for using the library: people may not be able to use the library because they cannot get physical access to the source code (because it is not published), or because licensing prohibits usage, or because the documentation is in a language they cannot understand. A developer wishing to make a library widely available needs to make a number of explicit choices (such as publication, licensing, language of documentation, and language of identifiers). It should always be the choice of the author to make these decisions, not the choice of the language designers.
In particular, projects wishing to have wide usage may want to establish a policy that all identifiers, comments, and documentation are written in English (see the GNU coding style guide for an example of such a policy). Restricting the language to ASCII-only identifiers does not force comments and documentation to be in English, or the identifiers actually to be English words, so an additional policy is necessary anyway.
The syntax of identifiers in Python will be based on the Unicode standard annex UAX-31, with elaboration and changes as defined below.
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.5. This specification only introduces additional characters from outside the ASCII range. For other characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module.
The identifier syntax is <XID_Start> <XID_Continue>*.
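The rule above can be tried out in a Python 3 interpreter, where the built-in `str.isidentifier()` reports whether a string matches the language's identifier syntax. The following is a minimal sketch; the helper name `is_usable_name` is hypothetical and not part of this PEP. Because `str.isidentifier()` also accepts reserved words, the sketch excludes them with `keyword.iskeyword()`.

```python
import keyword


def is_usable_name(name: str) -> bool:
    """Return True if `name` is a syntactically valid, non-reserved identifier.

    Hypothetical helper for illustration only.
    """
    # str.isidentifier() checks the <XID_Start> <XID_Continue>* shape;
    # keyword.iskeyword() filters out reserved words such as "class".
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["Straße", "переменная", "x1", "2fast", "a-b", "class"]:
    print(candidate, is_usable_name(candidate))
```

Names that start with a digit or contain punctuation such as a hyphen are rejected, while letters from non-Latin scripts are accepted.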
The exact specification of what characters have the XID_Start or XID_Continue properties can be found in the DerivedCoreProperties file of the Unicode data in use by Python (4.1 at the time this PEP was written). For reference, the construction rules for these sets are given below. The XID_* properties are derived from ID_Start/ID_Continue, which are derived themselves.
ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue* anymore.
ID_Continue is defined as all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carrying the Other_ID_Continue property. Again, XID_Continue closes this set under NFKC-normalization; it also adds U+00B7 to support Catalan.
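The construction rules for ID_Start and ID_Continue can be approximated with `unicodedata.category()`. This is a rough sketch, not the normative definition: it ignores the Other_ID_Start and Other_ID_Continue properties and the NFKC closure step, so it will disagree with the real XID_* sets for some characters (for example, it rejects U+00B7). All function names are hypothetical.

```python
import unicodedata

# General categories listed in the construction rules.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start: the letter categories plus the underscore."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue: ID_Start plus marks, digits, connectors."""
    return (
        approx_id_start(ch)
        or unicodedata.category(ch) in ID_CONTINUE_EXTRA_CATEGORIES
    )


def approx_is_identifier(text: str) -> bool:
    """Check the <start> <continue>* shape using the approximations above."""
    return (
        bool(text)
        and approx_id_start(text[0])
        and all(approx_id_continue(ch) for ch in text[1:])
    )
```

For everyday names the approximation agrees with the interpreter: `approx_is_identifier("変数")` and `approx_is_identifier("_x2")` hold, while `approx_is_identifier("2x")` does not.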
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
A non-normative HTML file listing all valid identifier characters for Unicode 4.1 can be found at https://web.archive.org/web/20081016132748/http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
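The effect of NFKC normalization on identifier comparison can be illustrated with `unicodedata.normalize()`. The sketch below uses a hypothetical helper, `normalize_identifier`, and assumes a Python 3 interpreter, whose parser is expected to apply the same normalization to names in source code.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFKC normal form of an identifier spelling (illustrative)."""
    return unicodedata.normalize("NFKC", name)


# U+FB01 (LATIN SMALL LIGATURE FI) is a compatibility character; NFKC
# replaces it with the two ASCII letters "fi".
ligature_spelling = "\ufb01le"
plain_spelling = "file"
same_identifier = (
    normalize_identifier(ligature_spelling) == normalize_identifier(plain_spelling)
)

# When the ligature spelling is used as a name in source code, the binding
# is stored under the normalized spelling.
namespace = {}
exec(ligature_spelling + " = 1", namespace)
stored_under_plain_name = "file" in namespace
```

Two spellings that differ only by compatibility characters therefore refer to the same name, which is why comparison is defined on the normalized form.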
As an addition to the Python Coding style, the following policy is prescribed: All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren’t English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the Latin alphabet MUST provide a Latin transliteration of their names.
As an option, this specification can be applied to Python 2.x. In that case, ASCII-only identifiers would continue to be represented as byte string objects in namespace dictionaries; identifiers with non-ASCII characters would be represented as Unicode strings.
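A project adopting a similar ASCII-only rule might check the identifier part of it mechanically. The following is one possible sketch using the standard `tokenize` module; the function name `non_ascii_identifiers` is hypothetical, `str.isascii()` assumes Python 3.7 or later, and the check covers only identifiers, not string literals or comments.

```python
import io
import tokenize


def non_ascii_identifiers(source: str) -> list:
    """Return (line_number, name) pairs for identifiers with non-ASCII characters.

    Hypothetical helper; it inspects NAME tokens only.
    """
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


sample = "x = 1\nгод = 2\n"
print(non_ascii_identifiers(sample))
```

Such a check could run in a test suite or a pre-commit hook, with the exceptions named in the policy (tests of non-ASCII features) excluded by path.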
The following changes will need to be made to the parser:
__dict__ slots as keys.
John Nagle suggested consideration of Unicode Technical Standard #39, which discusses security mechanisms for Unicode identifiers. It’s not clear how that can precisely apply to this PEP; possible consequences are
In the latter two approaches, it’s not clear how precisely the algorithm should work. For mixed scripts, certain kinds of mixing should probably be allowed - are these the “Common” and “Inherited” scripts mentioned in section 5? For Confusable Detection, it seems one needs two identifiers to compare them for confusion - is it possible to somehow apply it to a single identifier only, and warn?
In follow-up discussion, it turns out that John Nagle actually meant to suggest UTR#36, level “Highly Restrictive”.
Several people suggested allowing and ignoring formatting control characters (general category Cf), as is done in Java, JavaScript, and C#. It’s not clear whether this would improve things (it might for RTL languages); if there is a need, these can be added later.
Some people would like to see an option for selecting support for this PEP at run-time; opinions vary on what precisely that option should be, and what precisely its default value should be. Guido van Rossum commented that a global flag passed to the interpreter is not acceptable, as it would apply to all modules.
Ka-Ping Yee summarizes the discussion and further objections as follows:
Drawbacks of allowing non-ASCII identifiers wholesale:
Arguments for ASCII only by default:
Various voices in support of a flag (although there’s been debate over which should be the default, no one seems to be saying that there shouldn’t be an off switch)
Various voices proposing and supporting a selectable character set, so that users can get all the benefits of using their own language without the drawbacks of confusable/unfamiliar characters
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-3131.rst
Last modified: 2025-02-11 05:10:05 GMT