Commit 2ff8608

gh-135676: Simplify docs on lexing names (GH-140464)

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

- parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
- normalizes the name
- validates the name, using the xid_start/xid_continue sets

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>
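The three-step mental model described in the commit message (match, normalize, validate) can be sketched in Python. This is only an illustration of the model, not CPython's tokenizer: `NAME_RE` and `lex_name` are hypothetical names, `str.isidentifier()` stands in for the xid_start/xid_continue check, and keywords are not handled.

```python
import re
import unicodedata

# Step 1: a name starts with an ASCII letter, "_" or any non-ASCII character,
# and continues with the same characters plus ASCII digits.
NAME_RE = re.compile(r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z0-9_\u0080-\U0010FFFF]*")


def lex_name(source, pos=0):
    """Return the normalized name starting at ``pos``, or None if no name starts there.

    Hypothetical helper: raises SyntaxError when the matched text is not a
    valid name after normalization.
    """
    match = NAME_RE.match(source, pos)
    if match is None:
        return None
    raw = match.group()
    # Step 2: normalize the matched text to NFKC.
    normalized = unicodedata.normalize("NFKC", raw)
    # Step 3: validate the normalized name.
    if not normalized.isidentifier():
        raise SyntaxError(f"invalid character in name {raw!r}")
    return normalized


print(lex_name("\ufb01nal = 1"))  # "\ufb01" is the "fi" ligature; prints: final
```

Under this sketch, a currency sign such as `"\u20ac"` is first accepted as part of a name by the regular expression and only rejected in the validation step, which mirrors the order of steps given in the commit message.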

1 parent c359ea4, commit 2ff8608

File tree

1 file changed, +103 -58 lines changed

Doc/reference/lexical_analysis.rst

`‎Doc/reference/lexical_analysis.rst‎`

Lines changed: 103 additions & 58 deletions

Original file line number	Diff line number	Diff line change
`@@ -386,73 +386,29 @@ Names (identifiers and keywords)`
`386`	`386`	:data:`~token.NAME` tokens represent identifiers, keywords, and
`387`	`387`	`soft keywords.`
`388`	`388`
`389`		`-Within the ASCII range (U+0001..U+007F), the valid characters for names`
`390`		-include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
`391`		-the underscore ``_`` and, except for the first character, the digits
`392`		-``0`` through ``9``.
	`389`	`+Names are composed of the following characters:`
	`390`	`+`
	`391`	+* uppercase and lowercase letters (``A-Z`` and ``a-z``),
	`392`	+* the underscore (``_``),
	`393`	+* digits (``0`` through ``9``), which cannot appear as the first character, and
	`394`	`+* non-ASCII characters. Valid names may only contain "letter-like" and`
	`395`	+ "digit-like" characters; see :ref:`lexical-names-nonascii` for details.
`393`	`396`
`394`	`397`	`Names must contain at least one character, but have no upper length limit.`
`395`	`398`	`Case is significant.`
`396`	`399`
`397`		-Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
`398`		`-and "number-like" characters from outside the ASCII range, as detailed below.`
`399`		`-`
`400`		-All identifiers are converted into the `normalization form`_ NFKC while
`401`		`-parsing; comparison of identifiers is based on NFKC.`
`402`		`-`
`403`		`-Formally, the first character of a normalized identifier must belong to the`
`404`		-set ``id_start``, which is the union of:
`405`		`-`
`406`		-* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
`407`		-* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
`408`		-* Unicode category ``<Lt>`` - titlecase letters
`409`		-* Unicode category ``<Lm>`` - modifier letters
`410`		-* Unicode category ``<Lo>`` - other letters
`411`		-* Unicode category ``<Nl>`` - letter numbers
`412`		-* {``"_"``} - the underscore
`413`		-* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
`414`		`- to support backwards compatibility`
`415`		`-`
`416`		-The remaining characters must belong to the set ``id_continue``, which is the
`417`		`-union of:`
`418`		`-`
`419`		-* all characters in ``id_start``
`420`		-* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
`421`		-* Unicode category ``<Pc>`` - connector punctuations
`422`		-* Unicode category ``<Mn>`` - nonspacing marks
`423`		-* Unicode category ``<Mc>`` - spacing combining marks
`424`		-* ``<Other_ID_Continue>`` - another explicit set of characters in
`425`		- `PropList.txt`_ to support backwards compatibility
`426`		`-`
`427`		`-Unicode categories use the version of the Unicode Character Database as`
`428`		-included in the :mod:`unicodedata` module.
`429`		`-`
`430`		-These sets are based on the Unicode standard annex `UAX-31`_.
`431`		-See also :pep:`3131` for further details.
`432`		`-`
`433`		`-Even more formally, names are described by the following lexical definitions:`
	`400`	`+Formally, names are described by the following lexical definitions:`
`434`	`401`
`435`	`402`	`.. grammar-snippet::`
`436`	`403`	`   :group: python-grammar`
`437`	`404`
`438`		- NAME: `xid_start` `xid_continue`*
`439`		`- id_start: <Lu> \| <Ll> \| <Lt> \| <Lm> \| <Lo> \| <Nl> \| "_" \| <Other_ID_Start>`
`440`		- id_continue: `id_start` \| <Nd> \| <Pc> \| <Mn> \| <Mc> \| <Other_ID_Continue>
`441`		- xid_start: <all characters in `id_start` whose NFKC normalization is
`442`		- in (`id_start` `xid_continue`*)">
`443`		- xid_continue: <all characters in `id_continue` whose NFKC normalization is
`444`		- in (`id_continue`*)">
`445`		- identifier: <`NAME`, except keywords>
	`405`	+ NAME: `name_start` `name_continue`*
	`406`	`+ name_start: "a"..."z" \| "A"..."Z" \| "_" \| <non-ASCII character>`
	`407`	`+ name_continue: name_start \| "0"..."9"`
	`408`	+ identifier: <`NAME`, except keywords>
`446`	`409`
`447`		`-A non-normative listing of all valid identifier characters as defined by`
`448`		-Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
`449`		`-Character Database.`
`450`		`-`
`451`		`-`
`452`		`-.. _UAX-31: https://www.unicode.org/reports/tr31/`
`453`		`-.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt`
`454`		`-.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt`
`455`		`-.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms`
	`410`	`+Note that not all names matched by this grammar are valid; see`
	`411`	+:ref:`lexical-names-nonascii` for details.
`456`	`412`
`457`	`413`
`458`	`414`	`.. _keywords:`
`@@ -555,6 +511,95 @@ characters:`
`555`	`511`	:ref:`atom-identifiers`.
`556`	`512`
`557`	`513`
	`514`	`+.. _lexical-names-nonascii:`
	`515`	`+`
	`516`	`+Non-ASCII characters in names`
	`517`	`+-----------------------------`
	`518`	`+`
	`519`	`+Names that contain non-ASCII characters need additional normalization`
	`520`	`+and validation beyond the rules and grammar explained`
	`521`	+:ref:`above<identifiers>`.
	`522`	+For example, ``ř_1``, ``蛇``, or ``साँप`` are valid names, but ``r〰2``,
	`523`	+``€``, or ``🐍`` are not.
	`524`	`+`
	`525`	`+This section explains the exact rules.`
	`526`	`+`
	`527`	+All names are converted into the `normalization form`_ NFKC while parsing.
	`528`	`+This means that, for example, some typographic variants of characters are`
	`529`	+converted to their "basic" form. For example, ``ﬁⁿₐˡᵢᶻₐᵗᵢᵒₙ`` normalizes to
	`530`	+``finalization``, so Python treats them as the same name::
	`531`	`+`
	`532`	`+ >>> ﬁⁿₐˡᵢᶻₐᵗᵢᵒₙ = 3`
	`533`	`+ >>> finalization`
	`534`	`+ 3`
	`535`	`+`
	`536`	`+.. note::`
	`537`	`+`
	`538`	`+ Normalization is done at the lexical level only.`
	`539`	`+ Run-time functions that take names as strings generally do not normalize`
	`540`	`+ their arguments.`
	`541`	`+ For example, the variable defined above is accessible at run time in the`
	`542`	+ :func:`globals` dictionary as ``globals()["finalization"]`` but not
	`543`	+ ``globals()["ﬁⁿₐˡᵢᶻₐᵗᵢᵒₙ"]``.
	`544`	`+`
	`545`	`+Similarly to how ASCII-only names must contain only letters, digits and`
	`546`	`+the underscore, and cannot start with a digit, a valid name must`
	`547`	+start with a character in the "letter-like" set ``xid_start``,
	`548`	`+and the remaining characters must be in the "letter- and digit-like" set`
	`549`	+``xid_continue``.
	`550`	`+`
	`551`	`+These sets are based on the XID_Start and XID_Continue sets as defined by the`
	`552`	+Unicode standard annex `UAX-31`_.
	`553`	+Python's ``xid_start`` additionally includes the underscore (``_``).
	`554`	+Note that Python does not necessarily conform to `UAX-31`_.
	`555`	`+`
	`556`	`+A non-normative listing of characters in the XID_Start and XID_Continue`
	`557`	+sets as defined by Unicode is available in the `DerivedCoreProperties.txt`_
	`558`	`+file in the Unicode Character Database.`
	`559`	+For reference, the construction rules for the ``xid_*`` sets are given below.
	`560`	`+`
	`561`	+The set ``id_start`` is defined as the union of:
	`562`	`+`
	`563`	+* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
	`564`	+* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
	`565`	+* Unicode category ``<Lt>`` - titlecase letters
	`566`	+* Unicode category ``<Lm>`` - modifier letters
	`567`	+* Unicode category ``<Lo>`` - other letters
	`568`	+* Unicode category ``<Nl>`` - letter numbers
	`569`	+* {``"_"``} - the underscore
	`570`	+* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
	`571`	`+ to support backwards compatibility`
	`572`	`+`
	`573`	+The set ``xid_start`` then closes this set under NFKC normalization, by
	`574`	`+removing all characters whose normalization is not of the form`
	`575`	+``id_start id_continue*``.
	`576`	`+`
	`577`	+The set ``id_continue`` is defined as the union of:
	`578`	`+`
	`579`	+* ``id_start`` (see above)
	`580`	+* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
	`581`	+* Unicode category ``<Pc>`` - connector punctuations
	`582`	+* Unicode category ``<Mn>`` - nonspacing marks
	`583`	+* Unicode category ``<Mc>`` - spacing combining marks
	`584`	+* ``<Other_ID_Continue>`` - another explicit set of characters in
	`585`	+ `PropList.txt`_ to support backwards compatibility
	`586`	`+`
	`587`	+Again, ``xid_continue`` closes this set under NFKC normalization.
	`588`	`+`
	`589`	`+Unicode categories use the version of the Unicode Character Database as`
	`590`	+included in the :mod:`unicodedata` module.
	`591`	`+`
	`592`	`+.. _UAX-31: https://www.unicode.org/reports/tr31/`
	`593`	`+.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt`
	`594`	`+.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt`
	`595`	`+.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms`
	`596`	`+`
	`597`	`+.. seealso::`
	`598`	`+`
	`599`	+ * :pep:`3131` -- Supporting Non-ASCII Identifiers
	`600`	+ * :pep:`672` -- Unicode-related Security Considerations for Python
	`601`	`+`
	`602`	`+`
`558`	`603`	`.. _literals:`
`559`	`604`
`560`	`605`	`Literals`
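The note added in this diff, that normalization happens at the lexical level only and that run-time functions taking names as strings generally do not normalize them, can be checked with a short sketch. It uses `exec` with an explicit dictionary in place of `globals()`, and the `lookup` helper is a hypothetical convenience that is not part of the documentation change.

```python
import unicodedata

namespace = {}
# "\ufb01" is the single-character "fi" ligature; the compiler normalizes
# the name in the source text to NFKC, giving "final".
exec("\ufb01nal = 3", namespace)

# Only the normalized spelling exists as a key at run time.
found_normalized = "final" in namespace
found_raw = "\ufb01nal" in namespace
print(found_normalized, found_raw)


def lookup(name, ns):
    """Hypothetical helper: normalize a name the way the lexer does, then look it up."""
    return ns[unicodedata.normalize("NFKC", name)]


print(lookup("\ufb01nal", namespace))
```

Code that receives names as strings from users can apply `unicodedata.normalize("NFKC", ...)` itself, as `lookup` does, if it wants to match what the parser stored.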
