Commit 2ff8608

encukou, StanFromIreland, blaisep, MichaByte, and KeithTheEE authored

gh-135676: Simplify docs on lexing names (GH-140464)

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

- parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
- normalizes the name
- validates the name, using the xid_start/xid_continue sets

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>

1 parent c359ea4, commit 2ff8608

‎Doc/reference/lexical_analysis.rst‎

Lines changed: 103 additions & 58 deletions
@@ -386,73 +386,29 @@ Names (identifiers and keywords)
386386
:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
387387
*soft keywords*.
388388

389-
Within the ASCII range (U+0001..U+007F), the valid characters for names
390-
include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
391-
the underscore ``_`` and, except for the first character, the digits
392-
``0`` through ``9``.
389+
Names are composed of the following characters:
390+
391+
* uppercase and lowercase letters (``A-Z`` and ``a-z``),
392+
* the underscore (``_``),
393+
* digits (``0`` through ``9``), which cannot appear as the first character, and
394+
* non-ASCII characters. Valid names may only contain "letter-like" and
395+
  "digit-like" characters; see :ref:`lexical-names-nonascii` for details.
393396

394397
Names must contain at least one character, but have no upper length limit.
395398
Case is significant.
396399

397-
Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
398-
and "number-like" characters from outside the ASCII range, as detailed below.
399-
400-
All identifiers are converted into the `normalization form`_ NFKC while
401-
parsing; comparison of identifiers is based on NFKC.
402-
403-
Formally, the first character of a normalized identifier must belong to the
404-
set ``id_start``, which is the union of:
405-
406-
* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
407-
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
408-
* Unicode category ``<Lt>`` - titlecase letters
409-
* Unicode category ``<Lm>`` - modifier letters
410-
* Unicode category ``<Lo>`` - other letters
411-
* Unicode category ``<Nl>`` - letter numbers
412-
* {``"_"``} - the underscore
413-
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
414-
  to support backwards compatibility
415-
416-
The remaining characters must belong to the set ``id_continue``, which is the
417-
union of:
418-
419-
* all characters in ``id_start``
420-
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
421-
* Unicode category ``<Pc>`` - connector punctuations
422-
* Unicode category ``<Mn>`` - nonspacing marks
423-
* Unicode category ``<Mc>`` - spacing combining marks
424-
* ``<Other_ID_Continue>`` - another explicit set of characters in
425-
  `PropList.txt`_ to support backwards compatibility
426-
427-
Unicode categories use the version of the Unicode Character Database as
428-
included in the :mod:`unicodedata` module.
429-
430-
These sets are based on the Unicode standard annex `UAX-31`_.
431-
See also :pep:`3131` for further details.
432-
433-
Even more formally, names are described by the following lexical definitions:
400+
Formally, names are described by the following lexical definitions:
434401

435402
.. grammar-snippet::
436403
   :group: python-grammar
437404

438-
   NAME: `xid_start` `xid_continue`*
439-
   id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
440-
   id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
441-
   xid_start: <all characters in `id_start` whose NFKC normalization is
442-
      in (`id_start` `xid_continue`*)">
443-
   xid_continue: <all characters in `id_continue` whose NFKC normalization is
444-
      in (`id_continue`*)">
445-
   identifier: <`NAME`, except keywords>
405+
   NAME: `name_start` `name_continue`*
406+
   name_start: "a"..."z" | "A"..."Z" | "_" | <non-ASCII character>
407+
   name_continue: name_start | "0"..."9"
408+
   identifier: <`NAME`, except keywords>
446409

447-
A non-normative listing of all valid identifier characters as defined by
448-
Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
449-
Character Database.
450-
451-
452-
.. _UAX-31: https://www.unicode.org/reports/tr31/
453-
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
454-
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
455-
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
410+
Note that not all names matched by this grammar are valid; see
411+
:ref:`lexical-names-nonascii` for details.
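
As an illustration only (not part of the commit), the simplified grammar in the hunk above can be sketched as a regular expression. The names ``NAME_CANDIDATE`` and ``matches_name_grammar`` are hypothetical, and "non-ASCII character" is assumed to mean any code point from U+0080 upward. The sketch shows why the note is needed: the grammar accepts strings that are not valid names.

```python
import re

# Lexical-level pattern mirroring the simplified grammar:
#   name_start:    "a"..."z" | "A"..."Z" | "_" | <non-ASCII character>
#   name_continue: name_start | "0"..."9"
# Assumption: "non-ASCII" means any code point from U+0080 upward.
NAME_CANDIDATE = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z0-9_\u0080-\U0010FFFF]*"
)


def matches_name_grammar(text: str) -> bool:
    """Return True if the whole string matches the simplified NAME rule."""
    return NAME_CANDIDATE.fullmatch(text) is not None


# Compare the permissive grammar with the built-in validity check.
for candidate in ["spam_1", "1spam", "ř_1", "🐍"]:
    print(ascii(candidate),
          matches_name_grammar(candidate),
          candidate.isidentifier())
```

The last candidate matches the grammar but is rejected by ``str.isidentifier()``, which is the gap that the additional validation step closes.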
456412

457413

458414
.. _keywords:
@@ -555,6 +511,95 @@ characters:
555511
:ref:`atom-identifiers`.
556512

557513

514+
.. _lexical-names-nonascii:
515+
516+
Non-ASCII characters in names
517+
-----------------------------
518+
519+
Names that contain non-ASCII characters need additional normalization
520+
and validation beyond the rules and grammar explained
521+
:ref:`above<identifiers>`.
522+
For example, ``ř_1``, ````, or ``साँप`` are valid names, but ``r〰2``,
523+
````, or ``🐍`` are not.
524+
525+
This section explains the exact rules.
526+
527+
All names are converted into the `normalization form`_ NFKC while parsing.
528+
This means that, for example, some typographic variants of characters are
529+
converted to their "basic" form. For example, ``fiⁿₐˡᵢᶻₐᵗᵢᵒₙ`` normalizes to
530+
``finalization``, so Python treats them as the same name::
531+
532+
   >>> fiⁿₐˡᵢᶻₐᵗᵢᵒₙ = 3
533+
   >>> finalization
534+
   3
535+
536+
.. note::
537+
538+
   Normalization is done at the lexical level only.
539+
   Run-time functions that take names as *strings* generally do not normalize
540+
   their arguments.
541+
   For example, the variable defined above is accessible at run time in the
542+
   :func:`globals` dictionary as ``globals()["finalization"]`` but not
543+
   ``globals()["fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"]``.
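
The following sketch is an illustration, not part of the commit. It shows one way a program could apply the same NFKC normalization by hand before a string-based lookup, using :func:`unicodedata.normalize`. The helper name ``lookup_normalized`` and the sample ``namespace`` dictionary are hypothetical.

```python
import unicodedata

fancy = "fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"

# NFKC maps the superscript and subscript letters to their basic forms.
normalized = unicodedata.normalize("NFKC", fancy)
print(normalized)


def lookup_normalized(namespace: dict, name: str):
    """Normalize ``name`` to NFKC before looking it up (hypothetical helper)."""
    return namespace[unicodedata.normalize("NFKC", name)]


# A stand-in for globals() after running ``finalization = 3``.
namespace = {"finalization": 3}
print(fancy in namespace)                    # the raw spelling is not a key
print(lookup_normalized(namespace, fancy))   # the normalized spelling is
```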
544+
545+
Similarly to how ASCII-only names must contain only letters, digits and
546+
the underscore, and cannot start with a digit, a valid name must
547+
start with a character in the "letter-like" set ``xid_start``,
548+
and the remaining characters must be in the "letter- and digit-like" set
549+
``xid_continue``.
550+
551+
These sets are based on the *XID_Start* and *XID_Continue* sets as defined by the
552+
Unicode standard annex `UAX-31`_.
553+
Python's ``xid_start`` additionally includes the underscore (``_``).
554+
Note that Python does not necessarily conform to `UAX-31`_.
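
For a quick check at run time, the built-in :meth:`str.isidentifier` and :func:`keyword.iskeyword` can be combined, as in the sketch below. This is an illustration rather than part of the commit, and ``is_valid_identifier`` is a hypothetical helper name. It reports validity only and does not return the normalized form of the name.

```python
import keyword


def is_valid_identifier(name: str) -> bool:
    """Return True for a valid NAME that is not a keyword (hypothetical helper)."""
    return name.isidentifier() and not keyword.iskeyword(name)


# ASCII-escaped output keeps the example printable on any console.
for candidate in ["ř_1", "साँप", "r〰2", "🐍", "class", "1abc"]:
    print(ascii(candidate), is_valid_identifier(candidate))
```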
555+
556+
A non-normative listing of characters in the *XID_Start* and *XID_Continue*
557+
sets as defined by Unicode is available in the `DerivedCoreProperties.txt`_
558+
file in the Unicode Character Database.
559+
For reference, the construction rules for the ``xid_*`` sets are given below.
560+
561+
The set ``id_start`` is defined as the union of:
562+
563+
* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
564+
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
565+
* Unicode category ``<Lt>`` - titlecase letters
566+
* Unicode category ``<Lm>`` - modifier letters
567+
* Unicode category ``<Lo>`` - other letters
568+
* Unicode category ``<Nl>`` - letter numbers
569+
* {``"_"``} - the underscore
570+
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
571+
  to support backwards compatibility
572+
573+
The set ``xid_start`` then closes this set under NFKC normalization, by
574+
removing all characters whose normalization is not of the form
575+
``id_start id_continue*``.
576+
577+
The set ``id_continue`` is defined as the union of:
578+
579+
* ``id_start`` (see above)
580+
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
581+
* Unicode category ``<Pc>`` - connector punctuations
582+
* Unicode category ``<Mn>`` - nonspacing marks
583+
* Unicode category ``<Mc>`` - spacing combining marks
584+
* ``<Other_ID_Continue>`` - another explicit set of characters in
585+
  `PropList.txt`_ to support backwards compatibility
586+
587+
Again, ``xid_continue`` closes this set under NFKC normalization.
588+
589+
Unicode categories use the version of the Unicode Character Database as
590+
included in the :mod:`unicodedata` module.
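
The construction rules above can be approximated with :func:`unicodedata.category`, as in the sketch below. It is an illustration under stated assumptions, not the interpreter's implementation: it omits the ``<Other_ID_Start>`` and ``<Other_ID_Continue>`` sets, which :mod:`unicodedata` does not expose, and it approximates the per-character NFKC closure by checking the whole name both before and after normalization. All function names are hypothetical.

```python
import unicodedata

# General categories listed for id_start and the extra ones for id_continue.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def in_id_start(ch: str) -> bool:
    # Assumption: <Other_ID_Start> is ignored in this sketch.
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def in_id_continue(ch: str) -> bool:
    # Assumption: <Other_ID_Continue> is ignored in this sketch.
    return in_id_start(ch) or (
        unicodedata.category(ch) in ID_CONTINUE_EXTRA_CATEGORIES
    )


def has_name_shape(text: str) -> bool:
    """True if text has the form ``id_start id_continue*``."""
    return (
        bool(text)
        and in_id_start(text[0])
        and all(in_id_continue(ch) for ch in text[1:])
    )


def is_approximately_valid_name(name: str) -> bool:
    """Check the raw name and its NFKC form (approximates the xid_* closure)."""
    return has_name_shape(name) and has_name_shape(
        unicodedata.normalize("NFKC", name)
    )


for candidate in ["ř_1", "r〰2", "🐍", "x\u00b2", "\u037a"]:
    print(ascii(candidate), is_approximately_valid_name(candidate))
```

Because of the omitted sets, results may differ from ``str.isidentifier()`` for a small number of characters.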
591+
592+
.. _UAX-31: https://www.unicode.org/reports/tr31/
593+
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
594+
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
595+
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
596+
597+
.. seealso::
598+
599+
   * :pep:`3131` -- Supporting Non-ASCII Identifiers
600+
   * :pep:`672` -- Unicode-related Security Considerations for Python
601+
602+
558603
.. _literals:
559604

560605
Literals
