Commit c7364f7


gh-127833: lexical analysis: Improve section on Names (GH-131474)

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>

1 parent 109f759 commit c7364f7

2 files changed (+77 -53 lines changed):

- Doc/reference/lexical_analysis.rst
- Tools/unicode/makeunicodedata.py

`‎Doc/reference/lexical_analysis.rst`

Lines changed: 76 additions & 52 deletions

Original file line number	Diff line number	Diff line change
`@@ -288,58 +288,81 @@ forms a legal token, when read from left to right.`
`288`	`288`
`289`	`289`	`.. _identifiers:`
`290`	`290`
`291`		`-Identifiers and keywords`
`292`		`-========================`
	`291`	`+Names (identifiers and keywords)`
	`292`	`+================================`
`293`	`293`
`294`	`294`	`.. index:: identifier, name`
`295`	`295`
`296`		`-Identifiers (also referred to as names) are described by the following lexical`
`297`		`-definitions.`
	`296`	+:data:`~token.NAME` tokens represent identifiers, keywords, and
	`297`	`+soft keywords.`
`298`	`298`
`299`		`-The syntax of identifiers in Python is based on the Unicode standard annex`
`300`		-UAX-31, with elaboration and changes as defined below; see also :pep:`3131` for
`301`		`-further details.`
`302`		`-`
`303`		`-Within the ASCII range (U+0001..U+007F), the valid characters for identifiers`
`304`		-include the uppercase and lowercase letters ``A`` through
`305`		-``Z``, the underscore ``_`` and, except for the first character, the digits
	`299`	`+Within the ASCII range (U+0001..U+007F), the valid characters for names`
	`300`	+include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
	`301`	+the underscore ``_`` and, except for the first character, the digits
`306`	`302`	``0`` through ``9``.
`307`		`-Python 3.0 introduced additional characters from outside the ASCII range (see`
`308`		-:pep:`3131`). For these characters, the classification uses the version of the
`309`		-Unicode Character Database as included in the :mod:`unicodedata` module.
`310`	`303`
`311`		`-Identifiers are unlimited in length. Case is significant.`
	`304`	`+Names must contain at least one character, but have no upper length limit.`
	`305`	`+Case is significant.`
`312`	`306`
`313`		`-.. productionlist:: python-grammar`
`314`		-   identifier: `xid_start` `xid_continue`*
`315`		`-   id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>`
`316`		-   id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
`317`		-   xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*">
`318`		-   xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*">
`319`		`-`
`320`		`-The Unicode category codes mentioned above stand for:`
`321`		`-`
`322`		`-* Lu - uppercase letters`
`323`		`-* Ll - lowercase letters`
`324`		`-* Lt - titlecase letters`
`325`		`-* Lm - modifier letters`
`326`		`-* Lo - other letters`
`327`		`-* Nl - letter numbers`
`328`		`-* Mn - nonspacing marks`
`329`		`-* Mc - spacing combining marks`
`330`		`-* Nd - decimal numbers`
`331`		`-* Pc - connector punctuations`
`332`		-* Other_ID_Start - explicit list of characters in `PropList.txt
`333`		-<https://www.unicode.org/Public/16.0.0/ucd/PropList.txt>`_ to support backwards
`334`		`- compatibility`
`335`		`-* Other_ID_Continue - likewise`
`336`		`-`
`337`		`-All identifiers are converted into the normal form NFKC while parsing; comparison`
`338`		`-of identifiers is based on NFKC.`
`339`		`-`
`340`		`-A non-normative HTML file listing all valid identifier characters for Unicode`
`341`		`-16.0.0 can be found at`
`342`		`-https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt`
	`307`	+Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
	`308`	`+and "number-like" characters from outside the ASCII range, as detailed below.`
	`309`	`+`
	`310`	+All identifiers are converted into the `normalization form`_ NFKC while
	`311`	`+parsing; comparison of identifiers is based on NFKC.`
	`312`	`+`
	`313`	`+Formally, the first character of a normalized identifier must belong to the`
	`314`	+set ``id_start``, which is the union of:
	`315`	`+`
	`316`	+* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
	`317`	+* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
	`318`	+* Unicode category ``<Lt>`` - titlecase letters
	`319`	+* Unicode category ``<Lm>`` - modifier letters
	`320`	+* Unicode category ``<Lo>`` - other letters
	`321`	+* Unicode category ``<Nl>`` - letter numbers
	`322`	+* {``"_"``} - the underscore
	`323`	+* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
	`324`	`+ to support backwards compatibility`
	`325`	`+`
	`326`	+The remaining characters must belong to the set ``id_continue``, which is the
	`327`	`+union of:`
	`328`	`+`
	`329`	+* all characters in ``id_start``
	`330`	+* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
	`331`	+* Unicode category ``<Pc>`` - connector punctuations
	`332`	+* Unicode category ``<Mn>`` - nonspacing marks
	`333`	+* Unicode category ``<Mc>`` - spacing combining marks
	`334`	+* ``<Other_ID_Continue>`` - another explicit set of characters in
	`335`	+ `PropList.txt`_ to support backwards compatibility
	`336`	`+`
	`337`	`+Unicode categories use the version of the Unicode Character Database as`
	`338`	+included in the :mod:`unicodedata` module.
	`339`	`+`
	`340`	+These sets are based on the Unicode standard annex `UAX-31`_.
	`341`	+See also :pep:`3131` for further details.
	`342`	`+`
	`343`	`+Even more formally, names are described by the following lexical definitions:`
	`344`	`+`
	`345`	`+.. grammar-snippet::`
	`346`	`+   :group: python-grammar`
	`347`	`+`
	`348`	+   NAME: `xid_start` `xid_continue`*
	`349`	`+   id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>`
	`350`	+   id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
	`351`	+   xid_start: <all characters in `id_start` whose NFKC normalization is
	`352`	+      in (`id_start` `xid_continue`*)">
	`353`	+   xid_continue: <all characters in `id_continue` whose NFKC normalization is
	`354`	+      in (`id_continue`*)">
	`355`	+   identifier: <`NAME`, except keywords>
	`356`	`+`
	`357`	`+A non-normative listing of all valid identifier characters as defined by`
	`358`	+Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
	`359`	`+Character Database.`
	`360`	`+`
	`361`	`+`
	`362`	`+.. _UAX-31: https://www.unicode.org/reports/tr31/`
	`363`	`+.. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt`
	`364`	`+.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt`
	`365`	`+.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms`
`343`	`366`
`344`	`367`
`345`	`368`	`.. _keywords:`
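
Review note on the hunk above: the `id_start` / `id_continue` bullet lists can be approximated at run time with the `unicodedata` module. The following is only a rough sketch under stated assumptions: the helper names (`approx_id_start`, `approx_id_continue`, `approx_is_name`) are hypothetical, the `Other_ID_Start` and `Other_ID_Continue` sets are left out because `unicodedata` does not expose them, and the check follows the prose wording (normalize to NFKC first) rather than the `xid_*` productions. `str.isidentifier()` remains the authoritative test.

```python
import unicodedata

# Categories taken from the id_start / id_continue lists in the diff.
# Assumption: Other_ID_Start and Other_ID_Continue are omitted, because
# unicodedata does not expose those properties.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_ONLY_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def approx_id_start(ch):
    """Approximate membership in id_start for a single character."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch):
    """Approximate membership in id_continue for a single character."""
    return (
        approx_id_start(ch)
        or unicodedata.category(ch) in ID_CONTINUE_ONLY_CATEGORIES
    )


def approx_is_name(text):
    """Check the NFKC-normalized text against the approximate sets."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized:
        # Names must contain at least one character.
        return False
    return approx_id_start(normalized[0]) and all(
        approx_id_continue(ch) for ch in normalized[1:]
    )


# The parser normalizes identifiers to NFKC, so the ligature U+FB01
# followed by "le" binds the plain ASCII name "file".
namespace = {}
exec("\ufb01le = 1", namespace)
```

After running this, `namespace` contains the key `"file"`, which illustrates the "converted into the normalization form NFKC while parsing" sentence in the new text.
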
`@@ -351,7 +374,7 @@ Keywords`
`351`	`374`	`single: keyword`
`352`	`375`	`single: reserved word`
`353`	`376`
`354`		`-The following identifiers are used as reserved words, or keywords of the`
	`377`	`+The following names are used as reserved words, or keywords of the`
`355`	`378`	`language, and cannot be used as ordinary identifiers. They must be spelled`
`356`	`379`	`exactly as written here:`
`357`	`380`
`@@ -375,18 +398,19 @@ Soft Keywords`
`375`	`398`
`376`	`399`	`.. versionadded:: 3.10`
`377`	`400`
`378`		`-Some identifiers are only reserved under specific contexts. These are known as`
`379`		-soft keywords. The identifiers ``match``, ``case``, ``type`` and ``_`` can
`380`		`-syntactically act as keywords in certain contexts,`
	`401`	`+Some names are only reserved under specific contexts. These are known as`
	`402`	`+soft keywords:`
	`403`	`+`
	`404`	+- ``match``, ``case``, and ``_``, when used in the :keyword:`match` statement.
	`405`	+- ``type``, when used in the :keyword:`type` statement.
	`406`	`+`
	`407`	`+These syntactically act as keywords in their specific contexts,`
`381`	`408`	`but this distinction is done at the parser level, not when tokenizing.`
`382`	`409`
`383`	`410`	`As soft keywords, their use in the grammar is possible while still`
`384`	`411`	`preserving compatibility with existing code that uses these names as`
`385`	`412`	`identifier names.`
`386`	`413`
`387`		-``match``, ``case``, and ``_`` are used in the :keyword:`match` statement.
`388`		-``type`` is used in the :keyword:`type` statement.
`389`		`-`
`390`	`414`	`.. versionchanged:: 3.12`
`391`	`415`	``type`` is now a soft keyword.
`392`	`416`

`‎Tools/unicode/makeunicodedata.py`

Lines changed: 1 addition & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -43,7 +43,7 @@`
`43`	`43`	`# When changing UCD version please update`
`44`	`44`	`# * Doc/library/stdtypes.rst, and`
`45`	`45`	`# * Doc/library/unicodedata.rst`
`46`		`-# * Doc/reference/lexical_analysis.rst (two occurrences)`
	`46`	`+# * Doc/reference/lexical_analysis.rst (three occurrences)`
`47`	`47`	`UNIDATA_VERSION = "16.0.0"`
`48`	`48`	`UNICODE_DATA = "UnicodeData%s.txt"`
`49`	`49`	`COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"`
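
Review note on the comment change: the documentation links are pinned to the UCD version in `UNIDATA_VERSION`, which is why the occurrence count matters. At run time the interpreter's UCD version is available as `unicodedata.unidata_version`. The sketch below builds the matching Unicode data file URL; `ucd_file_url` is a hypothetical helper, and the URL pattern is assumed from the links in the diff above.

```python
import unicodedata


def ucd_file_url(filename, version=unicodedata.unidata_version):
    """Build a URL for a UCD data file, following the pattern used in the docs."""
    return f"https://www.unicode.org/Public/{version}/ucd/{filename}"


# URL for the interpreter's own UCD version (varies by Python release).
current_proplist_url = ucd_file_url("PropList.txt")

# URL for the version pinned in this commit.
pinned_proplist_url = ucd_file_url("PropList.txt", "16.0.0")
```
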
