Note that neither PEP 3131 nor current Python uses the Unicode definition of XID_Start to determine identifiers -- they additionally allow the underscore:

>>> unicodedata.isxidstart('_')
False
>>> '_'.isidentifier()
True

Also, there's an easier way to explain name parsing, which involves only id_start & id_continue, and not the xid variants (whose definitions are more complicated): #140464 (review)

malemburg commented Nov 4, 2025

Python uses this internally as part of figuring out what a valid identifier is, but XID_Start/XID_Continue are also important for parsing other languages that use them as the basis for their identifier definitions.

Note that you need to use the XID variants if you are working with NFKC normalized text. See https://www.unicode.org/reports/tr31/#NFKC_Modifications

From unicodeobject.c:

_PyUnicode_ScanIdentifier(PyObject *self)
{
    Py_ssize_t i;
    Py_ssize_t len = PyUnicode_GET_LENGTH(self);
    if (len == 0) {
        /* an empty string is not a valid identifier */
        return 0;
    }
    int kind = PyUnicode_KIND(self);
    const void *data = PyUnicode_DATA(self);
    Py_UCS4 ch = PyUnicode_READ(kind, data, 0);
    /* PEP 3131 says that the first character must be in
       XID_Start and subsequent characters in XID_Continue,
       and for the ASCII range, the 2.x rules apply (i.e
       start with letters and underscore, continue with
       letters, digits, underscore). However, given the current
       definition of XID_Start and XID_Continue, it is sufficient
       to check just for these, except that _ must be allowed
       as starting an identifier.  */
    if (!_PyUnicode_IsXidStart(ch) && ch != 0x5F /* LOW LINE */) {
        return 0;
    }
    for (i = 1; i < len; i++) {
        ch = PyUnicode_READ(kind, data, i);
        if (!_PyUnicode_IsXidContinue(ch)) {
            return i;
        }
    }
    return i;
}

The underscore is a special exception added for Python.
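
The scan in the C function can be approximated in pure Python. This is only a sketch, not CPython's implementation: `is_xid_start()` and `is_xid_continue()` are hypothetical helpers built on `str.isidentifier()`, on the assumption (taken from the C code above) that a one-character string is an identifier exactly when the character is in XID_Start or is the underscore.

```python
def is_xid_start(ch: str) -> bool:
    # Hypothetical helper: a single character passes isidentifier() when it is
    # in XID_Start or is '_', so the underscore is excluded explicitly.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    # Hypothetical helper: 'a' is a valid start, so the result depends only on
    # whether ch is accepted as a continuation character.
    return ("a" + ch).isidentifier()


def scan_identifier(text: str) -> int:
    """Return the length of the identifier prefix of text (0 if none)."""
    if not text:
        # An empty string is not a valid identifier.
        return 0
    first = text[0]
    # Python's extra rule: '_' may start an identifier even though it is
    # not in XID_Start.
    if not is_xid_start(first) and first != "_":
        return 0
    for i in range(1, len(text)):
        if not is_xid_continue(text[i]):
            return i
    return len(text)
```

For example, `scan_identifier("ab-cd")` stops at the hyphen, and `scan_identifier("_x1")` accepts the whole string only because of the explicit underscore check.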

encukou commented Nov 5, 2025

> XID_Start/XID_Continue are also important for parsing other languages that use them as the basis for their identifier definitions.

OK, that's a valid reason. Thanks!
I'll add a note about _ to clear up confusion.

> Note that you need to use the XID variants if you are working with NFKC normalized text.

Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

malemburg commented Nov 5, 2025

> Note that you need to use the XID variants if you are working with NFKC normalized text.
> Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

No, since the normalization creates a few special cases which the ID variants won't handle. From the tech report: "Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties."

See https://www.unicode.org/reports/tr31/#NFKC_Modifications for details.

Since Python is doing exactly that (normalizing to NFKC before parsing), it needs to use the XID variants.

encukou commented Nov 5, 2025 (edited)

AFAIK, these modifications are exactly what's covered by normalizing and checking the result.
For the first example there, a THAI CHARACTER SARA AM is a Lo, which puts it in ID_Start:

>>> unicodedata.category('\N{THAI CHARACTER SARA AM}')
'Lo'

But normalizing turns it into 2 characters, Mn and Lo:

>>> [(unicodedata.name(c), unicodedata.category(c)) for c in unicodedata.normalize('NFKC', '\N{THAI CHARACTER SARA AM}')]
[('THAI CHARACTER NIKHAHIT', 'Mn'), ('THAI CHARACTER SARA AA', 'Lo')]

The NIKHAHIT (Mn) is in ID_Continue but not ID_Start, which means the SARA AM can't start an identifier despite being a letter:

>>> '\N{THAI CHARACTER SARA AM}'.isidentifier()
False

That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.
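
The SARA AM comparison can be written as a small script. This is a rough sketch under a stated assumption: `approx_id_identifier()` is a hypothetical helper that approximates ID_Start and ID_Continue from general categories only, ignoring Other_ID_Start, Other_ID_Continue and the Pattern_Syntax exclusions, so it is not a full implementation of the ID properties.

```python
import unicodedata

# Approximation (assumption): general categories only.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def approx_id_identifier(text: str) -> bool:
    """Check text against the approximated ID_Start/ID_Continue rules."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    # Allow '_' at the start, as Python does.
    if first != "_" and unicodedata.category(first) not in ID_START_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in ID_CONTINUE_CATEGORIES for ch in rest)


def id_after_nfkc(text: str) -> bool:
    """Normalize to NFKC first, then apply the approximated ID rules."""
    return approx_id_identifier(unicodedata.normalize("NFKC", text))


sara_am = "\N{THAI CHARACTER SARA AM}"
xid_on_raw = sara_am.isidentifier()        # XID rules, no normalization
id_on_normalized = id_after_nfkc(sara_am)  # ID rules after normalization
id_on_raw = approx_id_identifier(sara_am)  # ID rules, no normalization
```

Here `xid_on_raw` and `id_on_normalized` agree in rejecting the character, while `id_on_raw` accepts it, which is the mismatch the XID variants are meant to remove.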

malemburg commented Nov 5, 2025

> That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.

Rereading the section in the TR, you could be right in a way 🙂

It discusses closure under normalization and this essentially means that the isIdentifier() property should give the same results regardless of whether it is applied to normalized text or raw text.

Using the XID variants to implement isIdentifier() will get you this property.

Python uses the XID variants on NFKC normalized text (since it has to normalize anyway) and so the results with respect to being identifiers are the same.
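
A quick way to probe the closure property is to compare `str.isidentifier()` on raw and NFKC-normalized strings. The sketch below is illustrative only: `closed_under_nfkc()` and the sample list are hypothetical, and it checks just one direction (a raw string that is accepted must still be accepted after normalization), which is how closure is usually stated.

```python
import unicodedata


def closed_under_nfkc(samples) -> bool:
    """Return True if every accepted sample is still accepted after NFKC."""
    for s in samples:
        normalized = unicodedata.normalize("NFKC", s)
        # A raw identifier that stops being one after normalization would
        # violate closure.
        if s.isidentifier() and not normalized.isidentifier():
            return False
    return True


samples = [
    "name",
    "\N{LATIN SMALL LIGATURE FI}le",  # NFKC folds the ligature to "fi"
    "\N{THAI CHARACTER SARA AM}",     # rejected on raw text by the XID rules
    "_private",
]
all_closed = closed_under_nfkc(samples)
```

A passing check over a handful of samples is evidence, not proof; the guarantee itself comes from how the XID properties are derived.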

Applications parsing other languages may choose to not normalize first, so for them the XID variants are beneficial as well.
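
As an illustration of that use case, here is a minimal sketch of an identifier scanner for a made-up language that follows the plain XID rules, without Python's underscore exception at the start and without normalizing first. `iter_identifiers()` and both helpers are hypothetical names; the helpers lean on `str.isidentifier()` rather than on any particular `unicodedata` function.

```python
def is_xid_start(ch: str) -> bool:
    # Hypothetical helper: drop Python's extra '_' rule to get plain XID_Start.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    # Hypothetical helper: prefix a known-valid start character and test ch.
    return ("a" + ch).isidentifier()


def iter_identifiers(text: str):
    """Yield maximal XID_Start XID_Continue* runs found in text."""
    i, n = 0, len(text)
    while i < n:
        if is_xid_start(text[i]):
            j = i + 1
            while j < n and is_xid_continue(text[j]):
                j += 1
            yield text[i:j]
            i = j
        else:
            # Not the start of an identifier; skip one character.
            i += 1


names = list(iter_identifiers("foo = bar1 + baz_qux"))
```

A real lexer would also handle numbers, strings and operators; this only shows where the two properties plug in.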

In other places in the TR, it recommends always using the XID variants: "They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties." (see https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax and https://www.unicode.org/reports/tr31/#Migration).

In fact, most of the TR was updated to use the XID variants instead of the ID ones, with the ID variants only left in for backwards compatibility with Unicode versions prior to version 9.

So all in all, you're right in that the purpose of using XID is more generic and can be applied before or after normalization, giving the same results. In addition, it's also safer, since your text may in some cases be half normalized and half raw and XID will still do a proper job, whereas ID may fail in some edge cases.

encukou commented Nov 5, 2025

Ah! It all makes sense now. Thank you!

encukou mentioned this pull request on Nov 5, 2025 in gh-135676: Simplify docs on lexing names #140464 (Draft)

gh-129117: Expose _PyUnicode_IsXidContinue/Start in unicodedata #140269
