gh-129117: Expose _PyUnicode_IsXidContinue/Start in unicodedata #140269


Merged
vstinner merged 16 commits into python:main from StanFromIreland:startcontinueid
Oct 30, 2025

Conversation

@StanFromIreland (Member) commented Oct 17, 2025

StanFromIreland requested a review from vstinner

@vstinner (Member)

I don't know these Unicode properties. The PR documentation doesn't help me:

> Return True if the character has the XID_Start property

What does XID_Start mean?

@StanFromIreland (Member, Author)

Ah no worries then. You can find their documentation in this report, I can add a link to it in the docs.

@vstinner (Member)

In short, these functions check if a character is an identifier start or an identifier character according to Unicode TR31?

@StanFromIreland (Member, Author)

Yes.
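
For readers following along, here is a minimal sketch of what the two checks look like from Python. It assumes the functions are exposed as unicodedata.isxidstart() and unicodedata.isxidcontinue() (the first name appears later in this thread, the second is assumed by analogy). The wrapper names is_xid_start and is_xid_continue, and the str.isidentifier() fallback for interpreters that lack the new functions, are illustrative and not part of the PR.

```python
import unicodedata


def is_xid_start(ch: str) -> bool:
    """Return True if the single character ch may start a TR31 identifier."""
    if len(ch) != 1:
        raise TypeError("expected a single character")
    native = getattr(unicodedata, "isxidstart", None)
    if native is not None:
        return native(ch)
    # Fallback (assumption): str.isidentifier() checks XID_Start on the first
    # character but also accepts "_", so the underscore is excluded here.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    """Return True if the single character ch may continue a TR31 identifier."""
    if len(ch) != 1:
        raise TypeError("expected a single character")
    native = getattr(unicodedata, "isxidcontinue", None)
    if native is not None:
        return native(ch)
    # Fallback (assumption): prefix a known-valid start character, so the
    # result depends only on whether ch is XID_Continue.
    return ("a" + ch).isidentifier()


for ch in ("a", "1", "_", "-"):
    print(repr(ch), is_xid_start(ch), is_xid_continue(ch))
```

A digit is a valid continuation but not a valid start, which is the distinction the two properties encode.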

@malemburg (Member) left a comment

LGTM now

@StanFromIreland (Member, Author)

Thanks for the reviews!

@vstinner (Member) left a comment

The change is correct, but I'm not convinced that we have to expose this feature in Python. It seems to be a Unicode feature which is rarely used.

@malemburg (Member)

Have a look at https://peps.python.org/pep-3131/ for why these are important to have.

vstinner reacted with thumbs up emoji

@vstinner (Member) left a comment

About function names, the Unicode annex also has ID_Start and ID_Continue. XID is a variant. Maybe we should keep x in the function names?

@StanFromIreland (Member, Author) commented Oct 29, 2025

> About function names, the Unicode annex also has ID_Start and ID_Continue.

Note that they explicitly recommend the "X" variants.

@malemburg (Member)

> Maybe we should keep x in the function names?

You have a point there. Let's keep the "x" in "xid" for the functions to not cause confusion.

StanFromIreland reacted with thumbs up emoji

@vstinner (Member) left a comment

LGTM

vstinner enabled auto-merge (squash) October 30, 2025 09:53
vstinner merged commit dbe3950 into python:main Oct 30, 2025
46 checks passed
StanFromIreland deleted the startcontinueid branch October 30, 2025 10:21
@StanFromIreland (Member, Author)

Thanks for merging!

vstinner reacted with heart emoji

@encukou (Member)

> Have a look at https://peps.python.org/pep-3131/ for why these are important to have.

It's still not clear to me why isidentifier() is not enough here.

Note that neither PEP-3131 nor current Python uses the Unicode definition of XID_Start to determine identifiers -- they additionally allow the underscore:

    >>> unicodedata.isxidstart('_')
    False
    >>> '_'.isidentifier()
    True

Also, there's an easier way to explain name parsing, which involves only id_start & id_continue, and not the xid variants (whose definitions are more complicated): #140464 (review)

@malemburg (Member)

Python uses this internally as part of figuring out what a valid identifier is, but XID_Start/XID_Continue are also important to be able to parse other languages which use these as the basis for their identifier definitions.

Note that you need to use the XID variants if you are working with NFKC normalized text. See https://www.unicode.org/reports/tr31/#NFKC_Modifications

From unicodeobject.c:

    Py_ssize_t
    _PyUnicode_ScanIdentifier(PyObject *self)
    {
        Py_ssize_t i;
        Py_ssize_t len = PyUnicode_GET_LENGTH(self);
        if (len == 0) {
            /* an empty string is not a valid identifier */
            return 0;
        }

        int kind = PyUnicode_KIND(self);
        const void *data = PyUnicode_DATA(self);
        Py_UCS4 ch = PyUnicode_READ(kind, data, 0);
        /* PEP 3131 says that the first character must be in
           XID_Start and subsequent characters in XID_Continue,
           and for the ASCII range, the 2.x rules apply (i.e.
           start with letters and underscore, continue with
           letters, digits, underscore). However, given the current
           definition of XID_Start and XID_Continue, it is sufficient
           to check just for these, except that _ must be allowed
           as starting an identifier.  */
        if (!_PyUnicode_IsXidStart(ch) && ch != 0x5F /* LOW LINE */) {
            return 0;
        }

        for (i = 1; i < len; i++) {
            ch = PyUnicode_READ(kind, data, i);
            if (!_PyUnicode_IsXidContinue(ch)) {
                return i;
            }
        }
        return i;
    }

The underscore is a special exception added for Python.
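
The same scan can be sketched in Python, which may make the underscore exception easier to see. This is an illustrative translation of the C function above, not code from CPython; scan_identifier and is_identifier are hypothetical names. It assumes unicodedata.isxidstart() and unicodedata.isxidcontinue() are available and otherwise falls back on str.isidentifier(), which makes the fallback circular and only useful as a demonstration.

```python
import unicodedata

# Prefer the functions added by this PR; the lambdas are a stand-in
# (assumption) for interpreters that do not have them yet.
_isxidstart = getattr(
    unicodedata, "isxidstart", lambda c: c != "_" and c.isidentifier()
)
_isxidcontinue = getattr(
    unicodedata, "isxidcontinue", lambda c: ("a" + c).isidentifier()
)


def scan_identifier(s: str) -> int:
    """Return the length of the longest identifier prefix of s (0 if none)."""
    if not s:
        # An empty string is not a valid identifier.
        return 0
    first = s[0]
    # XID_Start, plus the Python-specific exception for "_".
    if not _isxidstart(first) and first != "_":
        return 0
    for i in range(1, len(s)):
        if not _isxidcontinue(s[i]):
            return i
    return len(s)


def is_identifier(s: str) -> bool:
    """Whole-string check built on the prefix scan."""
    return len(s) > 0 and scan_identifier(s) == len(s)


print(scan_identifier("abc-def"))  # 3
print(scan_identifier("_x1"))      # 3
print(scan_identifier("1abc"))     # 0
```

Returning a prefix length rather than a boolean lets a caller report where an invalid identifier stops being valid.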

StanFromIreland reacted with thumbs up emoji

@encukou (Member)

> XID_Start/XID_Continue are also important to be able to parse other languages which use these as the basis for their identifier definitions.

OK, that's a valid reason. Thanks!
I'll add a note about _ to clear up confusion.

> Note that you need to use the XID variants if you are working with NFKC normalized text.

Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

@malemburg (Member)

> > Note that you need to use the XID variants if you are working with NFKC normalized text.
>
> Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

No, since the normalization creates a few special cases which the ID variants won't handle. From the tech report: "Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties."

See https://www.unicode.org/reports/tr31/#NFKC_Modifications for details.

Since Python is doing exactly that (normalizing to NFKC before parsing), it needs to use the XID variants.
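
The NFKC normalization of identifiers can be observed from Python itself. The sketch below only shows that a name written with a compatibility character ends up stored under its NFKC form; it says nothing about the order in which CPython validates and normalizes. The variable names are illustrative.

```python
import unicodedata

# "ﬁle": LATIN SMALL LIGATURE FI followed by "le" (3 code points).
source_name = "\N{LATIN SMALL LIGATURE FI}le"

namespace = {}
exec(f"{source_name} = 1", namespace)

# The compiler stores the NFKC form of the name, i.e. plain "file".
normalized = unicodedata.normalize("NFKC", source_name)
print(normalized)                 # file
print(normalized in namespace)    # True
print(source_name in namespace)   # False
```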

@encukou (Member) commented Nov 5, 2025

AFAIK, these modifications are exactly what's covered by normalizing and checking the result.
For the first example there, a THAI CHARACTER SARA AM is a Lo, which puts it in ID_Start:

    >>> unicodedata.category('\N{THAI CHARACTER SARA AM}')
    'Lo'

But normalizing turns it into 2 characters, Mn and Lo:

    >>> [(unicodedata.name(c), unicodedata.category(c)) for c in unicodedata.normalize('NFKC', '\N{THAI CHARACTER SARA AM}')]
    [('THAI CHARACTER NIKHAHIT', 'Mn'), ('THAI CHARACTER SARA AA', 'Lo')]

The NIKHAHIT (Mn) is in ID_Continue but not ID_Start, which means the SARA AM can't start an identifier despite being a letter:

    >>> '\N{THAI CHARACTER SARA AM}'.isidentifier()
    False

That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.
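
The comparison above can be put into one runnable sketch. Python does not expose ID_Start or ID_Continue directly, so the helpers below approximate them from general categories; that approximation ignores Other_ID_Start, Other_ID_Continue and the Pattern_Syntax exclusions and is an assumption made for illustration. All helper names are hypothetical.

```python
import unicodedata

# Rough stand-ins for ID_Start / ID_Continue based on general category only.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    return unicodedata.category(ch) in ID_CONTINUE_CATEGORIES


def id_check(text: str) -> bool:
    """ID-style identifier check on the text exactly as given."""
    return (
        bool(text)
        and approx_id_start(text[0])
        and all(approx_id_continue(c) for c in text[1:])
    )


def id_check_after_nfkc(text: str) -> bool:
    """Normalize to NFKC first, then apply the ID-style check."""
    return id_check(unicodedata.normalize("NFKC", text))


sara_am = "\N{THAI CHARACTER SARA AM}"
print(id_check(sara_am))             # True: a lone Lo passes the raw ID check
print(id_check_after_nfkc(sara_am))  # False: NFKC puts an Mn first
print(sara_am.isidentifier())        # False: XID check on the raw text
```

For this character, the ID check after normalization and the XID check before normalization agree, while the ID check on raw text does not.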

@malemburg (Member)

> That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.

Rereading the section in the TR, you could be right in a way 🙂

It discusses closure under normalization and this essentially means that the isIdentifier() property should give the same results regardless of whether it is applied to normalized text or raw text.

Using the XID variants to implement isIdentifier() will get you this property.

Python uses the XID variants on NFKC normalized text (since it has to normalize anyway) and so the results with respect to being identifiers are the same.

Applications parsing other languages may choose to not normalize first, so for them the XID variants are beneficial as well.

In other places in the TR, it recommends always using the XID variants: "They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties." (see https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax and https://www.unicode.org/reports/tr31/#Migration).

In fact, most of the TR was updated to use the XID variants instead of the ID ones, with the ID variants only left in for backwards compatibility with Unicode versions prior to version 9.

So all in all, you're right in that the purpose of using XID is more generic and can be applied before or after normalization, giving the same results. In addition, it's also safer, since your text may in some cases be half normalized and half raw and XID will still do a proper job, whereas ID may fail in some edge cases.
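
As an illustration of the "parsing other languages" use case, here is a small sketch of an identifier scanner for a hypothetical language whose identifiers are exactly XID_Start followed by XID_Continue*, applied to raw (non-normalized) text. The function name iter_identifiers is made up for this example, and the str.isidentifier() fallback is an assumption for interpreters without unicodedata.isxidstart() and unicodedata.isxidcontinue().

```python
import unicodedata

_isxidstart = getattr(
    unicodedata, "isxidstart", lambda c: c != "_" and c.isidentifier()
)
_isxidcontinue = getattr(
    unicodedata, "isxidcontinue", lambda c: ("a" + c).isidentifier()
)


def iter_identifiers(text: str):
    """Yield (offset, identifier) for each XID_Start XID_Continue* run."""
    i, n = 0, len(text)
    while i < n:
        if not _isxidcontinue(text[i]):
            i += 1
            continue
        # Consume the whole run of XID_Continue characters.
        j = i + 1
        while j < n and _isxidcontinue(text[j]):
            j += 1
        # Only runs that begin with XID_Start count as identifiers, so
        # "2y" is skipped as a whole instead of yielding "y".
        if _isxidstart(text[i]):
            yield i, text[i:j]
        i = j


print(list(iter_identifiers("let größe = x1 + 2y;")))
# [(0, 'let'), (4, 'größe'), (12, 'x1')]
print(list(iter_identifiers("_tmp")))
# []
```

Unlike str.isidentifier(), this scanner does not treat a leading underscore as a valid start, which matches the plain XID_Start definition rather than Python's own rule.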

StanFromIreland reacted with thumbs up emoji

@encukou (Member)

Ah! It all makes sense now. Thank you!

malemburg and StanFromIreland reacted with thumbs up emoji


Reviewers

vstinner approved these changes
malemburg approved these changes
AA-Turner (code owner): awaiting requested review
ezio-melotti: awaiting requested review
erlend-aasland (code owner): awaiting requested review
emmatyping (code owner): awaiting requested review


4 participants

@StanFromIreland, @vstinner, @malemburg, @encukou
