Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork33.4k
gh-129117: Expose_PyUnicode_IsXidContinue/Start inunicodedata#140269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Uh oh!
There was an error while loading.Please reload this page.
vstinner commentedOct 28, 2025
I don't know these Unicode properties. The PR documentation doesn't help me:
What does it mean |
StanFromIreland commentedOct 28, 2025
Ah no worries then. You can find their documentation inthis report, I can add a link to it in the docs. |
vstinner commentedOct 28, 2025
In short, these functions check if a character is an identifier start or an identifier character according to Unicode TR31? |
StanFromIreland commentedOct 28, 2025
Yes. |
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
This reverts commitb24b994.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
malemburg left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM now
Uh oh!
There was an error while loading.Please reload this page.
StanFromIreland commentedOct 29, 2025
Thanks for the reviews! |
Uh oh!
There was an error while loading.Please reload this page.
vstinner left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The change is correct, but I'm not convinced that wehave to expose this feature in Python. It seems to be an Unicode feature which rarely used.
malemburg commentedOct 29, 2025
Have a look athttps://peps.python.org/pep-3131/ for why these are important to have. |
vstinner left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
About function names, the Unicode annex has alsoID_Start andID_Continue. TheXID is a variant. Maybe we should keepx in the function names?
StanFromIreland commentedOct 29, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Note that they explicitly recommend the "X" variants. |
malemburg commentedOct 29, 2025
You have a point there. Let's keep the "x" in "xid" for the functions to not cause confusion. |
vstinner left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
dbe3950 intopython:mainUh oh!
There was an error while loading.Please reload this page.
StanFromIreland commentedOct 30, 2025
Thanks for merging! |
encukou commentedNov 4, 2025
It's still not clear to me why Note that neitherPEP-3131 nor current Python use the Unicode definition ofXID_Start to determine identifiers -- they additionally allow the underscore: >>>unicodedata.isxidstart('_')False>>>'_'.isidentifier()True Also, there's an easier way to explain name parsing, which involves only |
malemburg commentedNov 4, 2025
Python uses this internally as part of figuring out what a valid identified is, but XID_Start/End are also important to be able to parse other languages which use these are basis for their identifier definitions. Note that you need to use the XID variants if you are working with NFKC normalized text. Seehttps://www.unicode.org/reports/tr31/#NFKC_Modifications From unicodeobject.c: The underscore is a special exception added for Python. |
encukou commentedNov 5, 2025
OK, that's a valid reason. Thanks!
Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.) |
malemburg commentedNov 5, 2025
No, since the normalization creates a few special cases which the ID variants won't handle. From the tech report: "Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties." Seehttps://www.unicode.org/reports/tr31/#NFKC_Modifications for details. Since Python is doing exactly that (normalizing to NFKC before parsing), it needs to use the XID variants. |
encukou commentedNov 5, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
AFAIK, these modifications are exactly what's covered by normalizing and checking the result. But normalizing turns it into 2 characters, The NIKHAHIT ( That is: using theXID propertiesbefore normalization will get you the same result as using theID onesafter normalization. IOW, you need to use the XID variants if you arenot working with NFKC normalized text. |
malemburg commentedNov 5, 2025
Rereading the section in the TR, you could be right in a way 🙂 It discusses closure under normalization and this essentially means that the Using the XID variants to implement Python uses the XID variants on NFKC normalized text (since it has to normalize anyway) and so the results with respect to being identifiers are the same. Applications parsing other languages may choose to not normalize first, so for them the XID variants are beneficial as well. In other places in the TR, it recommends always using the XID variants: "They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties." (seehttps://www.unicode.org/reports/tr31/#Default_Identifier_Syntax andhttps://www.unicode.org/reports/tr31/#Migration). In fact, most of the TR was updated to use the XID variants instead of the ID ones, with the ID variantes only left in for backwards compatibility with Unicode versions prior to version 9. So all in all, you're right in that the purpose of using XID is more generic and can be applied before or after normalization, giving the same results. In addition, it's also safer, since your text may in some cases be half normalized and half raw and XID will still do a proper job, whereas ID may fail in some edge cases. |
encukou commentedNov 5, 2025
Ah! It all makes sense now. Thank you! |
Uh oh!
There was an error while loading.Please reload this page.
📚 Documentation preview 📚:https://cpython-previews--140269.org.readthedocs.build/