unicodedata module needs way of accurately determining XID_START and XID_CONTINUE properties. #129117

Closed
@mrolle45

Bug report

Bug description:

With the `unicodedata` module, it is possible to determine whether a Unicode character is a valid identifier start or identifier continuation character, except in a few cases.
The method is to look at `unicodedata.category(c)`.
A start character has a category in `"Lu Ll Lt Lm Lo Nl Pc".split()`.
A continue character has a category in `"Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split()`.
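The category test described above can be written out as a minimal sketch. The helper names `category_says_start` and `category_says_continue` are hypothetical and exist only for illustration; the category sets are copied from the report as-is.

```python
import unicodedata

# Category sets exactly as given in the report.
START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())
CONTINUE_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split())


def category_says_start(c: str) -> bool:
    """Guess identifier-start status from the general category alone."""
    return unicodedata.category(c) in START_CATEGORIES


def category_says_continue(c: str) -> bool:
    """Guess identifier-continue status from the general category alone."""
    return unicodedata.category(c) in CONTINUE_CATEGORIES
```

This is only the heuristic, not the real property: for example, `category_says_continue("\u00b7")` is `False` because MIDDLE DOT has category `Po`.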

However, there are several codepoints which don't match these criteria, either because they are not that type of character or because their category is different.
Here is a complete list of the exceptions, on Python 3.13 and Unicode version 16.0:
Should be XID_START but are not:

005f Pc True LOW LINE
037a Lm True GREEK YPOGEGRAMMENI
0e33 Lo True THAI CHARACTER SARA AM
0eb3 Lo True LAO VOWEL SIGN AM
203f Pc True UNDERTIE
2040 Pc True CHARACTER TIE
2054 Pc True INVERTED UNDERTIE
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe33 Pc True PRESENTATION FORM FOR VERTICAL LOW LINE
fe34 Pc True PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
fe4d Pc True DASHED LOW LINE
fe4e Pc True CENTRELINE LOW LINE
fe4f Pc True WAVY LOW LINE
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM
ff3f Pc True FULLWIDTH LOW LINE
ff9e Lm True HALFWIDTH KATAKANA VOICED SOUND MARK
ff9f Lm True HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

Should not be XID_START but are:

1885 Mn False MONGOLIAN LETTER ALI GALI BALUDA
1886 Mn False MONGOLIAN LETTER ALI GALI THREE BALUDA
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL

Should be XID_CONTINUE but are not:

037a Lm True GREEK YPOGEGRAMMENI
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM

Should not be XID_CONTINUE but are:

00b7 Po False MIDDLE DOT
0387 Po False GREEK ANO TELEIA
1369 No False ETHIOPIC DIGIT ONE
136a No False ETHIOPIC DIGIT TWO
136b No False ETHIOPIC DIGIT THREE
136c No False ETHIOPIC DIGIT FOUR
136d No False ETHIOPIC DIGIT FIVE
136e No False ETHIOPIC DIGIT SIX
136f No False ETHIOPIC DIGIT SEVEN
1370 No False ETHIOPIC DIGIT EIGHT
1371 No False ETHIOPIC DIGIT NINE
19da No False NEW TAI LUE THAM DIGIT ONE
200c Cf False ZERO WIDTH NON-JOINER
200d Cf False ZERO WIDTH JOINER
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL
30fb Po False KATAKANA MIDDLE DOT
ff65 Po False HALFWIDTH KATAKANA MIDDLE DOT

Many of these exceptions are specified in UAX #31, Section 5.1, NFKC Modifications.
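One possible way to cross-check the lists above without new API is sketched here. It rests on an assumption about CPython's implementation, not a documented guarantee: `str.isidentifier()` accepts a one-character string when the character is `_` or has XID_Start, and accepts `"a" + c` when `c` has XID_Continue. The helper names are hypothetical, and the tuple layout (code point, category, category prediction, name) is a guess at the columns used in the lists.

```python
import sys
import unicodedata

START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())


def xid_start_via_str(c: str) -> bool:
    # Assumption: CPython special-cases "_" and otherwise tests XID_Start.
    return c != "_" and c.isidentifier()


def xid_continue_via_str(c: str) -> bool:
    # Assumption: characters after the first are tested against XID_Continue.
    return ("a" + c).isidentifier()


def start_mismatches(codepoints=range(sys.maxunicode + 1)):
    """Yield code points where the category guess and XID_Start disagree."""
    for cp in codepoints:
        c = chr(cp)
        category = unicodedata.category(c)
        predicted = category in START_CATEGORIES
        if predicted != xid_start_via_str(c):
            yield cp, category, predicted, unicodedata.name(c, "<unnamed>")


# Print the ASCII-range mismatches in the same layout as the lists above.
for cp, category, predicted, name in start_mismatches(range(0x80)):
    print(f"{cp:04x} {category} {predicted} {name}")
```

Calling `start_mismatches()` with no argument scans the whole code space. The results depend on the Unicode version bundled with the running interpreter.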

Proposal

I suggest adding two functions to the module, `unicodedata.isidstart(chr)` and `unicodedata.isidcontinue(chr)`. These return `True` if `chr` appears in the `DerivedCoreProperties.txt` file as `XID_Start` or `XID_Continue`, respectively.
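As a rough illustration of the proposed semantics, the following pure-Python sketch reads lines in the `DerivedCoreProperties.txt` format and answers both queries. It is not the proposed implementation: the class `IdentifierProperties` and its helpers are hypothetical, and `SAMPLE` is a hand-abridged excerpt in the file's format rather than the real data.

```python
from bisect import bisect_right

# Hand-abridged sample in the DerivedCoreProperties.txt line format.
SAMPLE = """\
0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; XID_Start # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0030..0039    ; XID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; XID_Continue # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
005F          ; XID_Continue # Pc       LOW LINE
0061..007A    ; XID_Continue # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00B7          ; XID_Continue # Po       MIDDLE DOT
"""


def parse_property_ranges(lines, prop):
    """Collect sorted (first, last) code point ranges listed for `prop`."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        ranges.append((lo, int(last, 16) if last else lo))
    ranges.sort()
    return ranges


def in_ranges(ranges, cp):
    """Binary-search the sorted ranges for one containing `cp`."""
    i = bisect_right(ranges, (cp, 0x110000)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


class IdentifierProperties:
    def __init__(self, lines):
        lines = list(lines)
        self._start = parse_property_ranges(lines, "XID_Start")
        self._continue = parse_property_ranges(lines, "XID_Continue")

    def isidstart(self, c):
        return in_ranges(self._start, ord(c))

    def isidcontinue(self, c):
        return in_ranges(self._continue, ord(c))


props = IdentifierProperties(SAMPLE.splitlines())
print(props.isidstart("_"), props.isidcontinue("_"))
```

With a local copy of the real file, the same class could be built from the open file object instead of `SAMPLE`. A built-in version would presumably use the database already compiled into the interpreter rather than parsing text at run time.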

CPython versions tested on:

3.13

Operating systems tested on:

Windows
