Description
Bug report
Bug description:
With the unicodedata module, it is possible to determine whether a Unicode character is a valid identifier start or identifier continuation character, except in a few cases.
The method is to look at unicodedata.category(c).
A start character has category in "Lu Ll Lt Lm Lo Nl Pc".split().
A continue character has category in "Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split().
However, there are several codepoints which don't match these criteria, either because they are not that type of character or because their category is different.
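For reference, the category-based test described above can be sketched as follows. The helper names category_says_start and category_says_continue are made up for illustration and are not part of unicodedata; only unicodedata.category is a real call.

```python
import unicodedata

# Category sets exactly as given in the description above.
START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())
CONTINUE_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split())


def category_says_start(c: str) -> bool:
    """Hypothetical helper: guess 'identifier start' from the general category."""
    return unicodedata.category(c) in START_CATEGORIES


def category_says_continue(c: str) -> bool:
    """Hypothetical helper: guess 'identifier continue' from the general category."""
    return unicodedata.category(c) in CONTINUE_CATEGORIES
```

This is only the approximation; it is the test whose results disagree with the real properties for the code points listed in this report.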
Here is a complete list of the exceptions, on Python 3.13 and Unicode version 16.0:
Should be XID_START but are not:

005f Pc True LOW LINE
037a Lm True GREEK YPOGEGRAMMENI
0e33 Lo True THAI CHARACTER SARA AM
0eb3 Lo True LAO VOWEL SIGN AM
203f Pc True UNDERTIE
2040 Pc True CHARACTER TIE
2054 Pc True INVERTED UNDERTIE
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe33 Pc True PRESENTATION FORM FOR VERTICAL LOW LINE
fe34 Pc True PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
fe4d Pc True DASHED LOW LINE
fe4e Pc True CENTRELINE LOW LINE
fe4f Pc True WAVY LOW LINE
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM
ff3f Pc True FULLWIDTH LOW LINE
ff9e Lm True HALFWIDTH KATAKANA VOICED SOUND MARK
ff9f Lm True HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

Should not be XID_START but are:

1885 Mn False MONGOLIAN LETTER ALI GALI BALUDA
1886 Mn False MONGOLIAN LETTER ALI GALI THREE BALUDA
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL

Should be XID_CONTINUE but are not:

037a Lm True GREEK YPOGEGRAMMENI
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM

Should not be XID_CONTINUE but are:

00b7 Po False MIDDLE DOT
0387 Po False GREEK ANO TELEIA
1369 No False ETHIOPIC DIGIT ONE
136a No False ETHIOPIC DIGIT TWO
136b No False ETHIOPIC DIGIT THREE
136c No False ETHIOPIC DIGIT FOUR
136d No False ETHIOPIC DIGIT FIVE
136e No False ETHIOPIC DIGIT SIX
136f No False ETHIOPIC DIGIT SEVEN
1370 No False ETHIOPIC DIGIT EIGHT
1371 No False ETHIOPIC DIGIT NINE
19da No False NEW TAI LUE THAM DIGIT ONE
200c Cf False ZERO WIDTH NON-JOINER
200d Cf False ZERO WIDTH JOINER
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL
30fb Po False KATAKANA MIDDLE DOT
ff65 Po False HALFWIDTH KATAKANA MIDDLE DOT

Many of these exceptions are specified in UAX #31, Section 5.1, NFKC Modifications.
Proposal
I suggest adding two functions to the module, unicodedata.isidstart(chr) and unicodedata.isidcontinue(chr). These return True if chr appears in the DerivedCoreProperties.txt file as XID_Start or XID_Continue, respectively.
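Until something like this exists, the behaviour can be roughly approximated with str.isidentifier(). The sketch below is a workaround under stated assumptions, not the proposed implementation. It assumes that CPython's str.isidentifier() accepts exactly XID_Start plus the underscore as a first character and XID_Continue for every later character. The names is_xid_start and is_xid_continue are hypothetical.

```python
def is_xid_start(c: str) -> bool:
    """Hypothetical stand-in for the proposed unicodedata.isidstart().

    Assumption: str.isidentifier() treats a one-character string as valid
    when the character is XID_Start or the underscore, so "_" is excluded
    explicitly.
    """
    if len(c) != 1:
        raise TypeError("expected a single character")
    return c != "_" and c.isidentifier()


def is_xid_continue(c: str) -> bool:
    """Hypothetical stand-in for the proposed unicodedata.isidcontinue().

    Assumption: after a valid start character ("a" here), isidentifier()
    accepts exactly the XID_Continue characters.
    """
    if len(c) != 1:
        raise TypeError("expected a single character")
    return ("a" + c).isidentifier()
```

For example, is_xid_continue("\u00b7") is expected to be True even though MIDDLE DOT has category Po, which is one of the mismatches listed above. A function in unicodedata would avoid relying on this indirect trick.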
CPython versions tested on:
3.13
Operating systems tested on:
Windows