Description
Bug report
3.8 adds the `is_normalized` function to the `unicodedata` module, which is also available as a method on the legacy `unicodedata.ucd_3_2_0` database. It is supposed to check whether a string is equal to its normalization in a given form, but without having to normalize and compare.
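The invariant described above can be checked mechanically. The following is a minimal sketch, not part of the original report: `invariant_violations` and `samples` are hypothetical names introduced for illustration, and the helper simply compares `is_normalized` against the normalize-and-compare definition for any object exposing both methods.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def invariant_violations(db, strings, forms=FORMS):
    """Return (form, string) pairs where is_normalized disagrees with
    the reference definition: normalize(form, s) == s.

    `db` is either the unicodedata module or a database object such as
    unicodedata.ucd_3_2_0; both expose normalize and is_normalized.
    """
    bad = []
    for form in forms:
        for s in strings:
            expected = db.normalize(form, s) == s
            if db.is_normalized(form, s) != expected:
                bad.append((form, s))
    return bad


# Hypothetical sample: empty string, ASCII, precomposed and decomposed
# e-acute, and a compatibility ligature.
samples = ["", "!", "abc", "\u00e9", "e\u0301", "\ufb01"]

print("current database:", invariant_violations(unicodedata, samples))
print("legacy database: ", invariant_violations(unicodedata.ucd_3_2_0, samples))
```

On an affected interpreter the second list should be non-empty, while the first should stay empty; on a build that contains a fix, both should be empty.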
However, the legacy version does not maintain the expected invariant. In fact, it reports that every single-character string is not normalized, regardless of the normalization form chosen. Presumably, the result is the same for every non-empty string. (It appears that the empty string works because it is special-cased at lines 871-874.)
Example:
    >>> import unicodedata
    >>> unicodedata.ucd_3_2_0.normalize('NFC', '!') == '!'
    True
    >>> unicodedata.ucd_3_2_0.is_normalized('NFC', '!')
    False
    >>> any(unicodedata.ucd_3_2_0.is_normalized(form, chr(x)) for form in ('NFC', 'NFD', 'NFKC', 'NFKD') for x in range(0x110000))
    False
The bug appears to be at lines 801-804 of `unicodedata.c`:
    /* UCD 3.2.0 is requested, quickchecks must be disabled. */
    if (UCD_Check(self)) {
        return NO;
    }
I believe the `NO` should say `MAYBE` instead. The `NO` value appears to indicate that the quickcheck has determined that the string is not normalized, contrary to both the comment and the expected behaviour.
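To illustrate why the choice between `NO` and `MAYBE` matters, here is a simplified Python model of a three-valued quickcheck. It is a sketch under assumptions, not a transcription of `unicodedata.c`: the names `QuickCheck`, `is_normalized_model`, `legacy_quickcheck_buggy` and `legacy_quickcheck_proposed` are hypothetical, and the model assumes that a `MAYBE` result falls back to normalizing and comparing.

```python
from enum import Enum
import unicodedata


class QuickCheck(Enum):
    YES = "yes"      # definitely normalized
    NO = "no"        # definitely not normalized
    MAYBE = "maybe"  # undecided: fall back to the slow path


def is_normalized_model(db, form, s, quickcheck):
    """Model of is_normalized built on a three-valued quickcheck."""
    result = quickcheck(form, s)
    if result is QuickCheck.YES:
        return True
    if result is QuickCheck.NO:
        return False
    # MAYBE: the quickcheck could not decide, so normalize and compare.
    return db.normalize(form, s) == s


def legacy_quickcheck_buggy(form, s):
    # Models the reported behaviour: the legacy database always answers NO.
    return QuickCheck.NO


def legacy_quickcheck_proposed(form, s):
    # Models the suggested change: quickchecks disabled means "undecided".
    return QuickCheck.MAYBE


legacy = unicodedata.ucd_3_2_0
print(is_normalized_model(legacy, "NFC", "!", legacy_quickcheck_buggy))
print(is_normalized_model(legacy, "NFC", "!", legacy_quickcheck_proposed))
```

In this model, returning `NO` short-circuits to "not normalized" for every input, whereas returning `MAYBE` defers to the normalize-and-compare result, which matches the behaviour the comment in the C source describes.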
Your environment
    $ python
    Python 3.8.10 (default, Nov 14 2022, 12:59:47)
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.