Description
I'm not sure whether it's a bug or expected behaviour, but it seems odd so I figure reporting it is a good idea: while a precomposed character is considered "a word" by the regex engine (specifically `\w`), its decomposed form is not, because a diacritic is not considered part of a word.
    >>> import re, unicodedata
    >>> s = "ö"
    >>> list(s)
    ['ö']
    >>> list(unicodedata.normalize('NFD', s))
    ['o', '̈']
    >>> re.fullmatch(r'\w+', s)
    <re.Match object; span=(0, 1), match='ö'>
    >>> re.fullmatch(r'\w+', unicodedata.normalize('NFD', s))
This leads to odd effects when ingesting and filtering decomposed data.
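For anyone hitting the same thing, here is a minimal sketch of a possible workaround: compose the input to NFC before matching. The helper name `fullmatch_word` is made up for illustration and is not part of `re`. It only helps when a precomposed form exists; base-plus-mark sequences without one would still fail to match.

```python
import re
import unicodedata


def fullmatch_word(text):
    """Hypothetical helper: compose to NFC, then apply \\w+ to the whole string."""
    return re.fullmatch(r"\w+", unicodedata.normalize("NFC", text))


# Decomposed form of U+00F6: 'o' followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", "\u00f6")

# The combining mark is a nonspacing mark, which \w does not accept.
print(unicodedata.category(decomposed[1]))      # Mn
print(re.fullmatch(r"\w+", decomposed))         # None
print(fullmatch_word(decomposed) is not None)   # True
```

This sidesteps the behaviour rather than explaining it, so the question of whether `\w` should accept combining marks still stands.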
Tested on 3.8.13, 3.10.6, and 3.11.1 (all installed via pyenv) on Mint 21.1.