Diacritics are not considered part of words #101421

Open

Labels

stdlib (Python modules in the Lib dir), topic-regex, topic-unicode, type-bug (An unexpected behavior, bug, or error)

Description

xmo-odoo opened on Jan 30, 2023

I'm not sure whether this is a bug or expected behaviour, but it seems odd, so I figured reporting it was a good idea: while a precomposed character is considered a word character by the regex engine (specifically `\w`), its decomposed form is not, because a combining diacritic is not considered part of a word.

    >>> import re, unicodedata
    >>> s = "ö"
    >>> list(s)
    ['ö']
    >>> list(unicodedata.normalize('NFD', s))
    ['o', '̈']
    >>> re.fullmatch(r'\w+', s)
    <re.Match object; span=(0, 1), match='ö'>
    >>> re.fullmatch(r'\w+', unicodedata.normalize('NFD', s))
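To see why the last call finds no match, it may help to inspect each code point of the decomposed string. The sketch below is only illustrative; `describe` is a hypothetical helper, and it assumes that `\w` on `str` patterns follows `str.isalnum()` plus the underscore, which is how I understand the `re` documentation.

```python
import re
import unicodedata


def describe(ch):
    # Hypothetical helper: collect the properties that seem relevant to \w.
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),           # general category, e.g. 'Ll' or 'Mn'
        ch.isalnum(),                       # what \w appears to be based on
        re.fullmatch(r"\w", ch) is not None,
    )


decomposed = unicodedata.normalize("NFD", "ö")  # 'o' followed by U+0308

for ch in decomposed:
    print(describe(ch))
```

The combining diaeresis (U+0308) has general category `Mn` (nonspacing mark), is not alphanumeric, and is therefore rejected by `\w` on its own.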

This leads to odd effects when ingesting and filtering decomposed data.
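For anyone hitting the same thing, two possible workarounds using only the standard library are sketched below. Both helpers (`fullmatch_nfc`, `is_word_with_marks`) are hypothetical names made up for illustration, not part of `re`. Recomposing to NFC only helps when a precomposed code point exists; the second helper is a rough approximation that treats any Unicode mark as a word character.

```python
import re
import unicodedata

WORD = re.compile(r"\w+")


def fullmatch_nfc(text):
    # Hypothetical helper: recompose before matching, so that
    # 'o' + U+0308 becomes the single code point 'ö' again.
    return WORD.fullmatch(unicodedata.normalize("NFC", text))


def is_word_with_marks(text):
    # Hypothetical helper: accept a string that starts with a \w character
    # and otherwise consists only of \w characters and code points in a
    # Unicode Mark category (Mn, Mc, Me).
    if not text or not WORD.fullmatch(text[0]):
        return False
    return all(
        WORD.fullmatch(ch) or unicodedata.category(ch).startswith("M")
        for ch in text
    )


decomposed = unicodedata.normalize("NFD", "ö")
print(fullmatch_nfc(decomposed))
print(is_word_with_marks(decomposed))
```

Neither helper changes what `\w` itself matches; they only work around it at the call site.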

Tested on 3.8.13, 3.10.6, and 3.11.1 (all installed via pyenv) on Mint 21.1.

Metadata

Assignees

No one assigned

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
