
gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… #101388


Merged
corona10 merged 5 commits into python:main from corona10:gh-101372
Feb 6, 2023

Conversation

@corona10 (Member) commented Jan 28, 2023 (edited by bedevere-bot)

@corona10 (Member, Author)

@serhiy-storchaka

Every range of characters is a candidate for testing,
so I decided to use a sampling approach rather than pick specific cases.

Test script

    import unicodedata

    with open('foo.out', 'w') as f:
        for x in range(0x110000):
            for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
                norm = unicodedata.ucd_3_2_0.normalize(form, chr(x))
                if not unicodedata.ucd_3_2_0.is_normalized(form, norm):
                    f.write(f'{str(x)},{form}\n')

AS-IS

    (.oss) ➜  cpython git:(main) ✗ ./python.exe gh-101372.py
    (.oss) ➜  cpython git:(main) ✗ wc -l foo.out
     4456448 foo.out

TO-BE

    (.oss) ➜  cpython git:(gh-101372) ✗ ./python.exe gh-101372.py
    (.oss) ➜  cpython git:(gh-101372) ✗ wc -l foo.out
           0 foo.out

@corona10 (Member, Author)

@serhiy-storchaka I will merge this PR by next week; please let me know if any changes are needed.

@serhiy-storchaka (Member)

I am not happy with the provided tests.

Testing the whole range of Unicode characters is slow (a few seconds on my computer), so if it is performed it should be decorated with @requires_resource('cpu'). Testing only a small random sample can miss errors, and the test result will be hard to reproduce. It works for this issue because is_normalized() was broken for most characters, but it would not work for other types of bugs.
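For illustration only (this is not the code from the PR), the exhaustive check described above could be sketched as follows. The helper `find_failures` and the class name `ExhaustiveIsNormalizedTest` are hypothetical, and the fallback decorator is an assumption for installations where CPython's `test.support` package is not available.

```python
import unicodedata
import unittest

try:
    # test.support ships with CPython's own test suite; it may be missing
    # from some installations, so fall back to a no-op decorator (assumption).
    from test.support import requires_resource
except ImportError:
    def requires_resource(resource):
        return lambda test_item: test_item

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')


def find_failures(db, code_points):
    """Return (code point, form) pairs whose normalized form is rejected.

    `db` is any object providing normalize() and is_normalized(), such as
    the unicodedata module itself or unicodedata.ucd_3_2_0.
    """
    failures = []
    for cp in code_points:
        for form in FORMS:
            norm = db.normalize(form, chr(cp))
            if not db.is_normalized(form, norm):
                failures.append((cp, form))
    return failures


class ExhaustiveIsNormalizedTest(unittest.TestCase):  # hypothetical name

    @requires_resource('cpu')  # skipped by regrtest unless run with -u cpu
    def test_all_code_points(self):
        # Walk every code point, so the result is fully reproducible.
        self.assertEqual(find_failures(unicodedata, range(0x110000)), [])
```

Passing `unicodedata.ucd_3_2_0` as `db` would exercise the old database in the same way; on an interpreter without this fix that call is expected to report failures.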

The test for multi-character strings is not what I meant. It should test not only normalized sequences but also non-normalized ones. For example, '\ufb2c' is normalized to '\u05e9\u05bc\u05c1'. Therefore '\ufb2c' should be reported as not normalized, and '\u05e9\u05bc\u05c1' as normalized. But '\u05e9\u05c1\u05bc', created by swapping the last two characters, is normalized to the same sequence '\u05e9\u05bc\u05c1', so it should also be reported as not normalized, even though it looks exactly the same as the original character. I think we need this kind of test.
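As a minimal sketch of the kind of check described above (not the tests that were later added to CPython), the three strings can be run through all four normalization forms. The names `SAMPLES` and `normalization_report` are hypothetical; the default argument uses the current Unicode database, and `unicodedata.ucd_3_2_0` can be passed instead on an interpreter that includes this fix.

```python
import unicodedata

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')

# U+05E9 SHIN, U+05BC DAGESH (combining class 21), U+05C1 SHIN DOT (class 24)
CANONICAL = '\u05e9\u05bc\u05c1'

SAMPLES = {
    'composed': '\ufb2c',               # HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT
    'canonical': CANONICAL,             # marks in canonical order
    'swapped': '\u05e9\u05c1\u05bc',    # same marks, non-canonical order
}


def normalization_report(db=unicodedata):
    """Map (form, sample name) to (normalizes to CANONICAL, is_normalized)."""
    report = {}
    for form in FORMS:
        for name, text in SAMPLES.items():
            report[form, name] = (
                db.normalize(form, text) == CANONICAL,
                db.is_normalized(form, text),
            )
    return report


if __name__ == '__main__':
    for key, value in normalization_report().items():
        print(key, value)
```

All three strings normalize to the same sequence in every form, but only the canonical one should be reported as already normalized.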

I tried to write more interesting tests for is_normalized() and found that UCD 3.2.0 is mostly untested. There are also not many tests for the differences between UCD 3.2.0 and the current version. I am writing new tests.

I propose to merge your PR without tests. The bugfix itself is obvious, and the tests I will add later.

@corona10 (Member, Author) commented Feb 6, 2023 (edited)

> I propose to merge your PR without tests. The bugfix itself is obvious, and the tests I will add later.

Okay, got it. Please let me know once you submit the patch with the tests; I may learn a lot from it.

@miss-islington (Contributor)

Thanks @corona10 for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10, 3.11.
🐍🍒⛏🤖

@corona10 deleted the gh-101372 branch February 6, 2023 04:58
@bedevere-bot

GH-101597 is a backport of this pull request to the 3.11 branch.

@bedevere-bot removed the "needs backport to 3.11" label Feb 6, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 6, 2023
… UCD 3… (pythongh-101388) (cherry picked from commit 9ef7e75) Co-authored-by: Dong-hee Na <donghee.na@python.org>
@bedevere-bot

GH-101598 is a backport of this pull request to the 3.10 branch.

@bedevere-bot removed the "needs backport to 3.10" label Feb 6, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 6, 2023
… UCD 3… (pythongh-101388) (cherry picked from commit 9ef7e75) Co-authored-by: Dong-hee Na <donghee.na@python.org>
miss-islington added a commit that referenced this pull request Feb 6, 2023
gh-101388) (cherry picked from commit 9ef7e75) Co-authored-by: Dong-hee Na <donghee.na@python.org>
miss-islington added a commit that referenced this pull request Feb 6, 2023
gh-101388) (cherry picked from commit 9ef7e75) Co-authored-by: Dong-hee Na <donghee.na@python.org>
Reviewers

@benjaminp: awaiting requested review
@serhiy-storchaka: awaiting requested review


4 participants
@corona10, @serhiy-storchaka, @miss-islington, @bedevere-bot
