gh-74902: add unicode grapheme cluster break algorithm #2673
the-knights-who-say-ni commented Jul 11, 2017
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. Thanks again for your contribution and we look forward to looking at it!
0f82f82 to 62fd6e0
62fd6e0 to a47de54
Vermeille commented Aug 2, 2017
Hello? Someone here?
Modules/unicodedata.c Outdated
| 0, /*tp_setattro*/ | ||
| 0, /*tp_as_buffer*/ | ||
| Py_TPFLAGS_DEFAULT, | ||
| "Internal grapheme cluster iterator object.", /* tp_doc */ |
I think the words "internal" and "object" are redundant.
"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?
Vermeille commented Jan 11, 2018
Sorry for the long wait. Are we good concerning the changes? Anything to add? |
brettcannon commented Feb 2, 2018
To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request. If/when the requested changes have been made, please leave a comment that says, |
csabella commented May 23, 2020
@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks! |
This PR is stale because it has been open for 30 days with no activity. |
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
serhiy-storchaka left a comment
I apologize that it took so long to start reviewing this PR seriously.
Now we need this algorithm to calculate the width of text in columns, which is needed to support wide characters in many parts of the stdlib (REPL, tracebacks, etc). So we will add its implementation anyway. If you are busy or have lost interest, I will finish this work myself (keeping your credit), but if you are still interested, I would be happy to work together.
I wonder, what is the source of the state machine table? Did you create it from the original rules or from the table in GraphemeBreakTest.html? Or did you copy it from another source? I am afraid that it is outdated and only supports legacy grapheme clusters. I can fix this, but maybe you already have a ready solution?
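The column-width use case mentioned in this comment can be illustrated with a rough sketch. It is not the approach taken in this PR: it uses only the existing `unicodedata.east_asian_width` and `unicodedata.combining` functions, the helper name `approx_width` is hypothetical, and the rules (2 columns for wide or fullwidth characters, 0 for combining marks, 1 otherwise) are a simplifying assumption that ignores emoji sequences and other multi-code-point clusters.

```python
import unicodedata


def approx_width(text):
    """Approximate the terminal column width of text (hypothetical helper)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are assumed to occupy no column of their own.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters are assumed to take two columns.
            width += 2
        else:
            width += 1
    return width


print(approx_width("abc"))
print(approx_width("日本"))
print(approx_width("e\u0301"))
```

A per-code-point count like this breaks down for sequences such as emoji joined with ZWJ, which is where grapheme cluster segmentation becomes necessary.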
| self: self | ||
| unistr: unicode | ||
| start: int = 0 |
It should be Py_ssize_t. Some other variables should be Py_ssize_t, not int.
| self: self | ||
| unistr: unicode | ||
| start: int = 0 | ||
| end: Py_ssize_t(c_default="PY_SSIZE_T_MAX - 1") = sys.maxsize |
It should be PY_SSIZE_T_MAX.
Although I am not sure that the end parameter is needed. The user can simply stop iteration at any time.
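To illustrate the point about stopping iteration, here is a minimal sketch. `iter_segments` is a hypothetical stand-in for the proposed iterator (it yields one code point per segment, not real grapheme clusters), and `segments_before` is likewise an illustrative name. The caller bounds the iteration itself with `itertools.takewhile` instead of passing an `end` argument.

```python
from itertools import takewhile


def iter_segments(text):
    """Stand-in iterator: yields (offset, segment) pairs, one per code point."""
    for offset, ch in enumerate(text):
        yield offset, ch


def segments_before(text, end):
    """Collect segments that start before offset `end`, then stop iterating."""
    bounded = takewhile(lambda pair: pair[0] < end, iter_segments(text))
    return [segment for _, segment in bounded]


print(segments_before("abcdef", 3))
```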
| @staticmethod | ||
| def check_version(testfile): | ||
| hdr = testfile.readline() | ||
| return unicodedata.unidata_version in hdr |
What does the file header look like?
With string contains tests, I worry about things like "8.0" in "18.0" matching wrongly. Could the full line be compared?
# GraphemeBreakTest-17.0.0.txt
We have the same check for normalization tests.
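The substring concern raised in this thread can be shown with a short sketch. The header format is taken from the reply above; the function name `header_matches_version` and the version strings used in the demonstration (including a future-looking "18.0.0") are illustrative assumptions, not code from the PR.

```python
import re


def header_matches_version(header_line, unidata_version):
    """Return True only if the header names exactly the given Unicode version."""
    match = re.fullmatch(
        r"# GraphemeBreakTest-(\d+\.\d+\.\d+)\.txt", header_line.strip()
    )
    return match is not None and match.group(1) == unidata_version


header = "# GraphemeBreakTest-18.0.0.txt\n"  # hypothetical header line

# A plain substring test accepts the wrong version:
print("8.0.0" in header)
# Comparing the full version field rejects it:
print(header_matches_version(header, "8.0.0"))
print(header_matches_version(header, "18.0.0"))
```

Whether the stricter check is worth it is a judgment call; as the reply notes, the normalization tests already use the simpler substring check.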
AA-Turner commented Dec 22, 2025
Closing in favour of #143076. A |
I have added GraphemeBreakProperty to UnicodeData.
An automaton to compute the rules for breaking grapheme clusters according to TR29 is included. It passes all the tests provided in GraphemeBreakTest.txt.
https://bugs.python.org/issue30717
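
For readers unfamiliar with grapheme cluster breaking, the following is a deliberately simplified sketch of the idea, not the automaton from this PR and not a full TR29 implementation. It assumes only three behaviours: CR LF stays together, a break follows any other CR or LF, and no break occurs before a combining mark (general categories Mn, Me, Mc) or ZWJ. Hangul syllables, regional indicators, emoji sequences and the remaining rules are ignored. The names `iter_clusters` and `_continues_cluster` are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"


def _continues_cluster(ch):
    """Approximate the Extend, SpacingMark and ZWJ classes (assumption)."""
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc")


def iter_clusters(text):
    """Yield approximate grapheme clusters of text as substrings."""
    start = 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == "\r" and cur == "\n":
            # Never break between CR and LF.
            continue
        if prev not in ("\r", "\n") and _continues_cluster(cur):
            # Do not break before a combining mark or ZWJ.
            continue
        yield text[start:i]
        start = i
    if start < len(text):
        yield text[start:]


print(list(iter_clusters("e\u0301a")))
print(list(iter_clusters("a\r\nb")))
```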