- Notifications
You must be signed in to change notification settings - Fork61
Unicode sentence boundaries#24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Passes all tests in the examples provided here:http://www.unicode.org/Public/9.0.0/ucd/auxiliary/SentenceBreakTest.txt
Sorry for letting this stagnate! I'm rather busy to review this right now, but will try to get to it soon! Sentence boundaries is something I've wanted implemented here for a while 😄 |
(still no time to look at this, apologies. Really hope to get to it soon) |
This PR looks quite good and it would be great to have this functionality. Could I help in any way ? |
@rth I can split this out into another crate if required? |
@tomcumming I would really like to use this implementation (and compare it with other sentences splitting approaches) inrth/vtext#51 . Having this implementation in the Any chance@Manishearth that you would have some review bandwidth for this, or could suggest someone who could review it? |
@rth mind doing a review yourself as well? I can also try and review, but I don't think I'd be able to give this a proper thorough review and would feel more comfortable if more people have gone through it. |
Sure, I'll try to review it in the next few days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Code looks correct! Mostly want more documentation.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
@Manishearth@rth I have updated the PR including requested changes |
Looks good!@rth want to do a second review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks a lot for the review@Manishearth !
I went through the code in more detail, I find it quite readable and I don't really have anything to add. (Though I am fairly new to rust and don't know that much about Unicode segmentation specs).
I can confirm thatsrc/tables.rs
andsrc/testdata.rs
in this PR can be re-generated in their current state with the included python scripts, but they require setting,
scripts/unicode.py
- os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"+ os.system("curl -O http://www.unicode.org/Public/9.0.0/ucd/%s"
as otherwise data for latest Unicode 12.0 is downloaded.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Fixing the URL for test data should probably be another PR |
Yes, I'll do it. Thanks@tomcumming I don't have any other comments. |
Thank you! I'll push a release soonish |
Published 1.3.0. Thanks for the work on this, and sorry for the delay in reviewing! |
Uh oh!
There was an error while loading.Please reload this page.
This is an implementation of thesentence breaks specification, including changes to the python files to grab the sentence break test data.
I welcome any advice for improving.