unicode-rs/unicode-segmentation

Unicode sentence boundaries #24

Merged

Manishearth merged 4 commits into unicode-rs:master from tomcumming:master

May 15, 2019

Conversation

Contributor

tomcumming commented May 6, 2017 (edited)

This is an implementation of the sentence breaks specification, including changes to the Python files to grab the sentence break test data.

I welcome any advice for improving it.

Fetch and generate sentence tests, property table

fa10dd3

tomcumming changed the title ~~Code review please~~ Code review please (Unicode sentence boundary partial implementation)

May 6, 2017

tomcumming force-pushed the master branch from 14dbeb8 to 93b0d56

May 16, 2017 20:08

Added forward iterator for unicode sentences

7ac6f29

Passes all tests in the examples provided here: http://www.unicode.org/Public/9.0.0/ucd/auxiliary/SentenceBreakTest.txt

tomcumming force-pushed the master branch from 93b0d56 to 7ac6f29

May 16, 2017 20:10

tomcumming changed the title ~~Code review please (Unicode sentence boundary partial implementation)~~ Unicode sentence boundaries

May 16, 2017

Member

Manishearth commented May 16, 2017

Sorry for letting this stagnate! I'm too busy to review this right now, but will try to get to it soon!

Sentence boundaries are something I've wanted implemented here for a while 😄

Member

Manishearth commented May 26, 2017

(still no time to look at this, apologies. Really hope to get to it soon)

rth mentioned this pull request

May 3, 2019

Add sentence splitter rth/vtext#51

Closed

Contributor

rth commented May 6, 2019

This PR looks quite good and it would be great to have this functionality. Could I help in any way?

Contributor, Author

tomcumming commented May 7, 2019

@rth I can split this out into another crate if required?

Contributor

rth commented May 8, 2019

@tomcumming I would really like to use this implementation (and compare it with other sentence-splitting approaches) in rth/vtext#51. Having this implementation in the unicode-segmentation crate would be ideal, but if it is unlikely to be reviewed in the near future, maybe putting it in some other crate could be a workaround.

Any chance, @Manishearth, that you would have some review bandwidth for this, or could suggest someone who could review it?

Member

Manishearth commented May 9, 2019

@rth mind doing a review yourself as well? I can also try and review, but I don't think I'd be able to give this a proper thorough review and would feel more comfortable if more people have gone through it.

Contributor

rth commented May 9, 2019

Sure, I'll try to review it in the next few days.

Manishearth reviewed

May 9, 2019

Code looks correct! Mostly want more documentation.

src/lib.rs (resolved)

src/sentence.rs (resolved)

src/sentence.rs (outdated, resolved)

src/sentence.rs (resolved)

tomcumming added 2 commits

May 13, 2019 19:06

Adds unicode_sentences and split_sentence_bound_indices

50058a5

Documentation and code reorg

9c7abf2

Contributor, Author

tomcumming commented May 13, 2019

@Manishearth @rth I have updated the PR, including the requested changes.

Manishearth approved these changes

May 13, 2019

Member

Manishearth commented May 13, 2019

Looks good! @rth want to do a second review?

rth reviewed

May 14, 2019

Thanks a lot for the review, @Manishearth!

I went through the code in more detail; I find it quite readable and I don't really have anything to add. (Though I am fairly new to Rust and don't know that much about the Unicode segmentation specs.)

I can confirm that src/tables.rs and src/testdata.rs in this PR can be regenerated in their current state with the included Python scripts, but doing so requires the following change in scripts/unicode.py:

-        os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
+        os.system("curl -O http://www.unicode.org/Public/9.0.0/ucd/%s"

as otherwise data for the latest Unicode 12.0 is downloaded.
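The version pin described in this comment could be sketched as a small helper. This is only an illustration, not the contents of scripts/unicode.py: the names `UNICODE_VERSION`, `ucd_url`, and `curl_command` are hypothetical, and the only details taken from this thread are the `http://www.unicode.org/Public/9.0.0/ucd/` base URL and the `curl -O` invocation.

```python
# Hypothetical sketch: pin the Unicode data version explicitly instead of
# relying on the UNIDATA path, which serves the latest Unicode release.
UNICODE_VERSION = (9, 0, 0)  # assumption: the version the tables were generated from


def ucd_url(filename, version=UNICODE_VERSION):
    """Build the download URL for a UCD file of one specific Unicode version."""
    return "http://www.unicode.org/Public/%d.%d.%d/ucd/%s" % (version + (filename,))


def curl_command(filename, version=UNICODE_VERSION):
    """Return the curl invocation as an argument list (nothing is downloaded here)."""
    return ["curl", "-O", ucd_url(filename, version)]


if __name__ == "__main__":
    # Print the command that would fetch the sentence break test data.
    print(" ".join(curl_command("auxiliary/SentenceBreakTest.txt")))
```

Keeping the version in one constant means regenerating the tables for a newer Unicode release would be a single-line change, and the generated files would stay reproducible until that constant is bumped deliberately.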

src/lib.rs (resolved)

rth reviewed

May 14, 2019

src/sentence.rs (resolved)

Contributor, Author

tomcumming commented May 15, 2019

Fixing the URL for test data should probably be another PR

Contributor

rth commented May 15, 2019

> Fixing the URL for test data should probably be another PR

Yes, I'll do it.

Thanks, @tomcumming. I don't have any other comments.

rth mentioned this pull request

May 15, 2019

MAINT Fixes for Python scripts #54

Merged

Manishearth merged commit c7a6b6f into unicode-rs:master

May 15, 2019

Member

Manishearth commented May 15, 2019

Thank you! I'll push a release soonish

Member

Manishearth commented May 15, 2019

Published 1.3.0. Thanks for the work on this, and sorry for the delay in reviewing!
