Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Add ascii fast path for unicode_word_indices and unicode_words#147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
Manishearth merged 13 commits intounicode-rs:masterfromPSeitz:master
Jul 28, 2025

Conversation

@PSeitz-dd
Copy link
Contributor

@PSeitz-ddPSeitz-dd commentedJul 11, 2025
edited
Loading

This PR adds a fast path forunicode_word_indices andunicode_words.

  • Add ascii version behind ascii check
  • Add benchmark
  • Add enum Iterator to wrap ascii and non-ascii into a single iterator
  • Add proptest to make sure ascii and non-ascii methods behave exactly the same.

Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.

unicode_word_indices/log                        time:   [209.68 ns 212.35 ns 215.32 ns]                        thrpt:  [761.82 MiB/s 772.45 MiB/s 782.29 MiB/s]                 change:                        time:   [-82.313% -82.117% -81.888%] (p = 0.00 < 0.05)                        thrpt:  [+452.12% +459.18% +465.38%]                        Performance has improved.Found 3 outliers among 100 measurements (3.00%)  3 (3.00%) high mildunicode_word_indices/english                        time:   [426.95 µs 428.03 µs 429.18 µs]                        thrpt:  [110.42 MiB/s 110.72 MiB/s 110.99 MiB/s]                 change:                        time:   [+2.0057% +2.3798% +2.7729%] (p = 0.00 < 0.05)                        thrpt:  [-2.6981% -2.3245% -1.9662%]                        Performance has regressed.Found 6 outliers among 100 measurements (6.00%)  2 (2.00%) low mild  4 (4.00%) high mildunicode_word_indices/japanese                        time:   [415.66 µs 416.25 µs 416.86 µs]                        thrpt:  [116.02 MiB/s 116.18 MiB/s 116.35 MiB/s]                 change:                        time:   [+0.5766% +0.8184% +1.0398%] (p = 0.00 < 0.05)                        thrpt:  [-1.0291% -0.8118% -0.5733%]

/// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words
/// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html
#[derive(Debug)]
pubstructUnicodeWords<'a>{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

issue: removing these types is a breaking change

We deliberately have concrete types here so that they can be used inside other things conveniently.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I updated the PR to keep the Iterator structs
(personally I prefer Iterators though as the contract is clearer)

src/word.rs Outdated
pubstructUnicodeWordIndices<'a>{
#[allow(clippy::type_complexity)]
inner:Filter<UWordBoundIndices<'a>,fn(&(usize,&str)) ->bool>,
inner:Box<dynDoubleEndedIterator<Item =(usize,&'astr)> +'a>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't want this to allocate either.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I replaced it with a custom enum iterator (it's also much faster :))

Copy link
Member

@ManishearthManishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)

@PSeitz-dd
Copy link
ContributorAuthor

as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)

I replaced the Box with a custom enum. It's faster, and the performance hit for non-ascii is quite low now.

src/word.rs Outdated
///
/// Any other single ASCII byte is its own boundary (the default WB999).
#[derive(Debug)]
pubstructAsciiWordBoundIter<'a>{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This should not bepub, maybepub(crate) if you need it to be

Copy link
Member

@ManishearthManishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks!!

@ManishearthManishearth merged commitaf87c8d intounicode-rs:masterJul 28, 2025
2 checks passed
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@ManishearthManishearthManishearth approved these changes

+1 more reviewer

@PSeitzPSeitzPSeitz left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

3 participants

@PSeitz-dd@PSeitz@Manishearth

[8]ページ先頭

©2009-2025 Movatter.jp