unicode-rs/unicode-segmentationPublic

NotificationsYou must be signed in to change notification settings
Fork65
Star632

Add ascii fast path for unicode_word_indices and unicode_words#147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

Manishearth merged 13 commits intounicode-rs:masterfromPSeitz:master

Jul 28, 2025

Merged

Add ascii fast path for unicode_word_indices and unicode_words#147

Manishearth merged 13 commits intounicode-rs:masterfromPSeitz:master

Jul 28, 2025

Conversation

Copy link

Contributor

PSeitz-dd commentedJul 11, 2025•
edited
Loading

This PR adds a fast path forunicode_word_indices andunicode_words.

Add ascii version behind ascii check
Add benchmark
Add enum Iterator to wrap ascii and non-ascii into a single iterator
Add proptest to make sure ascii and non-ascii methods behave exactly the same.

Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.

unicode_word_indices/log                        time:   [209.68 ns 212.35 ns 215.32 ns]                        thrpt:  [761.82 MiB/s 772.45 MiB/s 782.29 MiB/s]                 change:                        time:   [-82.313% -82.117% -81.888%] (p = 0.00 < 0.05)                        thrpt:  [+452.12% +459.18% +465.38%]                        Performance has improved.Found 3 outliers among 100 measurements (3.00%)  3 (3.00%) high mildunicode_word_indices/english                        time:   [426.95 µs 428.03 µs 429.18 µs]                        thrpt:  [110.42 MiB/s 110.72 MiB/s 110.99 MiB/s]                 change:                        time:   [+2.0057% +2.3798% +2.7729%] (p = 0.00 < 0.05)                        thrpt:  [-2.6981% -2.3245% -1.9662%]                        Performance has regressed.Found 6 outliers among 100 measurements (6.00%)  2 (2.00%) low mild  4 (4.00%) high mildunicode_word_indices/japanese                        time:   [415.66 µs 416.25 µs 416.86 µs]                        thrpt:  [116.02 MiB/s 116.18 MiB/s 116.35 MiB/s]                 change:                        time:   [+0.5766% +0.8184% +1.0398%] (p = 0.00 < 0.05)                        thrpt:  [-1.0291% -0.8118% -0.5733%]

PSeitz-dd added3 commits

July 11, 2025 13:01

add benchmark

592d99d

add ascii fastpath

eca9043

add test case IP

b5ed407

PSeitz-dd force-pushed themaster branch from9a85968 tob5ed407Compare

July 11, 2025 06:27

add log to benches

9b1b7f9

Manishearth requested changes

Jul 14, 2025

View reviewed changes

src/word.rs

		/// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words
		/// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html
		#[derive(Debug)]
		pubstructUnicodeWords<'a>{

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

issue: removing these types is a breaking change

We deliberately have concrete types here so that they can be used inside other things conveniently.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I updated the PR to keep the Iterator structs
(personally I prefer Iterators though as the contract is clearer)

PSeitz-dd added3 commits

July 15, 2025 19:55

restore iterators

6f96a23

add backwards iterator

7beb8a6

restore test

a3881da

Manishearth reviewed

Jul 15, 2025

View reviewed changes

src/word.rs Outdated

		pubstructUnicodeWordIndices<'a>{
		#[allow(clippy::type_complexity)]
		inner:Filter<UWordBoundIndices<'a>,fn(&(usize,&str)) ->bool>,
		inner:Box<dynDoubleEndedIterator<Item =(usize,&'astr)> +'a>,

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't want this to allocate either.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I replaced it with a custom enum iterator (it's also much faster :))

Manishearth reviewed

Jul 15, 2025

View reviewed changes

Copy link

Member

Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)

src/word.rsShow resolvedHide resolved

PSeitz-dd added2 commits

July 16, 2025 20:15

replace Box with Enum

7599d62

add comments with reference to the spec

e29c432

Copy link

ContributorAuthor

PSeitz-dd commentedJul 17, 2025

as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)

I replaced the Box with a custom enum. It's faster, and the performance hit for non-ascii is quite low now.