- Notifications
You must be signed in to change notification settings - Fork65
Add ascii fast path for unicode_word_indices and unicode_words#147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
| /// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words | ||
| /// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html | ||
| #[derive(Debug)] | ||
| pubstructUnicodeWords<'a>{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
issue: removing these types is a breaking change
We deliberately have concrete types here so that they can be used inside other things conveniently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I updated the PR to keep the Iterator structs
(personally I prefer Iterators though as the contract is clearer)
src/word.rs Outdated
| pubstructUnicodeWordIndices<'a>{ | ||
| #[allow(clippy::type_complexity)] | ||
| inner:Filter<UWordBoundIndices<'a>,fn(&(usize,&str)) ->bool>, | ||
| inner:Box<dynDoubleEndedIterator<Item =(usize,&'astr)> +'a>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I don't want this to allocate either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I replaced it with a custom enum iterator (it's also much faster :))
Manishearth left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)
Uh oh!
There was an error while loading.Please reload this page.
PSeitz-dd commentedJul 17, 2025
I replaced the Box with a custom enum. It's faster, and the performance hit for non-ascii is quite low now. |
src/word.rs Outdated
| /// | ||
| /// Any other single ASCII byte is its own boundary (the default WB999). | ||
| #[derive(Debug)] | ||
| pubstructAsciiWordBoundIter<'a>{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This should not bepub, maybepub(crate) if you need it to be
Manishearth left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks!!
af87c8d intounicode-rs:masterUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
This PR adds a fast path for
unicode_word_indicesandunicode_words.Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.