Optimization for grapheme iteration. #77


Merged
Manishearth merged 2 commits into unicode-rs:master from cessen:faster_graphemes on Feb 12, 2020

Conversation

cessen
Contributor

Typical text involves a lot of chars one-after-another that
are in a similar scalar value range and in the same category. This
commit takes advantage of that to speed up grapheme iteration by
caching the range and category of the last char, and always checking
against that cache first before doing a binary search on the whole
grapheme category table.

This results in significant speed ups on many texts, in some cases
up to 2x faster.

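The core of the change is a small cache of the last range hit in the category table. A minimal sketch of the idea (the names, table layout, and categories below are illustrative only, not the crate's actual internals):

```rust
use std::cmp::Ordering;

/// Grapheme cluster break categories (truncated; illustrative).
#[derive(Copy, Clone, PartialEq)]
enum GraphemeCat {
    Other,
    Extend,
    // ... remaining categories elided ...
}

/// Stand-in for the sorted (start, end, category) table generated from Unicode data.
static TABLE: &[(char, char, GraphemeCat)] = &[
    // ('\u{0300}', '\u{036F}', GraphemeCat::Extend), ...
];

/// Cached result of the previous lookup: the table range the last char fell in.
struct CatCache {
    lo: char,
    hi: char,
    cat: GraphemeCat,
}

fn grapheme_category(c: char, cache: &mut CatCache) -> GraphemeCat {
    // Fast path: consecutive chars in real text usually fall in the same table
    // range, so check the cached range before paying for a binary search.
    if cache.lo <= c && c <= cache.hi {
        return cache.cat;
    }
    // Slow path: binary search the whole table, then refresh the cache.
    match TABLE.binary_search_by(|&(lo, hi, _)| {
        if hi < c {
            Ordering::Less
        } else if lo > c {
            Ordering::Greater
        } else {
            Ordering::Equal
        }
    }) {
        Ok(idx) => {
            let (lo, hi, cat) = TABLE[idx];
            *cache = CatCache { lo, hi, cat };
            cat
        }
        // Chars not covered by any table range get the default category.
        Err(_) => GraphemeCat::Other,
    }
}
```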
@cessen
Contributor (Author)

Here are some benchmarks I did on a variety of texts in various languages, most of which are sourced from the Wikipedia articles about each respective language (e.g. the Mandarin Chinese text is from the Mandarin Chinese Wikipedia page about the Mandarin Chinese language). They were all converted to plain UTF-8 text (no HTML markup) and copy-pasted to create roughly 200 kB text files.

Before:

test iterate_text_arabic      ... bench:   4,200,019 ns/iter (+/- 100,975) = 43 MB/s
test iterate_text_english     ... bench:   7,563,328 ns/iter (+/- 197,099) = 27 MB/s
test iterate_text_hindi       ... bench:   3,408,050 ns/iter (+/- 78,840) = 56 MB/s
test iterate_text_japanese    ... bench:   2,909,737 ns/iter (+/- 137,862) = 67 MB/s
test iterate_text_korean      ... bench:   3,158,322 ns/iter (+/- 77,622) = 54 MB/s
test iterate_text_mandarin    ... bench:   2,784,563 ns/iter (+/- 74,021) = 67 MB/s
test iterate_text_russian     ... bench:   4,437,562 ns/iter (+/- 136,513) = 42 MB/s
test iterate_text_source_code ... bench:   5,531,391 ns/iter (+/- 161,335) = 25 MB/s
test iterate_text_worst_case  ... bench:   7,252,276 ns/iter (+/- 242,166) = 26 MB/s
test iterate_text_zalgo       ... bench:   3,025,250 ns/iter (+/- 77,794) = 62 MB/s

After:

test iterate_text_arabic      ... bench:   3,150,853 ns/iter (+/- 178,068) = 58 MB/s
test iterate_text_english     ... bench:   4,021,696 ns/iter (+/- 141,119) = 50 MB/s
test iterate_text_hindi       ... bench:   3,462,884 ns/iter (+/- 85,012) = 55 MB/s
test iterate_text_japanese    ... bench:   2,571,715 ns/iter (+/- 103,212) = 76 MB/s
test iterate_text_korean      ... bench:   3,530,435 ns/iter (+/- 139,610) = 48 MB/s
test iterate_text_mandarin    ... bench:   1,913,650 ns/iter (+/- 65,902) = 98 MB/s
test iterate_text_russian     ... bench:   3,140,924 ns/iter (+/- 80,945) = 60 MB/s
test iterate_text_source_code ... bench:   3,170,036 ns/iter (+/- 103,675) = 45 MB/s
test iterate_text_worst_case  ... bench:   8,669,671 ns/iter (+/- 335,782) = 22 MB/s
test iterate_text_zalgo       ... bench:   1,234,701 ns/iter (+/- 25,555) = 153 MB/s

With the exception of Korean and an artificial worst-case text I created, everything either improves in performance or breaks even. And the performance improvements are often substantial.

The worst-case text is composed of alternating grapheme categories on every char, completely defeating the optimization and leaving only its overhead. This should give a good indication of the worst-case performance drop that might happen.

Korean is a bit of a unique case: due to the nature of Korean text, it is pretty close to a real-world equivalent of the artificial worst-case text. Nevertheless, the drop in performance seems (IMO) reasonable given the improvements for other texts.

The zalgo text case includes artificially long combining character sequences (https://en.wikipedia.org/wiki/Combining_character#Zalgo_text), and shouldn't be taken too seriously.
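For reference, the output above is in the format of Rust's nightly `cargo bench` harness; a benchmark of roughly this shape would produce it. This is only an illustrative sketch (the file layout, text path, and function name are placeholders, not the PR's actual benchmarks):

```rust
// benches/graphemes.rs -- illustrative layout; the benchmarks committed later may differ.
#![feature(test)]
extern crate test;

use test::Bencher;
use unicode_segmentation::UnicodeSegmentation;

// Placeholder path: the real benches would load ~200 kB plain-text samples per language.
static TEXT_ENGLISH: &str = include_str!("texts/english.txt");

#[bench]
fn iterate_text_english(b: &mut Bencher) {
    // Setting `bytes` makes the harness report MB/s next to ns/iter.
    b.bytes = TEXT_ENGLISH.len() as u64;
    b.iter(|| TEXT_ENGLISH.graphemes(true).count());
}
```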

Member

@Manishearth left a comment


This is great! Could you also check in the benchmarks you wrote? Perhaps make a benches/ folder.

Be sure to exclude that folder from publishing in Cargo.toml.

Also, an additional optimization that can be done is hardcoding the table for ASCII and checking for that first.

Manishearth merged commit b82ed4d into unicode-rs:master on Feb 12, 2020
@Manishearth
Member

Eh, those things can be followups.

@cessen
Contributor (Author)

@Manishearth
Sure, I'll work on tidying up and committing the benchmarks the next chance I get (possibly this weekend, or maybe the weekend after).

Yeah, I've also been thinking about how to hard-code the ASCII range in some way. I think with some bit-fiddling it could be made extremely fast, because the only pure-ASCII case that needs handling is CRLF (I'm pretty sure?). And it would likely introduce very little overhead for other texts, I'm guessing.

Originally I was planning to do that optimization just in my own application code, bypassing UnicodeSegmentation altogether for those cases, but if you're interested in having it in-tree I'd be more than happy to take a crack at it in another PR.
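For what it's worth, a rough sketch of the kind of application-level pure-ASCII fast path described above might look like this (hypothetical, not part of this PR): under UAX #29, the only multi-character grapheme cluster made entirely of ASCII is CR followed by LF, so every other ASCII char is its own grapheme.

```rust
// Hypothetical application-side fast path for text already known to be pure ASCII.
// The only multi-char grapheme cluster made of ASCII alone is CR LF, so a boundary
// falls between every pair of chars except a CR followed by an LF.
fn ascii_is_grapheme_boundary(prev: u8, next: u8) -> bool {
    !(prev == b'\r' && next == b'\n')
}

fn count_ascii_graphemes(text: &[u8]) -> usize {
    debug_assert!(text.is_ascii());
    let mut count = 0;
    let mut prev: Option<u8> = None;
    for &b in text {
        if prev.map_or(true, |p| ascii_is_grapheme_boundary(p, b)) {
            count += 1; // a boundary before this byte starts a new cluster
        }
        prev = Some(b);
    }
    count
}
```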

@Manishearth
Member

Oh, I meant that we could hardcode the categories so that ASCII doesn't hit the binary search

@cessen
Contributor (Author)

cessen commented Feb 13, 2020 (edited)

Ah, okay. I'm sure that would make ASCII text faster, though I think it would more be because it's effectively biasing the search to check ASCII first rather than because of the hard coding itself. We could choose to bias other ranges similarly, if we wanted to. So I guess the question is: is the ASCII range special enough to warrant doing that?

Edit: to clarify, I mean "special enough" in the general case. For my own project (a code editor), for example, it definitely would be. But I can write my own ASCII-specific optimizations in that case, as could other applications for their own domains.

@Manishearth
Member

@cessen It's special enough because a lot of punctuation/etc. is ASCII even if the text itself might not be. It's going to be a single comparison followed by an indexing operation, which is very cheap if the alternative is a binary search.

It will complement the cache really well, since the cache can keep dealing with the non-ASCII range of the text while spacing and punctuation don't invalidate it. We'd need to tweak the bsearch method to not return the cache index markers in the ASCII case, though.
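In terms of the earlier illustrative sketch, that suggestion amounts to something like the following (again a sketch, not the crate's actual code): a precomputed 128-entry table that ASCII chars index directly, while the cache keeps tracking the surrounding non-ASCII range.

```rust
// Sketch of the suggested ASCII fast path, in terms of the earlier illustrative
// sketch (GraphemeCat, CatCache, grapheme_category); not the crate's actual code.
// Placeholder values: the real 128-entry table would be generated from Unicode data.
static ASCII_CATS: [GraphemeCat; 128] = [GraphemeCat::Other; 128];

fn grapheme_category_fast(c: char, cache: &mut CatCache) -> GraphemeCat {
    let cp = c as u32;
    if cp < 128 {
        // One comparison plus one index, and it deliberately leaves the cache
        // alone so interleaved ASCII spacing/punctuation can't invalidate the
        // cached non-ASCII range.
        return ASCII_CATS[cp as usize];
    }
    // Non-ASCII chars take the cached / binary-search path from before.
    grapheme_category(c, cache)
}
```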

@cessen
Contributor (Author)

@Manishearth I think that really depends on the text. The CJK languages, for example, have their own punctuation and whitespace characters, and the only ASCII-range characters they ever use are line endings, which occur only occasionally in typical texts.

But having said that, I think you're right that many other languages still use ASCII punctuation and whitespace. And then there are things like HTML, too. I'd be happy to take a crack at this and see how it affects the benchmarks, and whether it seems worth whatever overhead it might introduce.

@Manishearth
Copy link
Member

Yes, not all languages do this.

This was definitely a win for unicode-xid in unicode-rs/unicode-xid#16 (which doesn't even go the full way with the ASCII table, just reorders the binary search!), which admittedly works with different sets of characters.

Reviewers

@Manishearth approved these changes

