Optimization for grapheme iteration. #77


Merged
Manishearth merged 2 commits into unicode-rs:master from cessen:faster_graphemes on Feb 12, 2020

Conversation

cessen
Contributor

Typical text involves a lot of chars one-after-another that
are in a similar scalar value range and in the same category. This
commit takes advantage of that to speed up grapheme iteration by
caching the range and category of the last char, and always checking
against that cache first before doing a binary search on the whole
grapheme category table.

This results in significant speed ups on many texts, in some cases
up to 2x faster.

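The core of the change is a small cache of the last range hit in the category table. A minimal sketch of the idea (the names, table layout, and categories below are illustrative only, not the crate's actual internals):

```rust
use std::cmp::Ordering;

/// Grapheme cluster break categories (truncated; illustrative).
#[derive(Copy, Clone, PartialEq)]
enum GraphemeCat {
    Other,
    Extend,
    // ... remaining categories elided ...
}

/// Stand-in for the sorted (start, end, category) table generated from Unicode data.
static TABLE: &[(char, char, GraphemeCat)] = &[
    // ('\u{0300}', '\u{036F}', GraphemeCat::Extend), ...
];

/// Cached result of the previous lookup: the table range the last char fell in.
struct CatCache {
    lo: char,
    hi: char,
    cat: GraphemeCat,
}

fn grapheme_category(c: char, cache: &mut CatCache) -> GraphemeCat {
    // Fast path: consecutive chars in real text usually fall in the same table
    // range, so check the cached range before paying for a binary search.
    if cache.lo <= c && c <= cache.hi {
        return cache.cat;
    }
    // Slow path: binary search the whole table, then refresh the cache.
    match TABLE.binary_search_by(|&(lo, hi, _)| {
        if hi < c {
            Ordering::Less
        } else if lo > c {
            Ordering::Greater
        } else {
            Ordering::Equal
        }
    }) {
        Ok(idx) => {
            let (lo, hi, cat) = TABLE[idx];
            *cache = CatCache { lo, hi, cat };
            cat
        }
        // Chars not covered by any table range get the default category.
        Err(_) => GraphemeCat::Other,
    }
}
```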
@cessen
Contributor (Author)

Here are some benchmarks I did on a variety of texts in various languages, most of which are sourced from the Wikipedia articles about each respective language (e.g. the Mandarin Chinese text is from the Mandarin Chinese Wikipedia page about the Mandarin Chinese language). They were all converted to plain UTF-8 text (no HTML markup) and copy-pasted to create roughly 200 kB text files.

Before:

test iterate_text_arabic      ... bench:   4,200,019 ns/iter (+/- 100,975) = 43 MB/s
test iterate_text_english     ... bench:   7,563,328 ns/iter (+/- 197,099) = 27 MB/s
test iterate_text_hindi       ... bench:   3,408,050 ns/iter (+/- 78,840) = 56 MB/s
test iterate_text_japanese    ... bench:   2,909,737 ns/iter (+/- 137,862) = 67 MB/s
test iterate_text_korean      ... bench:   3,158,322 ns/iter (+/- 77,622) = 54 MB/s
test iterate_text_mandarin    ... bench:   2,784,563 ns/iter (+/- 74,021) = 67 MB/s
test iterate_text_russian     ... bench:   4,437,562 ns/iter (+/- 136,513) = 42 MB/s
test iterate_text_source_code ... bench:   5,531,391 ns/iter (+/- 161,335) = 25 MB/s
test iterate_text_worst_case  ... bench:   7,252,276 ns/iter (+/- 242,166) = 26 MB/s
test iterate_text_zalgo       ... bench:   3,025,250 ns/iter (+/- 77,794) = 62 MB/s

After:

test iterate_text_arabic      ... bench:   3,150,853 ns/iter (+/- 178,068) = 58 MB/s
test iterate_text_english     ... bench:   4,021,696 ns/iter (+/- 141,119) = 50 MB/s
test iterate_text_hindi       ... bench:   3,462,884 ns/iter (+/- 85,012) = 55 MB/s
test iterate_text_japanese    ... bench:   2,571,715 ns/iter (+/- 103,212) = 76 MB/s
test iterate_text_korean      ... bench:   3,530,435 ns/iter (+/- 139,610) = 48 MB/s
test iterate_text_mandarin    ... bench:   1,913,650 ns/iter (+/- 65,902) = 98 MB/s
test iterate_text_russian     ... bench:   3,140,924 ns/iter (+/- 80,945) = 60 MB/s
test iterate_text_source_code ... bench:   3,170,036 ns/iter (+/- 103,675) = 45 MB/s
test iterate_text_worst_case  ... bench:   8,669,671 ns/iter (+/- 335,782) = 22 MB/s
test iterate_text_zalgo       ... bench:   1,234,701 ns/iter (+/- 25,555) = 153 MB/s

With the exception of Korean and an artificial worst-case text I created, everything either improves in performance or breaks even. And the performance improvements are often substantial.

The worst-case text is composed of alternating grapheme categories on every char, completely defeating the optimization and leaving only its overhead. This should give a good indication of the worst-case performance drop that might happen.

Korean is a bit of a unique case: due to the nature of Korean text, it is pretty close to a real-world equivalent of the artificial worst-case text. Nevertheless, the drop in performance seems (IMO) reasonable given the improvements for other texts.

The zalgo text case includes artificially long combining character sequences (https://en.wikipedia.org/wiki/Combining_character#Zalgo_text), and shouldn't be taken too seriously.
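For reference, the output above is in the format of Rust's nightly `cargo bench` harness; a benchmark of roughly this shape would produce it. This is only an illustrative sketch (the file layout, text path, and function name are placeholders, not the PR's actual benchmarks):

```rust
// benches/graphemes.rs -- illustrative layout; the benchmarks committed later may differ.
#![feature(test)]
extern crate test;

use test::Bencher;
use unicode_segmentation::UnicodeSegmentation;

// Placeholder path: the real benches would load ~200 kB plain-text samples per language.
static TEXT_ENGLISH: &str = include_str!("texts/english.txt");

#[bench]
fn iterate_text_english(b: &mut Bencher) {
    // Setting `bytes` makes the harness report MB/s next to ns/iter.
    b.bytes = TEXT_ENGLISH.len() as u64;
    b.iter(|| TEXT_ENGLISH.graphemes(true).count());
}
```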

Member

@Manishearth left a comment


This is great! Could you also check in the benchmarks you wrote? Perhaps make a benches/ folder.

Be sure to exclude that folder from publishing in Cargo.toml.

Also, an additional optimization that can be done is hardcoding the table for ASCII and checking for that first.

Manishearth merged commit b82ed4d into unicode-rs:master on Feb 12, 2020
@Manishearth
Member

Eh, those things can be followups.

@cessen
Contributor (Author)

@Manishearth
Sure, I'll work on tidying up and committing the benchmarks the next chance I get (possibly this weekend, or maybe the weekend after).

Yeah, I've also been thinking about how to hard-code the ASCII range in some way. I think with some bit-fiddling it could be made extremely fast, because the only pure-ASCII case that needs handling is CRLF (I'm pretty sure?). And it would likely introduce very little overhead for other texts, I'm guessing.

Originally I was planning to do that optimization just in my own application code, bypassing UnicodeSegmentation altogether for those cases, but if you're interested in having it in-tree I'd be more than happy to take a crack at it in another PR.
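For what it's worth, a rough sketch of the kind of application-level pure-ASCII fast path described above might look like this (hypothetical, not part of this PR): under UAX #29, the only multi-character grapheme cluster made entirely of ASCII is CR followed by LF, so every other ASCII char is its own grapheme.

```rust
// Hypothetical application-side fast path for text already known to be pure ASCII.
// The only multi-char grapheme cluster made of ASCII alone is CR LF, so a boundary
// falls between every pair of chars except a CR followed by an LF.
fn ascii_is_grapheme_boundary(prev: u8, next: u8) -> bool {
    !(prev == b'\r' && next == b'\n')
}

fn count_ascii_graphemes(text: &[u8]) -> usize {
    debug_assert!(text.is_ascii());
    let mut count = 0;
    let mut prev: Option<u8> = None;
    for &b in text {
        if prev.map_or(true, |p| ascii_is_grapheme_boundary(p, b)) {
            count += 1; // a boundary before this byte starts a new cluster
        }
        prev = Some(b);
    }
    count
}
```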

@Manishearth
Member

Oh, I meant that we could hardcode the categories so that ASCII doesn't hit the binary search

@cessen
Contributor (Author)

cessen commented Feb 13, 2020 (edited)

Ah, okay. I'm sure that would make ASCII text faster, though I think it would more be because it's effectively biasing the search to check ASCII first rather than because of the hard coding itself. We could choose to bias other ranges similarly, if we wanted to. So I guess the question is: is the ASCII range special enough to warrant doing that?

Edit: to clarify, I mean "special enough" in the general case. For my own project (a code editor), for example, it definitely would be. But I can write my own ASCII-specific optimizations in that case, as could other applications for their own domains.

@Manishearth
Member

@cessen It's special enough because a lot of punctuation/etc. is ASCII even if the text itself might not be. It's going to be a single comparison followed by an indexing operation, which is very cheap if the alternative is a binary search.

It will complement the cache really well, since the cache can keep dealing with the non-ASCII range of the text while spacing and punctuation don't invalidate it. We'd need to tweak the bsearch method to not return the cache index markers in the ASCII case, though.
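In terms of the earlier illustrative sketch, that suggestion amounts to something like the following (again a sketch, not the crate's actual code): a precomputed 128-entry table that ASCII chars index directly, while the cache keeps tracking the surrounding non-ASCII range.

```rust
// Sketch of the suggested ASCII fast path, in terms of the earlier illustrative
// sketch (GraphemeCat, CatCache, grapheme_category); not the crate's actual code.
// Placeholder values: the real 128-entry table would be generated from Unicode data.
static ASCII_CATS: [GraphemeCat; 128] = [GraphemeCat::Other; 128];

fn grapheme_category_fast(c: char, cache: &mut CatCache) -> GraphemeCat {
    let cp = c as u32;
    if cp < 128 {
        // One comparison plus one index, and it deliberately leaves the cache
        // alone so interleaved ASCII spacing/punctuation can't invalidate the
        // cached non-ASCII range.
        return ASCII_CATS[cp as usize];
    }
    // Non-ASCII chars take the cached / binary-search path from before.
    grapheme_category(c, cache)
}
```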

@cessen
Contributor (Author)

@Manishearth I think that really depends on the text. The CJK languages, for example, have their own punctuation and whitespace characters, and the only ASCII-range characters they ever use are line endings, which occur only occasionally in typical texts.

But having said that, I think you're right that many other languages still use ASCII punctuation and whitespace. And then there are things like HTML, too. I'd be happy to take a crack at this and see how it affects the benchmarks, and whether it seems worth whatever overhead it might introduce.

@Manishearth
Copy link
Member

Yes, not all languages do this.

This was definitely a win for unicode-xid in unicode-rs/unicode-xid#16 (which doesn't even go the full way with the ASCII table, just reorders the binary search!), which admittedly works with different sets of characters.

Reviewers

@Manishearth approved these changes

