Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Implement a special-case lookup for ascii grapheme categories.#79

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
Manishearth merged 2 commits intounicode-rs:masterfromcessen:ascii_grapheme_optimization
Feb 14, 2020
Merged

Implement a special-case lookup for ascii grapheme categories.#79

Manishearth merged 2 commits intounicode-rs:masterfromcessen:ascii_grapheme_optimization
Feb 14, 2020

Conversation

cessen
Copy link
Contributor

This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.

schneiderfelipe reacted with thumbs up emoji
This speeds up processing even for many non-ascii texts, sincethey often still use ascii-range punctuation and whitespace.
@cessencessen changed the titleImplement a special-case lookup for ascii grapeheme categories.Implement a special-case lookup for ascii grapheme categories.Feb 14, 2020
@cessen
Copy link
ContributorAuthor

I took a crack at special-casing the ASCII range as discussed in#77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach.

Here are the bench results, including results for both approaches:

No ascii special-casing:test graphemes_arabic      ... bench:     893,766 ns/iter (+/- 68,494) = 56 MB/stest graphemes_english     ... bench:     989,329 ns/iter (+/- 42,635) = 50 MB/stest graphemes_hindi       ... bench:     933,926 ns/iter (+/- 30,817) = 53 MB/stest graphemes_japanese    ... bench:     727,820 ns/iter (+/- 37,069) = 69 MB/stest graphemes_korean      ... bench:   1,083,001 ns/iter (+/- 33,362) = 46 MB/stest graphemes_mandarin    ... bench:     520,048 ns/iter (+/- 10,724) = 97 MB/stest graphemes_russian     ... bench:     842,000 ns/iter (+/- 32,343) = 60 MB/stest graphemes_source_code ... bench:   1,060,247 ns/iter (+/- 42,456) = 47 MB/sTable:test graphemes_arabic      ... bench:     692,838 ns/iter (+/- 34,703) = 72 MB/stest graphemes_english     ... bench:   1,018,363 ns/iter (+/- 48,135) = 48 MB/stest graphemes_hindi       ... bench:     826,851 ns/iter (+/- 39,035) = 59 MB/stest graphemes_japanese    ... bench:     739,136 ns/iter (+/- 68,602) = 68 MB/stest graphemes_korean      ... bench:   1,000,320 ns/iter (+/- 85,934) = 50 MB/stest graphemes_mandarin    ... bench:     522,754 ns/iter (+/- 24,797) = 96 MB/stest graphemes_russian     ... bench:     717,681 ns/iter (+/- 25,272) = 71 MB/stest graphemes_source_code ... bench:   1,008,405 ns/iter (+/- 31,694) = 49 MB/sIf branching:test graphemes_arabic      ... bench:     663,521 ns/iter (+/- 36,100) = 75 MB/stest graphemes_english     ... bench:     927,851 ns/iter (+/- 35,652) = 53 MB/stest graphemes_hindi       ... bench:     801,972 ns/iter (+/- 35,350) = 61 MB/stest graphemes_japanese    ... bench:     716,374 ns/iter (+/- 40,118) = 70 MB/stest graphemes_korean      ... bench:     964,766 ns/iter (+/- 47,644) = 51 MB/stest graphemes_mandarin    ... bench:     503,575 ns/iter (+/- 15,838) = 100 MB/stest graphemes_russian     ... bench:     665,667 ns/iter (+/- 27,923) = 76 MB/stest graphemes_source_code ... bench:     959,008 ns/iter (+/- 66,405) = 52 MB/s

Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the sliceget() method to roll the bounds check and ascii-range check into one, so the whole thing was just a single branch and an array lookup.

The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories.

@cessen
Copy link
ContributorAuthor

(Also, sorry for the commit message typo. I can amend the commit if it's important to you.)

Copy link
Member

@ManishearthManishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Oh, right, for graphemes there's not much of a need for tables, good call.

The tables vs if branchingmight be a bigger deal if we want to do this for word/line/sentence breaking.

@ManishearthManishearth merged commitfbba2a6 intounicode-rs:masterFeb 14, 2020
@Manishearth
Copy link
Member

Thanks!

@cessencessen deleted the ascii_grapheme_optimization branchFebruary 15, 2020 00:52
@cessen
Copy link
ContributorAuthor

Awesome! Thanks for the clean up and merge!

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@ManishearthManishearthManishearth approved these changes

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

2 participants
@cessen@Manishearth

[8]ページ先頭

©2009-2025 Movatter.jp