- Notifications
You must be signed in to change notification settings - Fork61
Implement a special-case lookup for ascii grapheme categories.#79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Implement a special-case lookup for ascii grapheme categories.#79
Uh oh!
There was an error while loading.Please reload this page.
Conversation
This speeds up processing even for many non-ascii texts, sincethey often still use ascii-range punctuation and whitespace.
I took a crack at special-casing the ASCII range as discussed in#77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach. Here are the bench results, including results for both approaches:
Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the slice The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories. |
(Also, sorry for the commit message typo. I can amend the commit if it's important to you.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Oh, right, for graphemes there's not much of a need for tables, good call.
The tables vs if branchingmight be a bigger deal if we want to do this for word/line/sentence breaking.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Thanks! |
Awesome! Thanks for the clean up and merge! |
This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.