unicode-rs/unicode-segmentation

Add grapheme iteration benchmarks for various languages. #78

Merged

Manishearth merged 3 commits into unicode-rs:master from cessen:grapheme_bench

Feb 14, 2020

Conversation

Contributor

cessen commented Feb 14, 2020 • edited

Grapheme iteration benchmarks for Arabic, English, Hindi, Japanese, Korean, Mandarin Chinese, Russian, and C source code.

Add grapheme iteration benchmarks for various languages.

c5bc229

Contributor, Author

cessen commented Feb 14, 2020

Here are the cleaned-up benchmarks. They aren't quite the same as the ones I used in #77:

- I basically re-built them to make extra sure all the text is from open-source/free-culture sources.
- I took the time to source enough text to make most of the text unique, rather than a bunch of repeat copy-paste.
- I reduced the text size to ~50kb each, as that seems like plenty.
- I removed the zalgo and worst-case texts, as I don't think they're actually useful in practice and would likely just be confusing to people looking at the benchmarks in the future.

I wasn't totally sure how best to exclude the benches folder from publishing. Simply adding it to the exclude list actually causes packaging to fail, since it's referenced by the [[bench]] section. So for now I only excluded benches/texts, which contains the large text files. If I should do this differently, let me know!

Contributor, Author

cessen commented Feb 14, 2020

Also, just to be clear: the benchmarks don't compile without the text files present. So I'm a bit nervous about this approach for excluding the benchmark texts from publishing. I'm not sure if there's any infrastructure that might not like that.

Manishearth approved these changes

Feb 14, 2020

Member

Manishearth left a comment

This looks great! Can we mention the license for these files? It's CC-BY-SA so we need to mention the license and attribute it.

benches/graphemes.rs (outdated)

	fn graphemes_english(bench: &mut Bencher) {
		bench.iter(|| {
			for g in UnicodeSegmentation::graphemes(TEXT_ENGLISH, true) {

Member

Manishearth, Feb 14, 2020

The text itself should also pass through black_box. Probably doesn't matter given how large it is, but worth a shot.

Alternatively, we can load the file dynamically outside of the iter() call.
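
The first suggestion can be illustrated with a minimal standard-library sketch. It is not the code from this PR: `count_units` and `time_passes` are hypothetical helpers, `std::hint::black_box` stands in for whatever `black_box` the benchmark harness provides, and `chars()` stands in for `UnicodeSegmentation::graphemes(text, true)` so that the sketch builds without external crates.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in for the real work. The actual benchmark iterates
/// `UnicodeSegmentation::graphemes(text, true)`; `chars()` is used here
/// only so the sketch compiles without external crates.
fn count_units(text: &str) -> usize {
    let mut n = 0;
    for c in text.chars() {
        // Keep each item observable so the loop body is not optimized away.
        black_box(c);
        n += 1;
    }
    n
}

/// Hypothetical helper: time `iters` passes over `text`.
fn time_passes(text: &str, iters: u32) -> (usize, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        // Pass the input through `black_box` so the compiler cannot
        // specialize the loop on a string it knows at compile time.
        last = count_units(black_box(text));
    }
    (last, start.elapsed())
}
```

Calling `time_passes(&text, 100)` with text loaded before the call keeps the file read out of the timed region, which is the point of the second suggestion.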

Member

Manishearth, Feb 14, 2020

Pushed a fix that does this.

Member

Manishearth commented Feb 14, 2020

We've had complaints about tests that break when run from published code, but this is usually packagers ensuring that their packaging went correctly. I don't think we'll have the same issue for benches. However, my understanding is that you don't need a [[bench]] entry if you just have a benches/ folder anyway, so we could totally just exclude it.

Member

Manishearth commented Feb 14, 2020

Oh, nope, it's necessary because you're using a custom harness. Darn.
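
For reference, the manifest arrangement being discussed might look roughly like the fragment below. This is a sketch, not the crate's actual Cargo.toml: the bench name is assumed to match benches/graphemes.rs, and the exclude pattern is an assumption based on the benches/texts path mentioned above.

```toml
[package]
# ...other package keys elided...
# Keep the large benchmark texts out of the published crate.
exclude = ["benches/texts/*"]

# Declared explicitly because the benchmark supplies its own `main`
# instead of relying on the default libtest harness.
[[bench]]
name = "graphemes"   # assumed to correspond to benches/graphemes.rs
harness = false
```

Because the `[[bench]]` entry points at a file under benches/, excluding the whole folder would leave the manifest referring to a missing target, which is consistent with the packaging failure described earlier.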

Load bench data from file instead

5bf3107

Member

Manishearth commented Feb 14, 2020

Anyway, the benches will now compile, just not run if run from the package.
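
The difference comes from when the text is read. A compile-time include such as `include_str!` fails the build if the file is absent, while a run-time read only fails when the benchmark executes. The sketch below shows the run-time variant. `load_bench_text` and `run_or_skip` are hypothetical names, and skipping on a missing file is a choice made for this sketch; the thread does not say how the merged code reports the error.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: read a benchmark text at run time, so a missing
/// file becomes a run-time error rather than a compile error.
fn load_bench_text(path: &Path) -> io::Result<String> {
    fs::read_to_string(path)
}

/// Run a stand-in workload over the text, or skip if the file is absent
/// (for example, when built from a package that excludes the texts).
fn run_or_skip(path: &Path) -> Option<usize> {
    match load_bench_text(path) {
        // `chars().count()` stands in for the grapheme iteration.
        Ok(text) => Some(text.chars().count()),
        Err(err) => {
            eprintln!("skipping {}: {}", path.display(), err);
            None
        }
    }
}
```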

Add license and readme for benchmark texts

b1765ec

Manishearth merged commit 485767a into unicode-rs:master

Feb 14, 2020

Member

Manishearth commented Feb 14, 2020

Pushed in some commits adding a license/attribution and making the benchmarks use files. Thanks for doing this!

Contributor, Author

cessen commented Feb 15, 2020

Awesome, thanks for cleaning this up!

Out of curiosity, is there a way to do benchmarks in Rust without a custom harness? My understanding was that the standard benchmarker isn't stable, so you always need a custom harness. If that's not the case, I'd be more than happy to change the benchmarker that's used.

cessen deleted the grapheme_bench branch

February 15, 2020 00:49

Member

Manishearth commented Feb 15, 2020

I didn't know that you can use custom harnesses this way on stable!

Yes, the default harness isn't stable, but most people just set up their CI to bench on nightly only. (shrug)
