@CarlosEduR uses a sensible approach that does not require 'deep' changes. This PR goes somewhat deeper. Whether this PR is better than #2211 is an open question.

Note that @CarlosEduR checks for the null termination, which is unnecessary work. Removing this check could change the story.

Partial fix to #1470. Note that @jkeiser's idea is somewhat more involved than what we are doing currently.

Benchmarks on my ARM processor (Apple M2).

partial tweets

In this benchmark, strings typically do not contain escaped content, but it does happen from time to time. I don't have an exact percentage, but perhaps 10% to 20% of the strings contain escaped content.

Using ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>". Run with sudo to get performance counters.

Main:
best_instructions_per_byte=8.68026 best_instructions_per_cycle=6.57819

@CarlosEduR's PR:
best_instructions_per_byte=9.31405 best_instructions_per_cycle=6.54882

This PR:
best_instructions_per_byte=8.66243 best_instructions_per_cycle=6.5704
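To put the three instructions-per-byte figures above on a common scale, the relative change against main can be computed directly. The following is a small sketch; the function and constant names are made up for illustration, and the values are copied from the output above.

```cpp
// Illustrative helper (hypothetical name): relative change of `value`
// with respect to `baseline`, expressed in percent.
constexpr double relative_change_percent(double baseline, double value) {
  return (value - baseline) / baseline * 100.0;
}

// best_instructions_per_byte for partial_tweets on the Apple M2, as reported above.
constexpr double main_ipb = 8.68026;     // main
constexpr double other_pr_ipb = 9.31405; // @CarlosEduR's PR
constexpr double this_pr_ipb = 8.66243;  // this PR
```

With these numbers, `relative_change_percent(main_ipb, other_pr_ipb)` is about 7.3 (more instructions per byte than main), while `relative_change_percent(main_ipb, this_pr_ipb)` is about -0.2, which is within what could plausibly be noise.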

find tweet

This is a lucky benchmark where we never have escaped content to worry about.

Using ./build/benchmark/bench_ondemand --benchmark_filter="find_tweet<simdjson_ondemand>". Run with sudo to get performance counters.

Main:
best_instructions_per_byte=4.71785 best_instructions_per_cycle=6.34339

@CarlosEduR's PR:
best_instructions_per_byte=4.7256 best_instructions_per_cycle=6.34974

This PR:
best_instructions_per_byte=4.71761 best_instructions_per_cycle=6.34714

Conclusion

It is too early to tell which direction this goes because (1) I only tested on one system and (2) only on two benchmarks.

ARM systems do not have to contend with runtime dispatches, so this is an advantage for this PR compared to @CarlosEduR's PR. However, @CarlosEduR's PR could do better when runtime dispatching is needed.

@CarlosEduR's PR shows a regression which is possibly caused by the fact that it tries to avoid the copy, fails and then has to fall back on the current code. Even if it only happens one time out of 5 or 10, these unlucky cases could cost you.
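The general pattern described here (try to return the string without copying, and fall back to the copying path once an escape is found) can be sketched as follows. This is not the code from either PR or from simdjson: the name `view_or_unescape` and the `parsed_string` struct are hypothetical, the input is assumed to be already validated so that a closing quote is guaranteed (hence no check for a terminating null), and `\uXXXX` escapes are deliberately left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: a view of the string content plus a flag
// telling whether the content had to be copied into the buffer.
struct parsed_string {
  std::string_view value;
  bool copied;
};

// Hypothetical helper, not simdjson's API. `src` points just past the
// opening quote of a JSON string. ASSUMPTION: the input was validated, so a
// closing quote exists and we never need to test for '\0'.
inline parsed_string view_or_unescape(const char *src, std::string &buffer) {
  // Optimistic pass: look for the closing quote, stop at the first backslash.
  const char *p = src;
  while (*p != '"' && *p != '\\') { ++p; }
  if (*p == '"') {
    // Lucky case: no escape, so we can point directly into the input.
    return {std::string_view(src, static_cast<std::size_t>(p - src)), false};
  }
  // Unlucky case: the optimistic pass did not pay off. Copy the clean prefix
  // and continue with an unescaping copy into the buffer.
  buffer.assign(src, p);
  while (*p != '"') {
    if (*p == '\\') {
      ++p; // look at the escaped character
      switch (*p) {
        case 'n': buffer.push_back('\n'); break;
        case 't': buffer.push_back('\t'); break;
        case 'r': buffer.push_back('\r'); break;
        case 'b': buffer.push_back('\b'); break;
        case 'f': buffer.push_back('\f'); break;
        default: buffer.push_back(*p); break; // \" \\ \/ (no \uXXXX in this sketch)
      }
    } else {
      buffer.push_back(*p);
    }
    ++p;
  }
  return {std::string_view(buffer), true};
}
```

In the unlucky case the returned view refers to `buffer`, so it stays valid only while `buffer` is alive and unmodified. The sketch also makes the cost visible: when an escape is present, the optimistic scan is extra work on top of the regular copying path.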

@CarlosEduR's PR could be further optimized and the story might change.

Overall, my preliminary results suggest that on Apple Silicon, it is not worth avoiding a write on the string buffer.

lemire added 2 commits

July 11, 2024 17:57

fix: add parse_string_if_needed function

76f45a0

let us be careful and not change the API

403b8bf

CarlosEduR (Member) commented Jul 12, 2024 (edited)

This is an excellent PR, @lemire!
I appreciate the dedication, really nice job!

I am using an AMD64 machine; I'll run the benchmarks locally.

lemire (Member, Author) commented Jul 12, 2024

> I appreciate the dedication

It is not super difficult, thankfully... but I am somewhat disappointed so far that we don't see much of an effect.

> I am using an AMD64 machine; I'll run the benchmarks locally.

That would be great. Please consider optimizing your own code (removing the != '\0' check), as this could change the story.

CarlosEduR (Member) commented Jul 16, 2024

partial_tweets<simdjson_ondemand>

Daniel's PR:
best_instructions_per_byte=3.31552 best_instructions_per_cycle=3.22185

Master Branch:
best_instructions_per_byte=3.30848 best_instructions_per_cycle=3.27537

Carlos' PR:
best_instructions_per_byte=3.6775 best_instructions_per_cycle=3.27884

find_tweet<simdjson_ondemand>

Master branch:
best_instructions_per_byte=2.28167 best_instructions_per_cycle=3.24331

Daniel's PR:
best_instructions_per_byte=2.28163 best_instructions_per_cycle=3.23015

Carlos' PR:
best_instructions_per_byte=2.28611 best_instructions_per_cycle=3.1972

I've not updated my code yet (removing the != '\0'), will do it and will share results.

lemire (Member, Author) commented Sep 18, 2024

@CarlosEduR

> I've not updated my code yet (removing the != '\0'), will do it and will share results.

Did you get around to it?

CarlosEduR (Member) commented Sep 25, 2024

> Did you get around to it?

Yes!
partial_tweets and find_tweet benchmarks:

best_instructions_per_byte=3.58215 best_instructions_per_cycle=3.15066
best_instructions_per_byte=2.28551 best_instructions_per_cycle=3.25264

lemire (Member, Author) commented Sep 25, 2024

@CarlosEduR

Hmmm, did the number of instructions go up in your find_tweet benchmark?

Just so we are clear, the idea was to reduce the amount of unnecessary work in your PR (check for terminating null). This should improve the performance and reduce the instruction count?


fix: add parse_string_if_needed function #2212
