gh-144157: Optimize bytes.translate() by deferring change detection #144158


Merged

@gpshead (Member) commented Jan 22, 2026

Move the equality check out of the hot loop to allow better compiler
optimization. Instead of checking each byte during translation, perform
a single memcmp at the end to determine if the input can be returned
unchanged.

This allows compilers to unroll and pipeline the loops, resulting in ~2x
throughput improvement for medium-to-large inputs (tested on an AMD zen2).
No change observed on small inputs.

It will also be faster for bytes subclasses as those do not need change
detection.
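
As a rough illustration of the description above, here is a minimal sketch of the two loop shapes. It is not the actual CPython source: the function names, signatures, and the `need_change_check` flag are hypothetical, chosen only to show a per-byte change flag versus a single deferred memcmp.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "before" shape: the change flag is tested and updated
 * for every byte inside the translation loop. */
int
translate_check_in_loop(const unsigned char *input, size_t len,
                        const unsigned char table[256],
                        unsigned char *output)
{
    int changed = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = input[i];
        unsigned char t = table[c];
        output[i] = t;
        if (t != c) {
            changed = 1;
        }
    }
    return changed;
}

/* Hypothetical "after" shape: the loop only translates, and change
 * detection is one memcmp afterwards. Callers that never reuse the
 * input (the bytes subclass case in the description) pass
 * need_change_check = 0 and skip the comparison entirely. */
int
translate_check_deferred(const unsigned char *input, size_t len,
                         const unsigned char table[256],
                         unsigned char *output, int need_change_check)
{
    for (size_t i = 0; i < len; i++) {
        output[i] = table[input[i]];
    }
    if (!need_change_check) {
        return 1;  /* treat as changed: always use the new buffer */
    }
    return memcmp(input, output, len) != 0;
}
```

Both functions write the same output; only the place where "did anything change?" is answered differs. With the branch gone from the loop body, the compiler has a simpler loop to unroll and pipeline.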

@vstinner (Member) left a comment

LGTM

@vstinner (Member) left a comment

The loop already ran *output++ = table_chars[c]; before the change; the change only moves the changed = 1 logic outside the loop.

gpshead reacted with thumbs up emoji
@gpshead merged commit a966d94 into python:main Jan 22, 2026
53 checks passed
@gpshead self-assigned this Jan 22, 2026
@sergey-miryanov (Contributor)

May I ask what input lengths you tested (medium-to-large inputs)?

@gpshead (Member, Author)

May I ask what input lengths you tested (medium-to-large inputs)?

I tested sizes from 64 bytes to 256k as a microbenchmark using https://github.com/gpshead/cpython/blob/6d1b11ac1d84228f5ee7b5d4f3ab0c7fb77b7719/Tools/scripts/translate_bench.py#L454-L457 with --bytes_only. Claude wrote that script and I didn't spend much time looking it over. I'd have written it a bit differently myself to reduce overhead further, given that it's a microbenchmark, but it works and demonstrates both the change and the lack of a regression on tiny data.

Skimming my data, the result was already a clear 10-15% improvement at 64 bytes and approached 2x as the size got larger on my zen2.

I didn't spend time looking at the generated asm, but it makes sense in this case: that "changed" test was being done in the loop for every byte, despite being something that only needs to short-circuit. This way it is removed from the loop, and the hot paths of translation and (when needed) change detection are both parallelizable memory-streaming operations. Change detection short-circuits, exiting the memcmp upon the first changed byte (thus an identity translation with no changes sees a slightly lower performance gain than the others).

Roughly a 2x speedup for large inputs. For smaller inputs (64-127 bytes), the gains are more modest at 8-25% faster where the fixed overhead of the call dominates. I neglected to measure smaller than that, but I do not expect any meaningfully measurable regression.

Detailed table (x86_64 zen2, gcc 15.2):

bytes: nibble swap (no del)         | bytes: nibble swap (no del)
----------------------------------- | -----------------------------------
    Size    ns/call   GB/s  Out len |     Size    ns/call   GB/s  Out len
----------------------------------- | -----------------------------------
      64       88.2   0.73       64 |       64       95.9   0.67       64
     100      106.2   0.94      100 |      100      127.8   0.78      100
     127      109.5   1.16      127 |      127      142.2   0.89      127
     256      147.2   1.74      256 |      256      221.1   1.16      256
     500      230.3   2.17      500 |      500      373.0   1.34      500
    1000      427.1   2.34     1000 |     1000      723.8   1.38     1000
    1024      441.6   2.32     1024 |     1024      729.2   1.40     1024
    4096     1324.8   3.09     4096 |     4096     2590.0   1.58     4096
   16384     4978.0   3.29    16384 |    16384     9922.2   1.65    16384
   65536    19391.2   3.38    65536 |    65536    39363.0   1.66    65536
  262144    77818.4   3.37   262144 |   262144   155573.2   1.69   262144

bytes: identity (no del)            | bytes: identity (no del)
----------------------------------- | -----------------------------------
    Size    ns/call   GB/s  Out len |     Size    ns/call   GB/s  Out len
----------------------------------- | -----------------------------------
      64       89.2   0.72       64 |       64       98.8   0.65       64
     100      109.5   0.91      100 |      100      131.9   0.76      100
     127      110.4   1.15      127 |      127      146.3   0.87      127
     256      145.8   1.76      256 |      256      232.9   1.10      256
     500      234.8   2.13      500 |      500      380.8   1.31      500
    1000      421.6   2.37     1000 |     1000      706.9   1.41     1000
    1024      433.8   2.36     1024 |     1024      724.9   1.41     1024
    4096     1335.4   3.07     4096 |     4096     2526.7   1.62     4096
   16384     5109.1   3.21    16384 |    16384     9808.6   1.67    16384
   65536    20334.0   3.22    65536 |    65536    39629.5   1.65    65536
  262144    82627.2   3.17   262144 |   262144   156007.6   1.68   262144
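
For readers who want to check the figures, the GB/s column is consistent with size divided by ns/call (one byte per nanosecond is one decimal gigabyte per second), and a speedup is the ratio of two ns/call values. The helpers below are a small sketch with hypothetical names, not part of the benchmark script. The unlabeled left and right halves of the table are assumed here to be the runs with and without this change, since the left half is the faster one.

```c
#include <stddef.h>

/* Throughput in decimal GB/s: bytes per nanosecond equals GB per second. */
double
throughput_gb_per_s(size_t size_bytes, double ns_per_call)
{
    return (double)size_bytes / ns_per_call;
}

/* Speedup factor of the faster run relative to the slower run. */
double
speedup(double slower_ns, double faster_ns)
{
    return slower_ns / faster_ns;
}
```

For the nibble swap case at 262144 bytes, `throughput_gb_per_s(262144, 77818.4)` is about 3.37 and `speedup(155573.2, 77818.4)` is just under 2.0, while at 64 bytes `speedup(95.9, 88.2)` is about 1.09.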

Other platforms?

Rerunning the benchmark on 32-bit Raspbian (arm32) on an rpi5, there are still gains, but the overall result is less impressive: 2%-30% at most for 64 bytes on up. I included smaller sizes (8, 20, 32) in this run; those were slightly slower, but close enough that it could be in the noise. This lower-spec arm probably doesn't pipeline as well or coalesce writes.

Rerunning it on 64-bit Raspbian (arm64) on an rpi4 (wow, those feel slow these days...) showed much better gains than the arm32 run above, closer to what x86_64 zen2 saw: 10%-170% from 64 bytes through 256k, and insignificant for 32 bytes and below.

sergey-miryanov reacted with rocket emoji

@sergey-miryanov (Contributor)

@gpshead Whoa! Many thanks for such a detailed answer!


Reviewers: @vstinner approved these changes

Assignees: @gpshead
