
gh-144157: Optimize bytes.translate() by deferring change detection #144158

Merged

gpshead merged 2 commits into python:main from gpshead:opt-bytes-translate-let-it-unroll

Jan 22, 2026


gpshead (Member) commented Jan 22, 2026 • edited by bedevere-appbot

Move the equality check out of the hot loop to allow better compiler
optimization. Instead of checking each byte during translation, perform
a single memcmp at the end to determine if the input can be returned
unchanged.

This allows compilers to unroll and pipeline the loops, resulting in ~2x
throughput improvement for medium-to-large inputs (tested on an AMD zen2).
No change observed on small inputs.

It will also be faster for bytes subclasses as those do not need change
detection.
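The effect of change detection can be observed from Python. The following is a minimal sketch, assuming CPython, where handing back the input object itself for an unchanged exact bytes value is an implementation detail rather than a language guarantee. The MyBytes subclass and the variable names are hypothetical and exist only for illustration.

```python
# Identity table: every byte maps to itself, so nothing changes.
identity = bytes(range(256))
# Nibble swap table: 0xAB maps to 0xBA, so most bytes change.
nibble_swap = bytes(((i & 0x0F) << 4) | (i >> 4) for i in range(256))

data = b"hello world" * 10

unchanged = data.translate(identity)
swapped = data.translate(nibble_swap)

# CPython implementation detail (assumption, not a language guarantee):
# an exact bytes input that comes out unchanged may be returned as the
# very same object, which is what the change detection decides.
print("same object returned:", unchanged is data)


class MyBytes(bytes):  # hypothetical subclass, for illustration only
    pass


sub = MyBytes(data)
sub_result = sub.translate(identity)
# For a subclass the result is a plain bytes object, so there is no
# "return the input unchanged" decision to make.
print("subclass result type:", type(sub_result).__name__)
```

Swapping nibbles twice restores the original data, which makes the nibble swap table a convenient "everything changes" case to pair with the identity table.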

Issue: optimize bytes.translate by letting the compiler unroll the loop more usefully #144157

gpshead added 2 commits

January 22, 2026 16:08

Optimize bytes.translate() by deferring change detection

3bf238d


NEWS entry

46c1623

bedevere-appbot mentioned this pull request

Jan 22, 2026

optimize bytes.translate by letting the compiler unroll the loop more usefully #144157

Closed

bedevere-appbot added the awaiting core review label

Jan 22, 2026

gpshead requested a review from vstinner

January 22, 2026 16:23

vstinner approved these changes

Jan 22, 2026

vstinner (Member) left a comment

LGTM

bedevere-appbot added awaiting merge and removed awaiting core review labels

Jan 22, 2026

vstinner reviewed

Jan 22, 2026

vstinner (Member) left a comment

The loop already ran *output++ = table_chars[c]; before the change. The change only moves the changed = 1 logic outside the loop.
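The two loop shapes described in the comment above can be modelled in Python. This is only a sketch of the control flow, not the C code from the PR: the function names are hypothetical, and pure Python gains nothing from the restructuring, since the benefit comes from what a C compiler can do with a branch-free loop body.

```python
def translate_check_in_loop(data: bytes, table: bytes) -> tuple[bytes, bool]:
    """Model of the old shape: compare every byte while translating."""
    out = bytearray(len(data))
    changed = False
    for i, c in enumerate(data):
        out[i] = table[c]
        if out[i] != c:  # per-byte test inside the hot loop
            changed = True
    return bytes(out), changed


def translate_deferred_check(data: bytes, table: bytes) -> tuple[bytes, bool]:
    """Model of the new shape: translate first, compare once afterwards."""
    out = bytearray(len(data))
    for i, c in enumerate(data):
        out[i] = table[c]  # loop body is now a plain table lookup
    # One bulk comparison, standing in for the single memcmp in C.
    changed = bytes(out) != data
    return bytes(out), changed
```

Both functions return the same translated bytes and the same changed flag; only the place where the comparison happens differs.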

gpshead merged commit a966d94 into python:main

Jan 22, 2026

53 checks passed

bedevere-appbot removed the awaiting merge label

Jan 22, 2026

gpshead self-assigned this

Jan 22, 2026

sergey-miryanov (Contributor) commented Jan 22, 2026

May I ask you on what length do you test it (medium-to-large inputs)?

gpshead (Member, Author) commented Jan 22, 2026

May I ask you on what length do you test it (medium-to-large inputs)?

I tested using 64 bytes - 256k as a microbenchmark using https://github.com/gpshead/cpython/blob/6d1b11ac1d84228f5ee7b5d4f3ab0c7fb77b7719/Tools/scripts/translate_bench.py#L454-L457 with --bytes_only. claude wrote that and I didn't spend much time looking it over, i'd have written it a bit differently myself to reduce overhead further given it's a microbenchmark, but it works and demonstrates the change and lack of tiny data regression regardless.

skimming my data, the result was already a clear 10-15% improvement at 64 bytes and approached 2x as the size got larger on my zen2.

i didn't spend time looking at the asm generated, but it makes sense in this case: that "changed" test was being done in the loop for every byte despite being something that only needs to short circuit evaluate. this way it is removed and the hot paths of translation and maybe change detection are both parallelizable memory streaming operations, and change detection short circuit evaluates and exits the memcmp upon first changed byte. (thus an identity translation with no changes sees a slightly lower performance gain than others)

Roughly a 2x speedup for large inputs. For smaller inputs (64-127 bytes), the gains are more modest at 8-25% faster where the fixed overhead of the call dominates. I neglected to measure smaller than that, but I do not expect any meaningfully measurable regression.

detailed table (x86_64 zen2 gcc 15.2)

bytes: nibble swap (no del)                    | bytes: nibble swap (no del)
---------------------------------------------  | ---------------------------------------------
      Size      ns/call       GB/s    Out len  |       Size      ns/call       GB/s    Out len
---------------------------------------------  | ---------------------------------------------
        64         88.2       0.73         64  |         64         95.9       0.67         64
       100        106.2       0.94        100  |        100        127.8       0.78        100
       127        109.5       1.16        127  |        127        142.2       0.89        127
       256        147.2       1.74        256  |        256        221.1       1.16        256
       500        230.3       2.17        500  |        500        373.0       1.34        500
      1000        427.1       2.34       1000  |       1000        723.8       1.38       1000
      1024        441.6       2.32       1024  |       1024        729.2       1.40       1024
      4096       1324.8       3.09       4096  |       4096       2590.0       1.58       4096
     16384       4978.0       3.29      16384  |      16384       9922.2       1.65      16384
     65536      19391.2       3.38      65536  |      65536      39363.0       1.66      65536
    262144      77818.4       3.37     262144  |     262144     155573.2       1.69     262144

bytes: identity (no del)                       | bytes: identity (no del)
---------------------------------------------  | ---------------------------------------------
      Size      ns/call       GB/s    Out len  |       Size      ns/call       GB/s    Out len
---------------------------------------------  | ---------------------------------------------
        64         89.2       0.72         64  |         64         98.8       0.65         64
       100        109.5       0.91        100  |        100        131.9       0.76        100
       127        110.4       1.15        127  |        127        146.3       0.87        127
       256        145.8       1.76        256  |        256        232.9       1.10        256
       500        234.8       2.13        500  |        500        380.8       1.31        500
      1000        421.6       2.37       1000  |       1000        706.9       1.41       1000
      1024        433.8       2.36       1024  |       1024        724.9       1.41       1024
      4096       1335.4       3.07       4096  |       4096       2526.7       1.62       4096
     16384       5109.1       3.21      16384  |      16384       9808.6       1.67      16384
     65536      20334.0       3.22      65536  |      65536      39629.5       1.65      65536
    262144      82627.2       3.17     262144  |     262144     156007.6       1.68     262144
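The columns in the table above can be approximated with a short timeit sketch. This is not the linked translate_bench.py script: the bench_translate helper, its parameters, and the input pattern are assumptions made for illustration, and the lambda call adds overhead that a tighter microbenchmark would avoid. GB/s is computed as input bytes divided by nanoseconds per call, which matches the table's figures (for example 64 / 88.2 ≈ 0.73).

```python
import timeit


def bench_translate(table: bytes, size: int, number: int = 1000) -> tuple[float, float, int]:
    """Return (ns_per_call, gb_per_s, out_len) for one input size."""
    data = bytes(i & 0xFF for i in range(size))
    out_len = len(data.translate(table))
    # Best of five runs; each run calls translate() `number` times.
    seconds = min(timeit.repeat(lambda: data.translate(table), number=number, repeat=5))
    ns_per_call = seconds / number * 1e9
    gb_per_s = size / ns_per_call  # bytes per nanosecond equals GB/s
    return ns_per_call, gb_per_s, out_len


if __name__ == "__main__":
    tables = {
        "nibble swap": bytes(((i & 0x0F) << 4) | (i >> 4) for i in range(256)),
        "identity": bytes(range(256)),
    }
    for name, table in tables.items():
        print(f"bytes: {name} (no del)")
        print(f"{'Size':>10}{'ns/call':>13}{'GB/s':>11}{'Out len':>11}")
        for size in (64, 1024, 65536):
            ns, gbps, out_len = bench_translate(table, size, number=200)
            print(f"{size:>10}{ns:>13.1f}{gbps:>11.2f}{out_len:>11}")
```

Running the same script against an interpreter built before and after the change gives the two sides of a comparison; absolute numbers will differ by machine and compiler.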

Other platforms?

rerunning the benchmark on 32-bit raspbian (arm32) on a rpi5, there are still gains. I included smaller 8,20,32 sizes in this run. but the overall result is less impressive. 2%-30% at most for 64 bytes on up. slightly slower on the tiny sizes but close enough it could be in the noise. this lower spec arm probably doesn't pipeline as well or coalesce writes.

and rerunning it on a 64-bit raspbian (arm64) rpi4 (wow those feel slow these days...), much better gains than the arm32 above. closer to what x86_64 zen2 saw. 10%-170% 64 bytes through 256k. insignificant for 32 bytes and below.

sergey-miryanov (Contributor) commented Jan 23, 2026

@gpshead Whoa! Many thanks for such detailed answer!
