GH-115802: Optimize JIT stencils for size #136393


Merged
brandtbucher merged 2 commits into python:main from brandtbucher:jit-os on Jul 9, 2025

Conversation

@brandtbucher (Member) commented on Jul 7, 2025 · edited by bedevere-app bot

As the new comment says, upon manual review of -O3, -O2, and -Os, it seems that -Os generates the best code for the JIT's use case. Perf impact is close to noise, but slightly positive on x86-64 Linux and AArch64 macOS, neutral on AArch64 Linux, and slightly negative on x86-64 Windows. According to the stats, the size of JIT code is down by about 1-2%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250628-3.15.0a0-33054dd-JIT/README.md

Here's an example of how skipping tail duplication removes an extra jump and a duplicate instruction from _POP_TOP (also reducing its size by 19%):

```diff
-    // 11: 75 04                         jne     0x17 <_JIT_ENTRY+0x17>
+    // 11: 75 0f                         jne     0x22 <_JIT_ENTRY+0x22>
     // 13: ff 0f                         decl    (%rdi)
-    // 15: 74 07                         je      0x1e <_JIT_ENTRY+0x1e>
-    // 17: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    // 1c: eb 10                         jmp     0x2e <_JIT_CONTINUE>
-    // 1e: 50                            pushq   %rax
-    // 1f: ff 15 00 00 00 00             callq   *(%rip)                 # 0x25 <_JIT_ENTRY+0x25>
-    // 0000000000000021:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
-    // 25: 48 83 c4 08                   addq    $0x8, %rsp
-    // 29: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    const unsigned char code_body[46] = {
+    // 15: 75 0b                         jne     0x22 <_JIT_ENTRY+0x22>
+    // 17: 50                            pushq   %rax
+    // 18: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1e <_JIT_ENTRY+0x1e>
+    // 000000000000001a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
+    // 1e: 48 83 c4 08                   addq    $0x8, %rsp
+    // 22: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
+    const unsigned char code_body[39] = {
         0x49, 0x8b, 0x7d, 0xf8, 0x49, 0x83, 0xc5, 0xf8,
         0x4d, 0x89, 0x6c, 0x24, 0x40, 0x40, 0xf6, 0xc7,
-        0x01, 0x75, 0x04, 0xff, 0x0f, 0x74, 0x07, 0x4d,
-        0x8b, 0x6c, 0x24, 0x40, 0xeb, 0x10, 0x50, 0xff,
-        0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc4,
-        0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
+        0x01, 0x75, 0x0f, 0xff, 0x0f, 0x75, 0x0b, 0x50,
+        0xff, 0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83,
+        0xc4, 0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
     };
```

Full diff for the stencils here:

https://gist.github.com/brandtbucher/7340be56f2d2cf7061b5c9bf1c87939c
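The 1-2% figure comes from the sizes of the generated stencil bodies, which the header declares as `code_body[N]` arrays. As a rough illustration only (this helper is hypothetical, not part of the PR or the Tools/jit scripts), the total could be computed by scraping those array sizes out of a `jit_stencils.h`-style header:

```python
import re

def total_stencil_size(header_text: str) -> int:
    # Sum the declared sizes of every stencil body, e.g.
    # "const unsigned char code_body[46] = {...};" contributes 46.
    return sum(int(n) for n in re.findall(r"code_body\[(\d+)\]", header_text))

# The _POP_TOP stencil above shrinks from 46 to 39 bytes under -Os:
before = "const unsigned char code_body[46] = {};"
after = "const unsigned char code_body[39] = {};"
print(total_stencil_size(before) - total_stencil_size(after))  # → 7
```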

@brandtbucher self-assigned this on Jul 7, 2025
@brandtbucher added the performance, skip news, interpreter-core, and topic-JIT labels on Jul 7, 2025
@bedevere-app bot mentioned this pull request on Jul 7, 2025
```python
f"-I{CPYTHON/'Python'}",
f"-I{CPYTHON/'Tools'/'jit'}",
"-O3",
# -O2 and -O3 include some optimizations that make sense for
```

@savannahostrowski (reviewer) commented:
Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious whether it's worth trying here too.

@brandtbucher (Member, Author) replied:


Nice idea! I'm definitely down to try benchmarking it after this lands.

I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side effect of not aligning jumps or duplicating tails, etc.). So smaller isn't necessarily always better.

@brandtbucher (Member, Author) replied:


Yeah, I'm not sure this is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO turns from this on -Os:

```asm
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50                            pushq   %rax
// 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
// 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
// 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
// d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
// 12: 40 f6 c7 01                   testb   $0x1, %dil
// 16: 75 0a                         jne     0x22 <_JIT_ENTRY+0x22>
// 18: ff 0f                         decl    (%rdi)
// 1a: 75 06                         jne     0x22 <_JIT_ENTRY+0x22>
// 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
// 000000000000001e:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
// 22: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
// 28: f6 c3 01                      testb   $0x1, %bl
// 2b: 75 0d                         jne     0x3a <_JIT_ENTRY+0x3a>
// 2d: ff 0b                         decl    (%rbx)
// 2f: 75 09                         jne     0x3a <_JIT_ENTRY+0x3a>
// 31: 48 89 df                      movq    %rbx, %rdi
// 34: ff 15 00 00 00 00             callq   *(%rip)                 # 0x3a <_JIT_ENTRY+0x3a>
// 0000000000000036:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
// 3a: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
// 3f: 58                            popq    %rax
```

Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):

```asm
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50                            pushq   %rax
// 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
// 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
// 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
// d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
// 12: e8 16 00 00 00                callq   0x2d <PyStackRef_CLOSE>
// 17: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
// 1d: 48 89 df                      movq    %rbx, %rdi
// 20: e8 08 00 00 00                callq   0x2d <PyStackRef_CLOSE>
// 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
// 2a: 58                            popq    %rax
// 2b: eb 11                         jmp     0x3e <_JIT_CONTINUE>
//
// 000000000000002d <PyStackRef_CLOSE>:
// 2d: 40 f6 c7 01                   testb   $0x1, %dil
// 31: 75 04                         jne     0x37 <PyStackRef_CLOSE+0xa>
// 33: ff 0f                         decl    (%rdi)
// 35: 74 01                         je      0x38 <PyStackRef_CLOSE+0xb>
// 37: c3                            retq
// 38: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x3e <_JIT_CONTINUE>
// 000000000000003a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
```

I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
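For context, the "one-character change" lands in the clang argument list quoted earlier in the review. A minimal sketch of what that list looks like after the change (the `CPYTHON = Path(".")` placeholder is an assumption; the real Tools/jit script computes the source tree location and passes many more flags):

```python
from pathlib import Path

CPYTHON = Path(".")  # placeholder; the real script locates the CPython tree

# Simplified sketch of the stencil-compilation argument list:
args = [
    f"-I{CPYTHON / 'Python'}",
    f"-I{CPYTHON / 'Tools' / 'jit'}",
    "-Os",  # the one-character change: previously "-O3"
]
print(args[-1])  # → -Os
```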

@brandtbucher (Member, Author) replied:


Yep, -Oz is about 1-2% slower across the board.

@brandtbucher merged commit c49dc3b into python:main on Jul 9, 2025 (72 checks passed)
AndPuQing pushed a commit to AndPuQing/cpython that referenced this pull request on Jul 11, 2025
Pranjal095 pushed a commit to Pranjal095/cpython that referenced this pull request on Jul 12, 2025
picnixz pushed a commit to picnixz/cpython that referenced this pull request on Jul 13, 2025
taegyunkim pushed a commit to taegyunkim/cpython that referenced this pull request on Aug 4, 2025
Agent-Hellboy pushed a commit to Agent-Hellboy/cpython that referenced this pull request on Aug 19, 2025

Reviewers

@savannahostrowski left review comments

Assignees

@brandtbucher

Labels

interpreter-core (Objects, Python, Grammar, and Parser dirs) · performance (Performance or resource usage) · skip news · topic-JIT


2 participants

@brandtbucher, @savannahostrowski
