Hot-cold splitting for JIT stencils #143158

Open

Description

Fidget-Spinner opened on Dec 25, 2025

Feature or enhancement

Proposal:

We have a textual assembly parser for the stencils. It already knows which blocks are cold and which are hot. Given that, it should not be too hard to teach it to split the blocks into separate hot and cold sections.

Currently, this is _BINARY_OP_ADD_INT:

    // _BINARY_OP_ADD_INT_r23.o:      file format elf64-x86-64
    //
    // Disassembly of section .text:
    //
    // 0000000000000000 <_JIT_ENTRY>:
    //  0: 55                  pushq   %rbp
    //  1: 48 83 ec 10         subq    $0x10, %rsp
    //  5: 48 89 74 24 08      movq    %rsi, 0x8(%rsp)
    //  a: 48 89 fb            movq    %rdi, %rbx
    //  d: 4c 89 fd            movq    %r15, %rbp
    // 10: 4c 89 ff            movq    %r15, %rdi
    // 13: 48 83 e7 fe         andq    $-0x2, %rdi
    // 17: 48 89 de            movq    %rbx, %rsi
    // 1a: 48 83 e6 fe         andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00   callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01         cmpq    $0x1, %rax
    // 28: 75 15               jne     0x3f <_JIT_ENTRY+0x3f>
    // 2a: 49 89 ef            movq    %rbp, %r15
    // 2d: 48 89 df            movq    %rbx, %rdi
    // 30: 48 8b 74 24 08      movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10         addq    $0x10, %rsp
    // 39: 5d                  popq    %rbp
    // 3a: e9 00 00 00 00      jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
    // 3f: 49 89 c7            movq    %rax, %r15
    // 42: 48 89 ef            movq    %rbp, %rdi
    // 45: 48 89 de            movq    %rbx, %rsi
    // 48: 48 83 c4 10         addq    $0x10, %rsp
    // 4c: 5d                  popq    %rbp

With hot-cold splitting, it will be split into:

    _BINARY_OP_ADD_INT_r23.HOT:
    // 0000000000000000 <_JIT_ENTRY>:
    //  0: 55                  pushq   %rbp
    //  1: 48 83 ec 10         subq    $0x10, %rsp
    //  5: 48 89 74 24 08      movq    %rsi, 0x8(%rsp)
    //  a: 48 89 fb            movq    %rdi, %rbx
    //  d: 4c 89 fd            movq    %r15, %rbp
    // 10: 4c 89 ff            movq    %r15, %rdi
    // 13: 48 83 e7 fe         andq    $-0x2, %rdi
    // 17: 48 89 de            movq    %rbx, %rsi
    // 1a: 48 83 e6 fe         andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00   callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01         cmpq    $0x1, %rax
    // 28: 75 15               jne     0x3f <_JIT_ENTRY+0x3f>
    // 3f: 49 89 c7            movq    %rax, %r15
    // 42: 48 89 ef            movq    %rbp, %rdi
    // 45: 48 89 de            movq    %rbx, %rsi
    // 48: 48 83 c4 10         addq    $0x10, %rsp
    // 4c: 5d                  popq    %rbp
    _BINARY_OP_ADD_INT_r23.COLD:
    // 2a: 49 89 ef            movq    %rbp, %r15
    // 2d: 48 89 df            movq    %rbx, %rdi
    // 30: 48 8b 74 24 08      movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10         addq    $0x10, %rsp
    // 39: 5d                  popq    %rbp
    // 3a: e9 00 00 00 00      jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
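To make the splitting step concrete, here is a rough Python sketch of how the blocks could be partitioned. It is only an illustration: `Block`, its `hot` flag, `ends_in_jmp` and `split_hot_cold` are hypothetical names, not what the real stencil parser uses, and instructions are modelled as plain strings with symbolic labels. The sketch assumes the parser has already marked each block hot or cold, and it turns every fall-through into an explicit `jmp` first so that moving a block cannot change where control goes next.

```python
from dataclasses import dataclass


@dataclass
class Block:
    label: str
    instructions: list[str]
    hot: bool = True  # assumption: already decided by the assembly parser


def ends_in_jmp(block: Block) -> bool:
    """True if the block ends in an unconditional jump (no fall-through)."""
    return bool(block.instructions) and block.instructions[-1].split()[0] == "jmp"


def split_hot_cold(blocks: list[Block]) -> tuple[list[Block], list[Block]]:
    """Return (hot, cold) blocks, keeping the original order inside each group."""
    explicit = []
    for i, block in enumerate(blocks):
        instructions = list(block.instructions)
        # Make fall-through explicit before any block is moved.
        if i + 1 < len(blocks) and not ends_in_jmp(block):
            instructions.append(f"jmp {blocks[i + 1].label}")
        explicit.append(Block(block.label, instructions, block.hot))
    hot = [b for b in explicit if b.hot]
    cold = [b for b in explicit if not b.hot]
    return hot, cold


# Simplified stand-in for the three blocks of the stencil above.
blocks = [
    Block("entry", ["cmpq $0x1, %rax", "jne success"]),
    Block("deopt", ["popq %rbp", "jmp _JIT_JUMP_TARGET"], hot=False),
    Block("success", ["movq %rax, %r15", "popq %rbp"]),
]
hot, cold = split_hot_cold(blocks)
```

In this toy input the `entry` block ends up as `jne success; jmp deopt` directly in front of `success`, which is the shape a later jump inversion and zero-length jump removal would clean up.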

Running the existing jump inversion and zero-length jump removal passes then gives us:

    _BINARY_OP_ADD_INT_r23.HOT:
    // 0000000000000000 <_JIT_ENTRY>:
    //  0: 55                  pushq   %rbp
    //  1: 48 83 ec 10         subq    $0x10, %rsp
    //  5: 48 89 74 24 08      movq    %rsi, 0x8(%rsp)
    //  a: 48 89 fb            movq    %rdi, %rbx
    //  d: 4c 89 fd            movq    %r15, %rbp
    // 10: 4c 89 ff            movq    %r15, %rdi
    // 13: 48 83 e7 fe         andq    $-0x2, %rdi
    // 17: 48 89 de            movq    %rbx, %rsi
    // 1a: 48 83 e6 fe         andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00   callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01         cmpq    $0x1, %rax
    // 28: 75 15               je      _BINARY_OP_ADD_INT_r23.COLD
    // 3f: 49 89 c7            movq    %rax, %r15
    // 42: 48 89 ef            movq    %rbp, %rdi
    // 45: 48 89 de            movq    %rbx, %rsi
    // 48: 48 83 c4 10         addq    $0x10, %rsp
    // 4c: 5d                  popq    %rbp
    _BINARY_OP_ADD_INT_r23.COLD:
    // 2a: 49 89 ef            movq    %rbp, %r15
    // 2d: 48 89 df            movq    %rbx, %rdi
    // 30: 48 8b 74 24 08      movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10         addq    $0x10, %rsp
    // 39: 5d                  popq    %rbp
    // 3a: e9 00 00 00 00      jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4

We then lay out the traces using only the HOT sections and leave the COLD sections at the end. I think this is as good as it gets for machine code flow/layout unless we start writing things by hand.
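As a rough illustration of that layout, assuming each stencil has already been split into a hot and a cold byte string, the offsets could be computed as below. `Stencil` and `layout_trace` are hypothetical names for this sketch, and relocation patching is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    name: str
    hot: bytes
    cold: bytes


def layout_trace(trace: list[Stencil]) -> tuple[bytes, list[tuple[int, int]]]:
    """Return the trace's code and a (hot_offset, cold_offset) pair per stencil."""
    offset = 0
    hot_offsets = []
    for stencil in trace:  # hot sections first, back to back
        hot_offsets.append(offset)
        offset += len(stencil.hot)
    cold_offsets = []
    for stencil in trace:  # cold sections after the last hot section
        cold_offsets.append(offset)
        offset += len(stencil.cold)
    code = b"".join(s.hot for s in trace) + b"".join(s.cold for s in trace)
    return code, list(zip(hot_offsets, cold_offsets))


trace = [
    Stencil("A", hot=b"\x01\x02", cold=b"\xaa"),
    Stencil("B", hot=b"\x03", cold=b"\xbb\xcc"),
]
code, offsets = layout_trace(trace)
```

The point of the two passes over the trace is that the hot code of consecutive stencils stays contiguous, so the common path falls straight through from one stencil to the next.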

This builds on #142228.

In the future, to reduce JIT memory usage even further, we can de-duplicate common cold stencil fragments. E.g. if we see multiple _BINARY_OP_ADD_INT_r23 instances in a trace, they can all jump to a common _BINARY_OP_ADD_INT_r23.COLD instead of each stencil having its own copy. That should be a separate PR from this one, however.
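One possible shape for that de-duplication, purely as a sketch: give each cold fragment a key and hand out one offset per distinct key. The assumption here (mine, not something settled in this issue) is that two cold fragments can only be shared if they come from the same stencil and every value patched into them, such as the exit target, is identical. `ColdFragment`, its `jump_target` field and `assign_cold_offsets` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdFragment:
    stencil: str      # e.g. "_BINARY_OP_ADD_INT_r23"
    jump_target: int  # hypothetical: the value patched into the fragment's exit jump
    size: int         # size of the fragment's machine code in bytes


def assign_cold_offsets(
    fragments: list[ColdFragment], start: int = 0
) -> tuple[list[int], int]:
    """Return one offset per fragment, sharing offsets between equal fragments,
    plus the offset just past the last emitted fragment."""
    seen: dict[ColdFragment, int] = {}
    offsets = []
    end = start
    for fragment in fragments:
        if fragment not in seen:  # first occurrence: emit a new copy
            seen[fragment] = end
            end += fragment.size
        offsets.append(seen[fragment])
    return offsets, end


fragments = [
    ColdFragment("_BINARY_OP_ADD_INT_r23", jump_target=100, size=22),
    ColdFragment("_BINARY_OP_ADD_INT_r23", jump_target=100, size=22),
    ColdFragment("_BINARY_OP_ADD_INT_r23", jump_target=200, size=22),
]
offsets, end = assign_cold_offsets(fragments, start=64)
```

In this toy trace the first two fragments share one copy, while the third needs its own because it exits somewhere else.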

I will work on this.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Metadata

Assignees

markshannon

Labels

interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-JIT, type-feature (A feature request or enhancement)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
