Description
Feature or enhancement
Proposal:
We have a textual assembly parser for the stencils. It already knows what blocks are cold and what blocks are hot. With that, it's now not too hard to teach it to section-up blocks.
Currently this is _BINARY_OP_ADD_INT:
// _BINARY_OP_ADD_INT_r23.o:  file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
//  0: 55                      pushq   %rbp
//  1: 48 83 ec 10             subq    $0x10, %rsp
//  5: 48 89 74 24 08          movq    %rsi, 0x8(%rsp)
//  a: 48 89 fb                movq    %rdi, %rbx
//  d: 4c 89 fd                movq    %r15, %rbp
// 10: 4c 89 ff                movq    %r15, %rdi
// 13: 48 83 e7 fe             andq    $-0x2, %rdi
// 17: 48 89 de                movq    %rbx, %rsi
// 1a: 48 83 e6 fe             andq    $-0x2, %rsi
// 1e: ff 15 00 00 00 00       callq   *(%rip)  # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020:  R_X86_64_GOTPCRELX  _PyCompactLong_Add-0x4
// 24: 48 83 f8 01             cmpq    $0x1, %rax
// 28: 75 15                   jne     0x3f <_JIT_ENTRY+0x3f>
// 2a: 49 89 ef                movq    %rbp, %r15
// 2d: 48 89 df                movq    %rbx, %rdi
// 30: 48 8b 74 24 08          movq    0x8(%rsp), %rsi
// 35: 48 83 c4 10             addq    $0x10, %rsp
// 39: 5d                      popq    %rbp
// 3a: e9 00 00 00 00          jmp     0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b:  R_X86_64_PLT32  _JIT_JUMP_TARGET-0x4
// 3f: 49 89 c7                movq    %rax, %r15
// 42: 48 89 ef                movq    %rbp, %rdi
// 45: 48 89 de                movq    %rbx, %rsi
// 48: 48 83 c4 10             addq    $0x10, %rsp
// 4c: 5d                      popq    %rbp
With hot-cold splitting, it will be split into:
_BINARY_OP_ADD_INT_r23.HOT:
// 0000000000000000 <_JIT_ENTRY>:
//  0: 55                      pushq   %rbp
//  1: 48 83 ec 10             subq    $0x10, %rsp
//  5: 48 89 74 24 08          movq    %rsi, 0x8(%rsp)
//  a: 48 89 fb                movq    %rdi, %rbx
//  d: 4c 89 fd                movq    %r15, %rbp
// 10: 4c 89 ff                movq    %r15, %rdi
// 13: 48 83 e7 fe             andq    $-0x2, %rdi
// 17: 48 89 de                movq    %rbx, %rsi
// 1a: 48 83 e6 fe             andq    $-0x2, %rsi
// 1e: ff 15 00 00 00 00       callq   *(%rip)  # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020:  R_X86_64_GOTPCRELX  _PyCompactLong_Add-0x4
// 24: 48 83 f8 01             cmpq    $0x1, %rax
// 28: 75 15                   jne     0x3f <_JIT_ENTRY+0x3f>
// 3f: 49 89 c7                movq    %rax, %r15
// 42: 48 89 ef                movq    %rbp, %rdi
// 45: 48 89 de                movq    %rbx, %rsi
// 48: 48 83 c4 10             addq    $0x10, %rsp
// 4c: 5d                      popq    %rbp
_BINARY_OP_ADD_INT_r23.COLD:
// 2a: 49 89 ef                movq    %rbp, %r15
// 2d: 48 89 df                movq    %rbx, %rdi
// 30: 48 8b 74 24 08          movq    0x8(%rsp), %rsi
// 35: 48 83 c4 10             addq    $0x10, %rsp
// 39: 5d                      popq    %rbp
// 3a: e9 00 00 00 00          jmp     0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b:  R_X86_64_PLT32  _JIT_JUMP_TARGET-0x4
Running the current jump inversion and zero-length jump removal then gives us:
_BINARY_OP_ADD_INT_r23.HOT:
// 0000000000000000 <_JIT_ENTRY>:
//  0: 55                      pushq   %rbp
//  1: 48 83 ec 10             subq    $0x10, %rsp
//  5: 48 89 74 24 08          movq    %rsi, 0x8(%rsp)
//  a: 48 89 fb                movq    %rdi, %rbx
//  d: 4c 89 fd                movq    %r15, %rbp
// 10: 4c 89 ff                movq    %r15, %rdi
// 13: 48 83 e7 fe             andq    $-0x2, %rdi
// 17: 48 89 de                movq    %rbx, %rsi
// 1a: 48 83 e6 fe             andq    $-0x2, %rsi
// 1e: ff 15 00 00 00 00       callq   *(%rip)  # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020:  R_X86_64_GOTPCRELX  _PyCompactLong_Add-0x4
// 24: 48 83 f8 01             cmpq    $0x1, %rax
// 28: 75 15                   je      _BINARY_OP_ADD_INT_r23.COLD
// 3f: 49 89 c7                movq    %rax, %r15
// 42: 48 89 ef                movq    %rbp, %rdi
// 45: 48 89 de                movq    %rbx, %rsi
// 48: 48 83 c4 10             addq    $0x10, %rsp
// 4c: 5d                      popq    %rbp
_BINARY_OP_ADD_INT_r23.COLD:
// 2a: 49 89 ef                movq    %rbp, %r15
// 2d: 48 89 df                movq    %rbx, %rdi
// 30: 48 8b 74 24 08          movq    0x8(%rsp), %rsi
// 35: 48 83 c4 10             addq    $0x10, %rsp
// 39: 5d                      popq    %rbp
// 3a: e9 00 00 00 00          jmp     0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b:  R_X86_64_PLT32  _JIT_JUMP_TARGET-0x4
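For readers unfamiliar with the two passes mentioned above, here is a rough Python sketch of the idea. It is not the existing optimizer code: the `INVERSE` table, both function names, and the label strings are made up for illustration, and instructions are modelled as plain `(mnemonic, operand)` tuples.

```python
# Hypothetical table: a few x86-64 conditional jumps and their logical inverses.
INVERSE = {
    "je": "jne", "jne": "je",
    "jl": "jge", "jge": "jl",
    "jg": "jle", "jle": "jg",
}


def invert_branch(mnemonic: str, target: str, fallthrough: str) -> tuple[str, str, str]:
    """Swap the two arms of a conditional branch.

    The old fallthrough block becomes the explicit jump target, and the
    old target becomes the block that execution falls through into.
    """
    return INVERSE[mnemonic], fallthrough, target


def drop_zero_length_jump(
    instructions: list[tuple[str, str]], next_label: str
) -> list[tuple[str, str]]:
    """Drop a trailing `jmp next_label`, which only jumps to the next instruction."""
    if instructions and instructions[-1] == ("jmp", next_label):
        return instructions[:-1]
    return instructions


# Mirrors the listing above: `jne <hot tail>` with a cold fallthrough
# becomes `je <cold>` with the hot tail as the fallthrough.
print(invert_branch("jne", "hot_tail", "cold"))
```

With the inverted branch, the hot tail directly follows the comparison, so the common path no longer takes a jump.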
We then lay out the traces using only the HOT sections and leave the COLD sections at the end. I think this is as good as it gets for machine code flow/layout unless we start writing things by hand.
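As a minimal sketch of that layout step (the `Stencil` class, its fields, and `layout_trace` are hypothetical names, not the JIT's actual data structures), all HOT parts are emitted back to back and the COLD parts are appended afterwards:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    """Hypothetical container: one stencil's code, already split in two."""
    name: str
    hot: bytes
    cold: bytes


def layout_trace(stencils: list[Stencil]) -> tuple[bytes, dict[str, int]]:
    """Concatenate every hot part first, then every cold part.

    Returns the combined code plus the start offset of each section,
    keyed as "<index>:<name>.HOT" or "<index>:<name>.COLD".
    """
    code = bytearray()
    offsets: dict[str, int] = {}
    for i, stencil in enumerate(stencils):
        offsets[f"{i}:{stencil.name}.HOT"] = len(code)
        code += stencil.hot
    for i, stencil in enumerate(stencils):
        offsets[f"{i}:{stencil.name}.COLD"] = len(code)
        code += stencil.cold
    return bytes(code), offsets


trace = [Stencil("A", b"\x01\x02", b"\xaa"), Stencil("B", b"\x03", b"\xbb\xcc")]
code, offsets = layout_trace(trace)
print(code.hex(), offsets)
```

The recorded offsets are what a later patching step would need in order to point each hot section's branch at its cold counterpart.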
This builds on #142228.
In the future, to reduce the jitted memory even further, we can de-duplicate common cold stencil fragments. E.g. if we see multiple _BINARY_OP_ADD_INT_r23 in a trace, they can all jump to the common _BINARY_OP_ADD_INT_r23.COLD instead of having one copy for each stencil. That should be a separate PR from this, however.
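A rough sketch of what that de-duplication could look like, under the assumption that two cold fragments may share one copy only when their name and their final bytes are identical (fragments whose patched jump targets differ would still need separate copies). The function name and the input shape are hypothetical:

```python
def layout_cold_deduplicated(
    cold_parts: list[tuple[str, bytes]],
) -> tuple[bytes, list[int]]:
    """Emit each distinct (name, bytes) cold fragment once.

    Returns the combined cold code and, for every input fragment in
    order, the offset of the copy it should jump to.
    """
    code = bytearray()
    seen: dict[tuple[str, bytes], int] = {}
    offsets: list[int] = []
    for name, body in cold_parts:
        key = (name, body)
        if key not in seen:
            # First occurrence: append the bytes and remember where they start.
            seen[key] = len(code)
            code += body
        offsets.append(seen[key])
    return bytes(code), offsets


parts = [
    ("_BINARY_OP_ADD_INT_r23.COLD", b"\x01\x02"),
    ("OTHER.COLD", b"\x03"),
    ("_BINARY_OP_ADD_INT_r23.COLD", b"\x01\x02"),
]
print(layout_cold_deduplicated(parts))
```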
I will work on this.
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response