Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork33.7k
gh-106581: SplitCALL_PY_EXACT_ARGS into uops#107760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
brandtbucher left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks for tackling this, it definitely doesn't look easy. It's sort of a bummer that we need to special-case this much stuff, but I also don't see a nicer way of handling these issues than what you have here.
A few comments and questions, mostly for my own understanding:
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
gvanrossum left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'll add that assert; then I'll review your PR, and hopefully you can then merge that, and I can handle the merge fallout.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Python/bytecodes.c Outdated
| DEOPT_IF(!_PyThreadState_HasStackSpace(tstate,code->co_framesize),CALL); | ||
| } | ||
| op(_INIT_CALL_PY_EXACT_ARGS, (method,callable,args[oparg]--new_frame:_PyInterpreterFrame*)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
As this makes a frame, perhaps rename it to_MAKE_FRAME?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I figured there will be other uops making frames once we try to split CALL_PY_WITH_DEFAULTS, CALL_NO_KW_ALLOC_AND_ENTER_INIT, as well as BINARY_SUBSCR_GETITEM, LOAD_ATTR_PROPERTY, LOAD_ATTR_GETATTRIBUTE_OVERRIDDEN.
Anyway, uop names are easily changed.
| SKIP_OVER(INLINE_CACHE_ENTRIES_CALL); | ||
| } | ||
| op(_PUSH_FRAME, (new_frame:_PyInterpreterFrame*--unused)) { |
markshannonAug 9, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Since_PUSH_FRAME is justframe->return_offset = 0; DISPATCH_INLINED(new_frame);, it would make sense to spell outDISPATCH_INLINED to clarify which operations that need to be different for tier 1 and tier 2.
Something like:
op(_PUSH_FRAME, (new_frame:_PyInterpreterFrame*--unused)) {SAVE_FRAME_STATE();// Equivalent to frame->prev_instr = next_instr - 1; _PyFrame_SetStackPointer(frame, stack_pointer);frame->return_offset=0;new_frame->previous=frame;frame=cframe.current_frame=new_frame;CALL_STAT_INC(inlined_py_calls);if (_Py_EnterRecursivePy(tstate)) { gotoexit_unwind; }START_FRAME();// Equivalent to next_instr = frame->prev_instr + 1; stack_pointer =stack_pointer=_PyFrame_GetStackPointer(frame);}
markshannonAug 9, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
For example:2b3e6f2
In which case the code generators needs to know to push the temporary stack values to the real stack beforeSAVE_FRAME_STATE()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I have to study this more. A problem is that the Tier 1 and Tier 2 versions of_PUSH_FRAME are so different. I am working on a mechanism to be able to say
#if TIER_ONE<code for Tier 1>#else<code for Tier 2>#endifI'm not sure yet what you mean with your last remark about pushing temp stack values.
gvanrossumAug 9, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Comparing carefully the two versions ofDISPATCH_INLINED (addingframe->return_offset = 0 which precedes it in both cases):
In Tier 1:
frame->return_offset = 0; assert(tstate->interp->eval_frame == NULL); _PyFrame_SetStackPointer(frame, stack_pointer); frame->prev_instr = next_instr - 1; (NEW_FRAME)->previous = frame; frame = cframe.current_frame = (NEW_FRAME); CALL_STAT_INC(inlined_py_calls); goto start_frame;In Tier 2:
frame->return_offset = 0; assert(tstate->interp->eval_frame == NULL); _PyFrame_SetStackPointer(frame, stack_pointer); frame->prev_instr -= 1; (NEW_FRAME)->previous = frame; frame = tstate->cframe->current_frame = (NEW_FRAME); CALL_STAT_INC(inlined_py_calls); stack_pointer = _PyFrame_GetStackPointer(frame); ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;Diff:
@@ -1,8 +1,9 @@ frame->return_offset = 0; assert(tstate->interp->eval_frame == NULL); _PyFrame_SetStackPointer(frame, stack_pointer);- frame->prev_instr = next_instr - 1;+ frame->prev_instr -= 1; (NEW_FRAME)->previous = frame;- frame = cframe.current_frame = (NEW_FRAME);+ frame = tstate->cframe->current_frame = (NEW_FRAME); CALL_STAT_INC(inlined_py_calls);- goto start_frame;+ stack_pointer = _PyFrame_GetStackPointer(frame);+ ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
Note that the Tier 2 version must be preceded by aSAVE_IP, which does the equivalent offrame->prev_instr = next_instr. If we had a Tier 1 version ofSAVE_IP we could include it in the macro definition:
macro(CALL_PY_EXACT_ARGS) = unused/1 + // Skip over the counter _CHECK_PEP_523 + _CHECK_FUNCTION_EXACT_ARGS + _CHECK_STACK_SPACE + _INIT_CALL_PY_EXACT_ARGS + SAVE_IP + // <-------------- added _PUSH_FRAME;which would reduce the special-casing in the code generator a bit (it would still need to do something special forSAVE_IP to ensure that itsoparg has the right value, different from theoparg of the macro (which is the argument count). This would take care of the first diff chunk (what to assign toframe->prev_inst), but it would still be pretty fragile. (Like my current version, it would entice the optimizer to incorrectly try to remove theSAVE_IP uop.)
The second diff chunk relates to how we setcframe.current_frame -- in Tier 2 we must access this through thetstate.
The third and final diff chunk relates to really start using the new frame. In Tier 1, this must actually do the following:
- Check recursion
- Load
stack_pointer - Load
next_instr - Dispatch to the next opcode.
This is done by the code atstart_frame.
In Tier 2 there is nostart_frame label (the only uop that can go to a label isEXIT_TRACE, and of courseDEOPT_IF andERROR_IF also jump). So we loadstack_frame here. There is no direct equivalent tonext_instr, but we have to setip_offset, whichSAVE_IP adds to itsoparg to get theprev_instr value. (This variable is a cache forframe->code->co_code_adaptive, to save some memory loads, so wheneverframe changes we must update it.)
(More later.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
There's another thing though, and I think that is what Mark meant. In Tier 1 the code generation for macros is special-cased for_PUSH_FRAME so that both the stack adjustmentand thenext_instr adjustment are emittedbefore the_PUSH_FRAME opcode. This is done so that the flushing of these variables to the frame in theDISPATCH_INLINED macro flush the correct values.
But this is really ugly and unprincipled, and the logic is much hairier than the other special cases for_PUSH_FRAME. One of Mark's ideas here is to make this special case look for uops using theSAVE_FRAME_STATE macro rather than for the specific uop_PUSH_FRAME. But detecting when to trigger the special case is only part of the problem -- IMO the worse problem is that the special case itself is so ugly:
dispatch_inlined_special_case = False if mgr is managers[-1] and mgr.instr.always_exits.startswith("DISPATCH_INLINED") and mgr.instr.name == "_PUSH_FRAME": dispatch_inlined_special_case = True temp = mgr.final_offset.clone() temp.deeper(StackEffect(UNUSED)) # Hack out.stack_adjust(temp.deep, temp.high) # Use clone() since adjust_inverse() mutates final_offset mgr.adjust_inverse(mgr.final_offset.clone()) if cache_adjust: out.emit(f"next_instr += {cache_adjust};")The last 4 lines here, starting with# Use clone(), occur further down too, for the normal case (after the final uop). I don't even recall why thetemp.deeper() call is needed!
I'll mull this over some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think I have addressed this.@markshannon Please have another look. Assuming the tests look okay I'll un-draft this.
Uh oh!
There was an error while loading.Please reload this page.
gvanrossum commentedAug 9, 2023
Made this back into a draft; I need to (a) wait for Brandt'sgh-107788, then (b) redo the split and tooling changes using Mark's ideas. |
brandtbucher commentedAug 9, 2023
The |
This is only the first step for doing `CALL` in Tier 2.The next step involves tracing into the called code object.After that we'll have to do the remaining `CALL` specialization.Finally we'll have to tweak various things like `KW_NAMES`,and possibly move the `NULL` (for method calls) *above* the callable.But those are things for future PRs.Note: this moves setting `frame->return_offset` directly in front of`DISPATCH_INLINED()`, to make it easier to move it into `_PUSH_FRAME`.
ambv commentedAug 11, 2023
Closing and re-opening to retrigger CLA checks. Sorry for the noise. |
Instead, the special case is an opcode using SAVE_FRAME_STATE().Introducing #if TIER_ONE and #if TIER_TWO so we can implement_PUSH_FRAME differently for both tiers.
Instead, we special-case SAVE_IP:- Its Tier 2 expansion sets oparg to the instruction offset- In Tier 1 it is a no-op (and skipped if present in a macro)
gvanrossum commentedAug 16, 2023
@markshannon I was hoping you'd review this. I added Unless you'd rather review#107925, which includes this (and#107793, which is the intermediate stage). |
markshannon left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm uneasy about the introduction of theTIER_ONE andTIER_TWO macros.
It is a principle of the overall design that there is a single source of truth for the semantics of bytecodes.
It might appear that I'm being dogmatic, but the need for something like those macros often indicates an underlying problem that should be fixed independently.
In this case the problem is thecframe. Loading and saving the IP needs to handled specially anyway and saving and loading the SP should be the same for both interpreters (but will need to be handled specially by the copy-and-patch compiler, so should be its own micro-op).
It is pushes the frame that differs. Removingcframe will fix that.
Thecframe only exists as a performance hack to minimize the impact of tracing prior to PEP 669.
Python/executor.c Outdated
| #defineENABLE_SPECIALIZATION 0 | ||
| #undef SAVE_FRAME_STATE | ||
| #defineSAVE_FRAME_STATE() \ |
markshannonAug 16, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Rather than a macro, I think the code generator needs to understand this.
Given thatSAVE_FRAME_STATE is basicallySAVE_CURRENT_IP followed by_PyFrame_SetStackPointer(frame, stack_pointer); we could convert it to two micro-ops:SAVE_CURRENT_IP andSAVE_SP.
In general, we want to avoid macros in the generated C code.
The generated code can be explicit and verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Well, there are already many macros (and static inline functions) in the generated code. The generatorrecognizes the presence ofSAVE_FRAME_STATE(), but it doesn'texpand it -- the C preprocessor can do that for us more easily. Currently we only do the expansion in the generator for things whose expansion requires information that only the generator has (like the stack adjustment forERROR_IF).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
You are proposing that the macro expansion forCALL_PY_EXACT_ARGS become
macro(CALL_PY_EXACT_ARGS) = unused/1 + // Skip over the counter _CHECK_PEP_523 + _CHECK_FUNCTION_EXACT_ARGS + _CHECK_STACK_SPACE + _INIT_CALL_PY_EXACT_ARGS + SAVE_IP + // Tier 2 only; special-cased oparg SAVE_CURRENT_IP + // <------------------- Addition _PUSH_FRAME;whereSAVE_CURRENT_IP is something like this:
op(SAVE_CURRENT_IP, (--)) { #if TIER_ONE frame->prev_instr = next_instr - 1; #endif #if TIER_TWO frame->prev_instr--; #endif}Or we could special-case its expansion in the generator, potayto-potato. But ithas to differ between tiers because in Tier 1 it must store next_instr whereas in Tier 2 it must rely on the preceding SAVE_IP to setframe->prev_instr. (Ideally at some point in the future we won't need theprev_instr-- yet, but that's a tricky change.)
The_PyFrame_SetStackPointer(frame, stack_pointer); call should be moved back into_PUSH_FRAME (at the point where I currently callSAVE_FRAME_STATE).
If I can get this to work I'll apply it and merge the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Idid get this working (see05af848), and will test and benchmark it before merging it.
Note that there are still some#if TIER_ONE and#if TIER_TWO sections, but they are unavoidable.
Python/ceval_macros.h Outdated
| DISPATCH_GOTO(); \ | ||
| } | ||
| #defineSAVE_FRAME_STATE() \ |
markshannonAug 16, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
See my comment below about splitting this intoSAVE_CURRENT_IP; SAVE_SP
gvanrossum commentedAug 16, 2023
Benchmark: 1.00x faster:https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20230816-3.13.0a0-05af848 IOW it doesn't slow CALL_PY_EXACT_ARGS down, which is all I care about. |
bedevere-bot commentedAug 17, 2023
|
This finishes the work begun ingh-107760. When, while projecting a superblock, we encounter a call to a short, simple function, the superblock will now enter the function using `_PUSH_FRAME`, continue through it, and leave it using `_POP_FRAME`, and then continue through the original code. Multiple frame pushes and pops are even possible. It is also possible to stop appending to the superblock in the middle of a called function, when running out of space or encountering an unsupported bytecode.
Uh oh!
There was an error while loading.Please reload this page.
This is only the first step for doing
CALLin Tier 2. The next step involves tracing into the called code object. After that we'll have to do the remainingCALLspecialization. Finally we'll have to tweak various things likeKW_NAMES, and possibly move theNULL(for method calls)above the callable (that's 107788). But those are things for future PRs.Note: this moves setting
frame->return_offsetdirectly in front ofDISPATCH_INLINED(), to make it easier to move it into_PUSH_FRAME.CALL's stack #107788_Py_EnterRecursivePy