I know it isn't good practice to bundle multiple changes into the same PR, but this PR does two things which are required to get tail calling working on MSVC and which I think are beneficial.

Move the _PyStackRef off a per-ceval allocated buffer into a per-thread one. This actually reduces the C stack consumption of ceval.c. I had planned this for some time (seeMove stackref buffer to thread state to reduce interp stack usage #138115), but we also need this to let MSVC know the variables are fine to outlive their current scope. This may be a tiny bit slower versus allocating it on the stack, but it removes the performance cliff of having any function calling more than 10 arguments suddenly getting a lot slower due to having to allocate memory. Overall, I think this is a better design choice for the two reasons of less C stack consumption and allowing longer calls.
Add__restrict to correct places and scoping where needed to tell MSVC that a local variable does not outlive the scope. We only use this for MSVC for now as other compilers have different qualifier rules. However, it's definitely possible to extend this to clang and gcc to let them benefit in the future.

There is no noticeable perf drop on a nogil benchmark for the normal interpreterhttps://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251011-3.15.0a0-0786133-NOGIL

Issue:Tail calling interpreter for MSVC #139922

Fidget-Spinnerand others added7 commits

August 21, 2025 22:20

Brandt's changes

82d1259

Co-Authored-By: Brandt Bucher <brandt@python.org>

Hulon's and mine changes

3248658

Co-Authored-By: Hulon Jenkins <109993038+hulonjenkins@users.noreply.github.com>

push stackrefs into tstate

9ac430d

Fix last few remaining problems

085c1d7

fix handling of recursion

0b12f2e

Changes by Chris to support clang-cl

acf48f5

Co-Authored-By: Chris Eibl <138194463+chris-eibl@users.noreply.github.com>

Make the restrict MSVC only for now

35e96c1

Fidget-Spinner requested review froma team,ZeroIntensity,ericsnowcurrently,markshannon andmethane ascode owners

October 11, 2025 16:45

bedevere-appbot added the awaiting core review label

Oct 11, 2025

bedevere-appbot mentioned this pull request

Oct 11, 2025

Tail calling interpreter for MSVC#139922

Open

Fidget-Spinnerand others added7 commits

October 11, 2025 17:47

Merge remote-tracking branch 'upstream/main' into msvc-tailcall-take-two

d7737e9

Change restricts to MSVC only

86f19cf

📜🤖 Added by blurb_it.

40013cc

Reduce diff

48db59e

Merge branch 'msvc-tailcall-take-two' of github.com:Fidget-Spinner/cp…

bc9d23c

…ython into msvc-tailcall-take-two

Work around specialization

19e02c2

Fix VS 2026

66ec774

ZeroIntensity reviewed

Oct 11, 2025

View reviewed changes

Include/pyport.h OutdatedShow resolvedHide resolved

Fidget-Spinner added3 commits

October 12, 2025 02:27

fix a typo

50f8ff7

Move to macros to internal header

5d908b4

Merge branch 'msvc-tailcall-take-two' of github.com:Fidget-Spinner/cp…

0786133

…ython into msvc-tailcall-take-two

zooba reviewed

Oct 13, 2025

View reviewed changes

Copy link

Member

zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Very neat, looks good to me, but don't let me be the final reviewer on this.

Include/internal/pycore_tstate.h

		#endif

		/* How much scratch space to give stackref to PyObject* conversion. */
		#defineMAX_STACKREF_SCRATCH 1024

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

How arbitrary is this amount? What is the usual amount of space required for conversions?

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We used to arbitrarily always consume 10 PyObject/PyStackRef per call. Meaning even if you had a single item, the whole 10 slot space would be used.

Python/bytecodes.c Outdated

		inst(LOAD_BUILD_CLASS, (--bc)) {
		PyObject*bc_o;
		interr=PyMapping_GetOptionalItem(BUILTINS(),&_Py_ID(__build_class__),&bc_o);
		PyObject*Py_MSVC_RESTRICTbc_o;

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I wonder if there's a more semantically meaningful name thanPy_MSVC_RESTRICT here? "Unaliased" or "no escape"?

And possibly define it as a macro like#define UNALIASED(x) x Py_MSVC_RESTRICT to getUNALIASED(PyObject *) bc_o?

Just looking to reduce the sense of "oh I have to go look up this MSVC thing to know what's going on here" when reading the code. There may be other ways.

Python/bytecodes.c Outdated

Comment on lines 2294 to 2296

		int*Py_MSVC_RESTRICTmethod_found_ptr=&method_found;
		attr_o=_PySuper_Lookup(cls,self,name,
		Py_TYPE(self)->tp_getattro==PyObject_GenericGetAttr ?method_found_ptr :NULL);

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Indent? Like in other places

Python/bytecodes.c Outdated

		PyObject*stack[]= {class,self};
		PyObject*super=PyObject_Vectorcall(global_super,stack,oparg&2,NULL);
		PyObject*super;
		Py_BEGIN_LOCALS_MUST_NOT_ESCAPE();

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We usually don't require parentheses on macros like this, especially when they define a scope.

PCbuild/pythoncore.vcxproj

		<ClCompileInclude="..\Python\brc.c" />
		<ClCompileInclude="..\Python\ceval.c" />
		<ClCompileInclude="..\Python\ceval.c">
		<AdditionalOptionsCondition="'$(UseTailCallInterp)' == 'true' and $(PlatformToolset) != 'ClangCL'">/std:clatest %(AdditionalOptions)</AdditionalOptions>

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What happens if we pass this option to clang-cl? Does it break?

Any possibility of passing a specific /std:c rather than "latest"?

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So I've worked with Chris on this and apparently it builds with clang-cl still the last time I checked with him?@chris-eibl

Copy link

Member

chris-eiblOct 13, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What happens if we pass this option to clang-cl? Does it break?

Yupp, clang-cl doesn't like/std:clatest, hence I omitted it, because it doesn't need it.
Seems the clang-cl versions we are using (>=19)

support by default at least C99 (whichis would be needed forrestrict)
and support must tail, etc, in the "clang specific notation"

But I assume MSVC needs/std:clatest for [[msvc::musttail]] -@Fidget-Spinner correct?
And at least MSVC definitely needs at least/std:c11 for restrict (https://learn.microsoft.com/en-us/cpp/build/reference/std-specify-language-standard-version?view=msvc-170).

Any possibility of passing a specific /std:c rather than "latest"?

Yeah, we could explicitely enforce/std:c11for clang-cl, but we didn't need it in the past and AFAICT this PR didn't change anything that clang-cl sees, due to carefully crafted combinations of_MSC_VER and__clang__ like

#if defined(_MSC_VER) && !defined(__clang__)

and apparently it builds with clang-cl still the last time I checked with him?

Yes, it still does. Having done a lot of builds recently and still playing around.
And tail calling CI is green as well:https://github.com/python/cpython/actions/runs/18435075900/job/52527447202

Copy link

MemberAuthor

Fidget-SpinnerOct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm under the impression we need clatest for msvc musttail as well.

Copy link

Member

chris-eiblOct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I've just verified again:

MSVC needs/std:clatest for[[msvc::musttail]]
using/std:c17 fails witherror C2143: syntax error: missing ';' before '['
clang-cl doesn't accept/std:clatest

So atm we're stuck with this construct until MSVC introduces a new "fixed" option (or maybe accepts[[msvc::musttail]] in/std:c17).

I agree with Steve#135927 (comment)

For the record, I don't like building with Clatest and if we're considering making that change upstream, we should pick a version.

and we should revisit this once we know a fixed version option for it.

Fidget-Spinnerand others added3 commits

October 14, 2025 04:49

Reduce number of restricts

e699d40

Co-Authored-By: Chris Eibl <138194463+chris-eibl@users.noreply.github.com>

Merge branch 'msvc-tailcall-take-two' ofhttps://github.com/Fidget-Sp…

66d6c39

…inner/cpython into msvc-tailcall-take-two

Reduce restrict use even more, reduce usage

5584fec

zooba reviewed

Oct 13, 2025

View reviewed changes

Include/internal/pycore_ceval.h Outdated

		# definePy_NO_INLINE_MSVC_TAILCALL
		#endif

		// Tells the compiler that this variable cannot be alised.

Copy link

Member

zoobaOct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Grammar nit, but "cannot be aliased" is ambiguous (because it's not clear whether we are choosing not to alias it, or telling the compiler that it cannot choose to alias it).

Ithink we're promising that it won't be aliased? So "Tells the compiler to trust that we haven't aliased this variable" better explains what it's doing. Or just "... this variable is not going to be aliased".

Alternatively, "Tells the compiler not to alias the variable" (but I don't think that even makes sense, let alone being correct here).

Copy link

MemberAuthor

Fidget-Spinner commentedOct 14, 2025

I know it isn't good practice to bundle multiple changes into the same PR
Indeed not. Can we move all therestrict annotations to their own PR? Are they MSVC specific? If therestrict annotation is correct, we should add them for all platforms. If they are not correct we shouldn't add them anywhere.

Fair. This PR will be just fixing the stackref buffer then.

markshannon reviewed

Oct 14, 2025

View reviewed changes

Include/internal/pycore_ceval.h Outdated

		#defineSPECIAL_MAX 3

		// Special counterparts of ceval functions for performance reasons
		PyAPI_FUNC(int)_PyEval_Mapping_GetOptionalItem(PyObjectobj,PyObjectkey,PyObject**result);

Copy link

Member

markshannonOct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is the only difference the restrict annotation?
We could just add the annotation toPyMapping_GetOptionalItem theresult pointer can never alias the other parameters.

Copy link

MemberAuthor

Fidget-SpinnerOct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes. We can't becausePyMapping_GetOptionalItem is part of the public API, and we can't change the signature of it.

Copy link

Member

markshannonOct 14, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We aren't changing the signature, just telling the compiler than one of its parameters can't alias the others.
I'm surprised we need to do this anyway. I would have thought that strict aliasing already tells MSVC that they can't alias.

Copy link

Member

chris-eiblOct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

My understanding is, that strict aliasing can't help here, since in

PyMapping_GetOptionalItem(PyObject *obj, PyObject *key, PyObject **result)

all parameters are of typePyObject. And by usingrestrict we tell the compiler that "we assure you, that none of thesePyObjects point to the same memory", and if we brake that contract, UB kicks in.

Copy link

Member

markshannon commentedOct 14, 2025

What isnecessary for tail calling?

IIUC, the restict annotations and the tighter scoping around arrays in the interpreter are necessary, but moving the temporary arrays from the C stack to the threadstate are not. Is that correct?

Copy link

MemberAuthor

Fidget-Spinner commentedOct 14, 2025

IIUC, the restict annotations and the tighter scoping around arrays in the interpreter are necessary, but moving the temporary arrays from the C stack to the threadstate are not. Is that correct?

No this is needed too. The problem is that the temporary arrays from the C stack are local variables. These variables are passed around in ways that are unpredictable to MSVC (such as passing to an arbitrary vectorcall function pointer). So we have to move this to thread state to get tail calling.

There are other side benefits and I've wanted to do this for awhile, so I thought might as well put this in this PR. Do you still want to split up the restrict stuff knowing that we need both to get TC on MSVC?

Copy link

Member

markshannon commentedOct 15, 2025

This PR should contain everything that is necessary for tailcalling. If that includes moving the temporary arrays then move them.

But, why is it necessary to move the arrays? If this arrayhttps://github.com/python/cpython/pull/139962/files#diff-729a985b0cb8b431cb291f1edb561bbbfea22e3f8c262451cd83328a0936a342R3522 is made OK by adding parentheses, then why not the temporary arrays? In fact, why do we need to insert the parentheses in bytecodes.c at all. Why not do it in the code generator as we discussed a while ago?

Copy link

MemberAuthor

Fidget-Spinner commentedOct 15, 2025•
edited
Loading

I'm not sure what MSVC is doing under the hood here, but the last time I tried, just adding scoping around the temporary arrays doesn't work.

It seems to be more lines of code and more maintainence burden to add functionality to the cases generator to detect what is the smallest possible scope to wrap a escaping local variable around. We dont have that many cases in bytecodes.c so I think we should just do it manually. Note that we cant just add curly braces, we need to pass variables around too in certain cases.

Fidget-Spinner mentioned this pull request

Oct 18, 2025

gh-139109: A new tracing JIT compiler frontend for CPython#140310

Merged

Copy link

MemberAuthor

Fidget-Spinner commentedOct 22, 2025

@markshannon gentle ping to see the latest comment please, thanks!

Apply Chris' changes

7eeeaa8

Co-Authored-By: Chris Eibl <138194463+chris-eibl@users.noreply.github.com>

chris-eibl mentioned this pull request

Oct 24, 2025

We should fail to compile if _Py_TAIL_CALL_INTERP is set butpreserve_none andmusttail do not exist.#140513

Closed

Fidget-Spinnerand others added3 commits

November 4, 2025 20:54

Merge remote-tracking branch 'upstream/main' into msvc-tailcall-take-two

6f3d525

Merge remote-tracking branch 'upstream/main' into msvc-tailcall-take-two

81618e2

Use choco for now to get VS 2026

2008d1d

Fidget-Spinner requested review fromAA-Turner,ezio-melotti andhugovk ascode owners

November 11, 2025 22:49

Fidget-Spinner added7 commits

November 11, 2025 22:49

Update tail-call.yml

7c84388

Update tail-call.yml

68b41cf

Update tail-call.yml

c7316fc

Update tail-call.yml

9214d5b

Update tail-call.yml

52c6f9c

Update tail-call.yml

34d98d3

Update tail-call.yml

7ec626e

Copy link

MemberAuthor

Fidget-Spinner commentedNov 11, 2025

Sorry for the noise. Trying to get CI to download VS 2026 programatically and compile.

Fidget-Spinner added2 commits

November 11, 2025 23:48

Update tail-call.yml

ad1c5a2

Update tail-call.yml

4155337

Copy link

MemberAuthor

Fidget-Spinner commentedNov 11, 2025

CI wranging is too difficult. I wail wait till GHA updates itself then use that. Sorry for the noise everyone!

Fidget-Spinner added a commit that referenced this pull request

Nov 13, 2025

gh-139109: A new tracing JIT compiler frontend for CPython (GH-140310)

4fa80ce

This PR changes the current JIT model from trace projection to trace recording. Benchmarking: better pyperformance (about 1.7% overall) geomean versus currenthttps://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251108-3.15.0a1%2B-7e2bc1d-JIT/bm-20251108-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-7e2bc1d-vs-base.svg, 100% faster Richards on the most improved benchmark versus the current JIT. Slowdown of about 10-15% on the worst benchmark versus the current JIT. **Note: the fastest version isn't the one merged, as it relies on fixing bugs in the specializing interpreter, which is left to another PR**. The speedup in the merged version is about 1.1%.https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251112-3.15.0a1%2B-f8a764a-JIT/bm-20251112-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-f8a764a-vs-base.svgStats: 50% more uops executed, 30% more traces entered the last time we ran them. It also suggests our trace lengths for a real trace recording JIT are too short, as a lot of trace too long abortshttps://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251023-3.15.0a1%2B-eb73378-CLANG%2CJIT/bm-20251023-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-eb73378-pystats-vs-base.md .This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this workgh-140277The optimizer stack space check is disabled, as it's no longer valid to deal with underflow.Pros:* Ignoring the generated tracer code as it's automatically created, this is only additional 1k lines of code. The maintenance burden is handled by the DSL and code generator.* `optimizer.c` is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace.* The new JIT frontend is able to handle a lot more control-flow than the old one.* Tracing is very low overhead. We use the tail calling interpreter/computed goto interpreter to switch between tracing mode and non-tracing mode. I call this mechanism dual dispatch, as we have two dispatch tables dispatching to each other. Specialization is still enabled while tracing.* Better handling of polymorphism. We leverage the specializing interpreter for this.Cons:* (For now) requires tail calling interpreter or computed gotos. This means no Windows JIT for now :(. Not to fret, tail calling is coming soon to Windows though#139962Design:* After each instruction, the `record_previous_inst` function/label is executed. This does as the name suggests.* The tracing interpreter lowers bytecode to uops directly so that it can obtain "fresh" values at the point of lowering.* The tracing version behaves nearly identical to the normal interpreter, in fact it even has specialization! This allows it to run without much of a slowdown when tracing. The actual cost of tracing is only a function call and writes to memory.* The tracing interpreter uses the specializing interpreter's deopt to naturally form the side exit chains. This allows it to side exit chain effectively, without repeating much code. We force a re-specializing when tracing a deopt.* The tracing interpreter can even handle goto errors/exceptions, but I chose to disable them for now as it's not tested.* Because we do not share interpreter dispatch, there is should be no significant slowdown to the original specializing interpreter on tailcall and computed got with JIT disabled. With JIT enabled, there might be a slowdown in the form of the JIT trying to trace.* Things that could have dynamic instruction pointer effects are guarded on. The guard deopts to a new instruction --- `_DYNAMIC_EXIT`.

StanFromIreland pushed a commit to StanFromIreland/cpython that referenced this pull request

Dec 6, 2025

pythongh-139109: A new tracing JIT compiler frontend for CPython (pyt…

6ddc353

…honGH-140310)This PR changes the current JIT model from trace projection to trace recording. Benchmarking: better pyperformance (about 1.7% overall) geomean versus currenthttps://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251108-3.15.0a1%2B-7e2bc1d-JIT/bm-20251108-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-7e2bc1d-vs-base.svg, 100% faster Richards on the most improved benchmark versus the current JIT. Slowdown of about 10-15% on the worst benchmark versus the current JIT. **Note: the fastest version isn't the one merged, as it relies on fixing bugs in the specializing interpreter, which is left to another PR**. The speedup in the merged version is about 1.1%.https://raw.githubusercontent.com/facebookexperimental/free-threading-benchmarking/refs/heads/main/results/bm-20251112-3.15.0a1%2B-f8a764a-JIT/bm-20251112-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-f8a764a-vs-base.svgStats: 50% more uops executed, 30% more traces entered the last time we ran them. It also suggests our trace lengths for a real trace recording JIT are too short, as a lot of trace too long abortshttps://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251023-3.15.0a1%2B-eb73378-CLANG%2CJIT/bm-20251023-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-eb73378-pystats-vs-base.md .This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this workpythongh-140277The optimizer stack space check is disabled, as it's no longer valid to deal with underflow.Pros:* Ignoring the generated tracer code as it's automatically created, this is only additional 1k lines of code. The maintenance burden is handled by the DSL and code generator.* `optimizer.c` is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace.* The new JIT frontend is able to handle a lot more control-flow than the old one.* Tracing is very low overhead. We use the tail calling interpreter/computed goto interpreter to switch between tracing mode and non-tracing mode. I call this mechanism dual dispatch, as we have two dispatch tables dispatching to each other. Specialization is still enabled while tracing.* Better handling of polymorphism. We leverage the specializing interpreter for this.Cons:* (For now) requires tail calling interpreter or computed gotos. This means no Windows JIT for now :(. Not to fret, tail calling is coming soon to Windows thoughpython#139962Design:* After each instruction, the `record_previous_inst` function/label is executed. This does as the name suggests.* The tracing interpreter lowers bytecode to uops directly so that it can obtain "fresh" values at the point of lowering.* The tracing version behaves nearly identical to the normal interpreter, in fact it even has specialization! This allows it to run without much of a slowdown when tracing. The actual cost of tracing is only a function call and writes to memory.* The tracing interpreter uses the specializing interpreter's deopt to naturally form the side exit chains. This allows it to side exit chain effectively, without repeating much code. We force a re-specializing when tracing a deopt.* The tracing interpreter can even handle goto errors/exceptions, but I chose to disable them for now as it's not tested.* Because we do not share interpreter dispatch, there is should be no significant slowdown to the original specializing interpreter on tailcall and computed got with JIT disabled. With JIT enabled, there might be a slowdown in the form of the JIT trying to trace.* Things that could have dynamic instruction pointer effects are guarded on. The guard deopts to a new instruction --- `_DYNAMIC_EXIT`.

chris-eibl mentioned this pull request

Dec 11, 2025

GH-141362: Makeget_externals handle fetching platform-specific release artifacts#142405

Merged