Python Enhancement Proposals

Python »
PEP Index »
PEP 744

PEP 744 – JIT Compilation

Author:: Brandt Bucher <brandt at python.org>,Savannah Ostrowski <savannah at python.org>
Discussions-To:

Table of Contents

Abstract

Earlier this year, anexperimental “just-in-time” compiler was merged into CPython’smain development branch. While recent CPython releases have included othersubstantial internal changes, this addition represents a particularlysignificant departure from the way CPython has traditionally executed Pythoncode. As such, it deserves wider discussion.

This PEP aims to summarize the design decisions behind this addition, thecurrent state of the implementation, and future plans for making the JIT apermanent, non-experimental part of CPython. It doesnot seek to provide acomprehensive overview ofhow the JIT works, instead focusing on theparticular advantages and disadvantages of the chosen approach, as well asanswering many questions that have been asked about the JIT since itsintroduction.

Readers interested in learning more about the new JIT are encouraged to consultthe following resources:

Thepresentation which first introduced theJIT at the 2023 CPython Core Developer Sprint. It includes relevantbackground, a light technical introduction to the “copy-and-patch” techniqueused, and an open discussion of its design amongst the core developerspresent. Slides for this talk can be found onGitHub.
Theopen access paper originallydescribing copy-and-patch.
Theblog post by thepaper’s author detailing the implementation of a copy-and-patch JIT compilerfor Lua. While this is a great low-level explanation of the approach, notethat it also incorporates other techniques and makes implementation decisionsthat are not particularly relevant to CPython’s JIT.
Theimplementation itself.

Motivation

Until this point, CPython has always executed Python code by compiling it tobytecode, which is interpreted at runtime. This bytecode is a more-or-lessdirect translation of the source code: it is untyped, and largely unoptimized.

Since the Python 3.11 release, CPython has used a “specializing adaptiveinterpreter” (PEP 659), whichrewrites these bytecode instructions in-place with type-specialized versions as they run.This new interpreter delivers significant performance improvements, despite thefact that its optimization potential is limited by the boundaries of individualbytecode instructions. It also collects a wealth of new profiling information:the types flowing though a program, the memory layout of particular objects, andwhat paths through the program are being executed the most. In other words,what to optimize, andhow to optimize it.

Since the Python 3.12 release, CPython has generated this interpreter from aC-like domain-specific language (DSL). Inaddition to taming some of the complexity of the new adaptive interpreter, theDSL also allows CPython’s maintainers to avoid hand-writing tedious boilerplatecode in many parts of the interpreter, compiler, and standard library that mustbe kept in sync with the instruction definitions. This ability to generate largeamounts of runtime infrastructure from a single source of truth is not onlyconvenient for maintenance; it also unlocks many possibilities for expandingCPython’s execution in new ways. For instance, it makes it feasible toautomatically generate tables for translating a sequence of instructions into anequivalent sequence of smaller “micro-ops”, generate an optimizer for sequencesof these micro-ops, and even generate an entire second interpreter for executingthem.

In fact, since early in the Python 3.13 release cycle, all CPython builds haveincluded this exact micro-op translation, optimization, and execution machinery.However, it is disabled by default; the overhead of interpreting even optimizedtraces of micro-ops is just too large for most code. Heavier optimizationprobably won’t improve the situation much either, since any efficiency gainsmade by new optimizations will likely be offset by the interpretive overhead ofeven smaller, more complex micro-ops.

The most obvious strategy to overcome this new bottleneck is to staticallycompile these optimized traces. This presents opportunities to avoid severalsources of indirection and overhead introduced by interpretation. In particular,it allows the removal of dispatch overhead between micro-ops (by replacing ageneric interpreter with a straight-line sequence of hot code), instructiondecoding overhead for individual micro-ops (by “burning” the values or addressesof arguments, constants, and cached values directly into machine instructions),and memory traffic (by moving data off of heap-allocated Python frames and intophysical hardware registers).

Since much of this data varies even between identical runs of a program and theexisting optimization pipeline makes heavy use of runtime profiling information,it doesn’t make much sense to compile these traces ahead of time and would be asubstantial redesign of the existing specification and micro-op tracing infrastructurethat has already been implemented. As has been demonstrated for many other dynamiclanguages (and even Python itself), the most promisingapproach is to compile the optimized micro-ops “just in time” for execution.

Rationale

Despite their reputation, JIT compilers are not magic “go faster” machines.Developing and maintaining any sort of optimizing compiler for even a singleplatform, let alone all of CPython’s most popular supported platforms, is anincredibly complicated, expensive task. Using an existing compiler frameworklike LLVM can make this task simpler, but only at the cost of introducing heavyruntime dependencies and significantly higher JIT compilation overhead.

It’s clear that successfully compiling Python code at runtime requires not onlyhigh-quality Python-specific optimizations for the code being run,but alsoquick generation of efficient machine code for the optimized program. The Pythoncore development team has the necessary skills and experience for the former (amiddle-end tightly coupled to the interpreter), and copy-and-patch compilationprovides an attractive solution for the latter.

In a nutshell, copy-and-patch allows a high-quality template JIT compiler to begenerated from the same DSL used to generate the rest of the interpreter. For awidely-used, volunteer-driven project like CPython, this benefit cannot beoverstated: CPython’s maintainers, by merely editing the bytecode definitions,will also get the JIT backend updated “for free”, forall JIT-supportedplatforms, at once. This is equally true whether instructions are being added,modified, or removed.

Like the rest of the interpreter, the JIT compiler is generated at build time,and has no runtime dependencies. It supports a wide range of platforms (see theSupport section below), and has comparatively low maintenance burden. In all,the current implementation is made up of about 900 lines of build-time Pythoncode and 500 lines of runtime C code.

Specification

The JIT is currently not part of the default build configuration, and it islikely to remain that way for the foreseeable future (though official binariesmay include it). That said, the JIT will become non-experimental once all ofthe following conditions are met:

It provides a meaningful performance improvement for at least one popularplatform (realistically, on the order of 5%).
It can be built, distributed, and deployed with minimal disruption.
The Steering Council, upon request, has determined that it would provide morevalue to the community if enabled than if disabled (considering tradeoffssuch as maintenance burden, memory usage, or the feasibility of alternatedesigns).

These criteria should be considered a starting point, and may be expanded overtime. For example, discussion of this PEP may reveal that additionalrequirements (such as multiple committed maintainers, a security audit,documentation in the devguide, support for out-of-process debugging, or aruntime option to disable the JIT) should be added to this list.

Until the JIT is non-experimental, it shouldnot be used in production, andmay be broken or removed at any time without warning.

Once the JIT is no longer experimental, it should be treated in much the sameway as other build options such as--enable-optimizations or--with-lto.It may be a recommended (or even default) option for some platforms, and releasemanagersmay choose to enable it in official releases.

Support

The JIT has been developed for all ofPEP 11’s current tier one platforms,most of its tier two platforms, and one of its tier three platforms.Specifically, CPython’smain branch hasCIbuilding and testing the JIT for both release and debug builds on:

aarch64-apple-darwin/clang
aarch64-pc-windows/msvc[1]
aarch64-unknown-linux-gnu/clang[2]
aarch64-unknown-linux-gnu/gcc[2]
i686-pc-windows-msvc/msvc
x86_64-apple-darwin/clang
x86_64-pc-windows-msvc/msvc
x86_64-unknown-linux-gnu/clang
x86_64-unknown-linux-gnu/gcc

It’s worth noting that some platforms, even future tier one platforms, may nevergain JIT support. This can be for a variety of reasons, including insufficientLLVM support (powerpc64le-unknown-linux-gnu/gcc), inherent limitations ofthe platform (wasm32-unknown-wasi/clang), or lack of developer interest(x86_64-unknown-freebsd/clang).

Once JIT support for a platform is added (meaning, the JIT builds successfullywithout displaying warnings to the user), it should be treated in much the sameway asPEP 11 prescribes: it should have reliable CI/buildbots, and JITfailures on tier one and tier two platforms should block releases. Though it’snot necessary to updatePEP 11 to specify JIT support, it may be helpful todo so anyway. Otherwise, a list of supported platforms should be maintained inthe JIT’s README.

Since it should always be possible to build CPython without the JIT, removingJIT support for a platform shouldnot be considered a backwards-incompatiblechange. However, if it is reasonable to do so, the normal deprecation processshould be followed as outlined inPEP 387.

The JIT’s build-time dependencies may be changed between releases, withinreason.

Backwards Compatibility

Due to the fact that the current interpreter and the JIT backend are bothgenerated from the same specification, the behavior of Python code should becompletely unchanged. In practice, observable differences that have been foundand fixed during testing have tended to be bugs in the existing micro-optranslation and optimization stages, rather than bugs in the copy-and-patchstep.

Debugging

Tools that profile and debug Python code will continue to work fine. Thisincludes in-process tools that use Python-provided functionality (likesys.monitoring,sys.settrace, orsys.setprofile), as well asout-of-process tools that walk Python frames from the interpreter state.

However, it appears that profilers and debuggersfor C code are currentlyunable to trace back through JIT frames. Working with leaf frames is possible(this is how the JIT itself is debugged), though it is of limited utility due tothe absence of proper debugging information for JIT frames.

Since the code templates emitted by the JIT are compiled by Clang, itmay bepossible to allow JIT frames to be traced through by simply modifying thecompiler flags to use frame pointers more carefully. It may also be possible toharvest and emit the debugging information produced by Clang. Neither of theseideas have been explored very deeply.

While this is an issue thatshould be fixed, fixing it is not a particularlyhigh priority at this time. This is probably a problem best explored by somebodywith more domain expertise in collaboration with those maintaining the JIT, whohave little experience with the inner workings of these tools.

Security Implications

This JIT, like any JIT, produces large amounts of executable data at runtime.This introduces a potential new attack surface to CPython, since a maliciousactor capable of influencing the contents of this data is therefore capable ofexecuting arbitrary code. This is awell-known vulnerability of JITcompilers.

In order to mitigate this risk, the JIT has been written with best practices inmind. In particular, the data in question is not exposed by the JIT compiler toother parts of the program while it remains writable, and atno point is thedata bothwritableand executable.

The nature of template-based JITs also seriously limits the kinds of code thatcan be generated, further reducing the likelihood of a successful exploit. As anadditional precaution, the templates themselves are stored in static, read-onlymemory.

However, it would be naive to assume that no possible vulnerabilities exist inthe JIT, especially at this early stage. The author is not a security expert,but is available to join or work closely with the Python Security Response Teamto triage and fix security issues as they arise.

Apple Silicon

Though difficult to test without actually signing and packaging a macOS release,itappears that macOS releases shouldenable the JIT Entitlement for theHardened Runtime.

This shouldn’t makeinstalling Python any harder, but may add additional stepsfor release managers to perform.

How to Teach This

Choose the sections that best describe you:

If you are a Python programmer or end user…
- …nothing changes for you. Nobody should be distributing JIT-enabledCPython interpreters to you while it is still an experimental feature. Onceit is non-experimental, you will probably notice slightly better performanceand slightly higher memory usage. You shouldn’t be able to observe any otherchanges.
If you maintain third-party packages…
- …nothing changes for you. There are no API or ABI changes, and the JIT isnot exposed to third-party code. You shouldn’t need to change your CImatrix, and you shouldn’t be able to observe differences in the way yourpackages work when the JIT is enabled.
If you profile or debug Python code…
- …nothing changes for you. All Python profiling and tracing functionalityremains.
If you profile or debug C code…
- …currently, the ability to tracethrough JIT frames is limited. This maycause issues if you need to observe the entire C call stack, rather thanjust “leaf” frames. See theDebugging section above for more information.
If you compile your own Python interpreter….
- …if you don’t wish to build the JIT, you can simply ignore it. Otherwise,you will need toinstall a compatible version of LLVM, andpass the appropriate flag to the build scripts. Your build may take up to aminute longer. Note that the JIT shouldnot be distributed to end users orused in production while it is still in the experimental phase.
If you’re a maintainer of CPython (or a fork of CPython)…
- …and you change the bytecode definitions or the main interpreterloop…
  - …in general, the JIT shouldn’t be much of an inconvenience to you(depending on what you’re trying to do). The micro-op interpreter isn’tgoing anywhere, and still offers a debugging experience similar to whatthe main bytecode interpreter provides today. There is moderate likelihoodthat larger changes to the interpreter (such as adding new localvariables, changing error handling and deoptimization logic, or changingthe micro-op format) will require changes to the C template used togenerate the JIT, which is meant to mimic the main interpreter loop. Youmay also occasionally just get unlucky and break JIT code generation,which will require you to either modify the Python build scripts yourself,or solicit the help of somebody more familiar with them (see below).
- …and you work on the JIT itself…
  - …you hopefully already have a decent idea of what you’re gettingyourself into. You will be regularly modifying the Python build scripts,the C template used to generate the JIT, and the C code that actuallymakes up the runtime portion of the JIT. You will also be dealing withall sorts of crashes, stepping over machine code in a debugger, staring atCOFF/ELF/Mach-O dumps, developing on a wide range of platforms, andgenerally being the point of contact for the people changing the bytecodewhen CI starts failing on their PRs (see above). Ideally, you’re at leastfamiliar with assembly, have taken a couple of courses with “compilers”in their name, and have read a blog post or two about linkers.
- …and you maintain other parts of CPython…
  - …nothing changes for you. You shouldn’t need to develop locally with JITbuilds. If you choose to do so (for example, to help reproduce and triageJIT issues), your builds may take up to a minute longer each time therelevant files are modified.

Reference Implementation

Key parts of the implementation include:

Tools/jit/README.md: Instructions for how to build the JIT.
Python/jit.c: The entire runtime portion of the JIT compiler.
jit_stencils.h: An example of the JIT’s generated templates.
Tools/jit/template.c: The code which is compiled to produce the JIT’s templates.
Tools/jit/_targets.py: The code to compile and parse the templates at build time.

Rejected Ideas

Maintain it outside of CPython

While it isprobably possible to maintain the JIT outside of CPython, itsimplementation is tied tightly enough to the rest of the interpreter thatkeeping it up-to-date would probably be more difficult than actually developingthe JIT itself. Additionally, contributors working on the existing micro-opdefinitions and optimizations would need to modify and build two separateprojects to measure the effects of their changes under the JIT (whereas today,infrastructure exists to do this automatically for any proposed change).

Releases of the separate “JIT” project would probably also need to correspond tospecific CPython pre-releases and patch releases, depending on exactly whatchanges are present. Individual CPython commits between releases likely wouldn’thave corresponding JIT releases at all, further complicating debugging efforts(such as bisection to find breaking changes upstream).

Since the JIT is already quite stable, and the ultimate goal is for it to be anon-experimental part of CPython, keeping it inmain seems to be the bestpath forward. With that said, the relevant code is organized in such a way thatthe JIT can be easily “deleted” if it does not end up meeting its goals.

Turn it on by default

On the other hand, some have suggested that the JIT should be enabled by defaultin its current form.

Again, it is important to remember that a JIT is not a magic “go faster”machine; currently, the JIT is about as fast as the existing specializinginterpreter. This may sound underwhelming, but it is actually a fairlysignificant achievement, and it’s the main reason why this approach wasconsidered viable enough to be merged intomain for further development.

While the JIT provides significant gains over the existing micro-op interpreter,it isn’t yet a clear win when always enabled (especially considering itsincreased memory consumption and additional build-time dependencies). That’s thepurpose of this PEP: to clarify expectations about the objective criteria thatshould be met in order to “flip the switch”.

At least for now, having this inmain, but off by default, seems to be agood compromise between always turning it on and not having it available at all.

Support multiple compiler toolchains

Clang is specifically needed because it’s the only C compiler with support forguaranteed tail calls (musttail), which are required by CPython’scontinuation-passing-style approachto JIT compilation. Without it, the tail-recursive calls between templates couldresult in unbounded C stack growth (and eventual overflow).

Since LLVM also includes other functionalities required by the JIT build process(namely, utilities for object file parsing and disassembly), and additionaltoolchains introduce additional testing and maintenance burden, it’s convenientto only support one major version of one toolchain at this time.

Compile the base interpreter’s bytecode

Most of the prior art for copy-and-patch uses it as a fast baseline JIT, whereasCPython’s JIT is using the technique to compile optimized micro-op traces.

In practice, the new JIT currently sits somewhere between the “baseline” and“optimizing” compiler tiers of other dynamic language runtimes. This is becauseCPython uses its specializing adaptive interpreter to collect runtime profilinginformation, which is used to detect and optimize “hot” paths through the code.This step is carried out using self-modifying code, a technique which is muchmore difficult to implement with a JIT compiler.

While it’spossible to compile normal bytecode using copy-and-patch (in fact,early prototypes predated the micro-op interpreter and did exactly this), itjust doesn’t seem to provide enough optimization potential as the more granularmicro-op format.

Add GPU support

The JIT is currently CPU-only. It does not, for example, offload NumPy arraycomputations to CUDA GPUs, as JITs likeNumba do.

There is already a rich ecosystem of tools for accelerating these sorts ofspecialized tasks, and CPython’s JIT is not intended to replace them. Instead,it is meant to improve the performance of general-purpose Python code, which isless likely to benefit from deeper GPU integration.

Open Issues

Speed

Currently, the JIT isabout as fast as the existing specializing interpreteron most platforms. Improving this is obviously a top priority at this point,since providing a significant performance gain is the entire motivation forhaving a JIT at all. A number of proposed improvements are already underway, andthis ongoing work is being tracked inGH-115802.

Memory

Because it allocates additional memory for executable machine code, the JIT doesuse more memory than the existing interpreter at runtime. According to theofficial benchmarks, the JIT currently uses about10-20% more memory than thebase interpreter.The upper end of this range is due toaarch64-apple-darwin, which has largerpage sizes (and thus, a larger minimum allocation granularity).

However, these numbers should be taken with a grain of salt, as the benchmarksthemselves don’t actually have a very high baseline of memory usage. Since theyhave a higher ratio of code to data, the JIT’s memory overhead is morepronounced than it would be in a typical workload where memory pressure is morelikely to be a real concern.

Not much effort has been put into optimizing the JIT’s memory usage yet, sothese numbers likely represent a maximum that will be reduced over time.Improving this is a medium priority, and is being tracked inGH-116017. We may considerexposing configurable parameters for limiting memory consumption in thefuture, but no official APIs will be exposed until the JIT meets therequirements to be considered non-experimental.

Earlier versions of the JIT had a more complicated memory allocation schemewhich imposed a number of fragile limitations on the size and layout of theemitted code, and significantly bloated the memory footprint of Pythonexecutable. These issues are no longer present in the current design.

Dependencies

At the time of writing, the JIT has a build-time dependency on LLVM. LLVMis used to compile individual micro-op instructions into blobs of machine code,which are then linked together to form the JIT’s templates. These templates areused to build CPython itself. The JIT has no runtime dependency on LLVM and istherefore not at all exposed as a dependency to end users.

Building the JIT adds between 3 and 60 seconds to the build process, dependingon platform. It is only rebuilt whenever the generated files become out-of-date,so only those who are actively developing the main interpreter loop will berebuilding it with any frequency.

Unlike many other generated files in CPython, the JIT’s generated files are nottracked by Git. This is because they contain compiled binary code templatesspecific to not only the host platform, but also the current build configurationfor that platform. As such, hosting them would require a significant engineeringeffort in order to build and host dozens of large binary files for each committhat changes the generated code. While perhaps feasible, this is not a priority,since installing the required tools is not prohibitively difficult for mostpeople building CPython, and the build step is not particularly time-consuming.

Since some still remain interested in this possibility, discussion is beingtracked inGH-115869.

Footnotes

[1]

Due to lack of available hardware, the JIT is built, but nottested, for this platform.

[2] (1,2)

Due to lack of available hardware, the JIT is built usingcross-compilation and tested using hardware emulation for this platform. Sometests are skipped because emulation causes them to fail. However, the JIT hasbeen successfully built and tested for this platform on non-emulatedhardware.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universallicense, whichever is more permissive.

Source:https://github.com/python/peps/blob/main/peps/pep-0744.rst

Last modified:2025-02-01 07:28:42 GMT

Movatterモバイル変換

PEP 744 – JIT Compilation