PEP 703 – Making the Global Interpreter Lock Optional in CPython

Author:
Sam Gross <colesbury at gmail.com>
Sponsor:
Łukasz Langa <lukasz at python.org>
Discussions-To:
Discourse thread
Status:
Accepted
Type:
Standards Track
Created:
09-Jan-2023
Python-Version:
3.13
Post-History:
09-Jan-2023, 04-May-2023
Resolution:
24-Oct-2023

Note

The Steering Council accepts PEP 703, but with clear proviso: that the rollout be gradual and break as little as possible, and that we can roll back any changes that turn out to be too disruptive – which includes potentially rolling back all of PEP 703 entirely if necessary (however unlikely or undesirable we expect that to be).

Abstract

CPython’s global interpreter lock (“GIL”) prevents multiple threads from executing Python code at the same time. The GIL is an obstacle to using multi-core CPUs from Python efficiently. This PEP proposes adding a build configuration (--disable-gil) to CPython to let it run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe.

Motivation

The GIL is a major obstacle to concurrency. For scientific computing tasks, this lack of concurrency is often a bigger issue than speed of executing Python code, since most of the processor cycles are spent in optimized CPU or GPU kernels. The GIL introduces a global bottleneck that can prevent other threads from making progress if they call any Python code. There are existing ways to enable parallelism in CPython today, but those techniques come with significant limitations (see Alternatives).

This section focuses on the GIL’s impact on scientific computing, particularly AI/ML workloads, because that is the area with which this author has the most experience, but the GIL also affects other users of Python.

The GIL Makes Many Types of Parallelism Difficult to Express

Neural network-based AI models expose multiple opportunities for parallelism. For example, individual operations may be parallelized internally (“intra-operator”), multiple operations may be executed simultaneously (“inter-operator”), and requests (spanning multiple operations) may also be parallelized. Efficient execution requires exploiting multiple types of parallelism [1].

The GIL makes it difficult to express inter-operator parallelism, as well as some forms of request parallelism, efficiently in Python. In other programming languages, a system might use threads to run different parts of a neural network on separate CPU cores, but this is inefficient in Python due to the GIL. Similarly, latency-sensitive inference workloads frequently use threads to parallelize across requests, but face the same scaling bottlenecks in Python.

The challenges the GIL poses to exploiting parallelism in Python frequently come up in reinforcement learning. Heinrich Kuttler, author of the NetHack Learning Environment and Member of Technical Staff at Inflection AI, writes:

Recent breakthroughs in reinforcement learning, such as on Dota 2, StarCraft, and NetHack rely on running multiple environments (simulated games) in parallel using asynchronous actor-critic methods. Straightforward multithreaded implementations in Python don’t scale beyond more than a few parallel environments due to GIL contention. Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.

Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:

We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers.

Projects that involve interfacing with multiple hardware devices face similar challenges: efficient communication requires use of multiple CPU cores. The Dose-3D project aims to improve cancer radiotherapy with precise dose planning. It uses medical phantoms (stand-ins for human tissue) together with custom hardware and a server application written in Python. Paweł Jurgielewicz, lead software architect for the data acquisition system on the Dose-3D project, describes the scaling challenges posed by the GIL and how using a fork of Python without the GIL simplified the project:

In the Dose-3D project, the key challenge was to maintain a stable, non-trivial concurrent communication link with hardware units while utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we started with the multiprocessing package, but at some point, it became clear that most CPU time was consumed by the data transfers between the data processing stages, not by data processing itself. The CPython multithreading implementation based on GIL was a dead end too. When we found out about the “nogil” fork of Python it took a single person less than half a working day to adjust the codebase to use this fork and the results were astonishing. Now we can focus on data acquisition system development rather than fine-tuning data exchange algorithms.

Allen Goodman, author of CellProfiler and staff engineer at Prescient Design and Genentech, describes how the GIL makes biological methods research more difficult in Python:

Issues with Python’s global interpreter lock are a frequent source of frustration throughout biological methods research.

I wanted to better understand the current multithreading situation so I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs. HMMER is a relatively elementary bioinformatics method and newer methods have far bigger multi-threading demands.

Method researchers are begging to use Python (myself included), because of its ease of use, the Python ecosystem, and because “it’s what people know.” Many biologists only know a little bit of programming (and that’s almost always Python). Until Python’s multithreading situation is addressed, C and C++ will remain the lingua franca of the biological methods research community.

The GIL Affects Python Library Usability

The GIL is a CPython implementation detail that limits multithreaded parallelism, so it might seem unintuitive to think of it as a usability issue. However, library authors frequently care a great deal about performance and will design APIs that support working around the GIL. These workarounds frequently lead to APIs that are more difficult to use. Consequently, users of these APIs may experience the GIL as a usability issue and not just a performance issue.

For example, PyTorch exposes a multiprocessing-based API called DataLoader for building data input pipelines. It uses fork() on Linux because it is generally faster and uses less memory than spawn(), but this leads to additional challenges for users: creating a DataLoader after accessing a GPU can lead to confusing CUDA errors. Accessing GPUs within a DataLoader worker quickly leads to out-of-memory errors because processes do not share CUDA contexts (unlike threads within a process).

Olivier Grisel, scikit-learn developer and software engineer at Inria, describes how having to work around the GIL in scikit-learn related libraries leads to a more complex and confusing user experience:

Over the years, scikit-learn developers have maintained ancillary libraries such as joblib and loky to try to work around some of the limitations of multiprocessing: extra memory usage partially mitigated via semi-automated memory mapping of large data buffers, slow worker startup by transparently reusing a pool of long running workers, fork-safety problems of third-party native runtime libraries such as GNU OpenMP by never using the fork-only start-method, ability to perform parallel calls of interactively defined functions in notebooks and REPLs in a cross-platform manner via cloudpickle. Despite our efforts, this multiprocessing-based solution is still brittle, complex to maintain and confusing to data scientists with limited understanding of system-level constraints. Furthermore, there are still irreducible limitations such as the overhead caused by the pickle-based serialization/deserialization steps required for inter-process communication. A lot of this extra work and complexity would not be needed anymore if we could use threads without contention on multicore hosts (sometimes with 64 physical cores or more) to run data science pipelines that alternate between Python-level operations and calls to native libraries.

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy maintainer, describes how the GIL affects the user experience of NumPy and numeric Python libraries:

A key problem in NumPy and the stack of packages built around it is that NumPy is still (mostly) single-threaded — and that has shaped significant parts of the user experience and projects built around it. NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g., dask.array which wraps numpy.ndarray. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package, threadpoolctl. The main reason is that NumPy calls into BLAS for linear algebra — and those calls it has no control over, they do use all cores by default via either pthreads or OpenMP.

Coordinating on APIs and design decisions to control parallelism is still a major amount of work, and one of the harder challenges across the PyData ecosystem. It would have looked a lot different (better, easier) without a GIL.

GPU-Heavy Workloads Require Multi-Core Processing

Many high-performance computing (HPC) and AI workloads make heavy use of GPUs. These applications frequently require efficient multi-core CPU execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI), describes how the GIL makes multithreaded scaling inefficient even when the bulk of computation is performed outside of Python:

In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.

The use of many processes (instead of threads) makes common tasks more difficult. Zachary DeVito continues:

On three separate occasions in the past couple of months (reducing redundant compute in data loaders, writing model checkpoints asynchronously, and parallelizing compiler optimizations), I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component. For example, computer vision tasks typically require multiple “pre-processing” steps in the data input pipeline, like image decoding, cropping, and resizing. These tasks are commonly performed on the CPU and may use Python libraries like Pillow or Pillow-SIMD. It is necessary to run the data input pipeline on multiple CPU cores in order to keep the GPU “fed” with data.

The increase in GPU performance compared to individual CPU cores makes multi-core performance more important. It is progressively more difficult to keep the GPUs fully occupied. To do so requires efficient use of multiple CPU cores, especially on multi-GPU systems. For example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the GPUs “fed” with data.

The GIL Makes Deploying Python AI Models Difficult

Python is widely used to develop neural network-based AI models. In PyTorch, models are frequently deployed as part of multi-threaded, mostly C++, environments. Python is often viewed skeptically because the GIL can be a global bottleneck, preventing efficient scaling even though the vast majority of the computations occur “outside” of Python with the GIL released. The torchdeploy paper [2] shows experimental evidence for these scaling bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI models that avoid or work around the GIL, but they all come with substantial limitations. For example, TorchScript captures a representation of the model that can be executed from C++ without any Python dependencies, but it only supports a limited subset of Python and often requires rewriting some of the model’s code. The torch::deploy API allows multiple Python interpreters, each with its own GIL, in the same process (similar to PEP 684). However, torch::deploy has limited support for Python modules that use C-API extensions.

Motivation Summary

Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries as well, leading to library designs that are more difficult to use.

Specification

Build Configuration Changes

The global interpreter lock will remain the default for CPython builds and python.org downloads. A new build configuration flag, --disable-gil, will be added to the configure script that will build CPython with support for running without the global interpreter lock.

When built with --disable-gil, CPython will define the Py_GIL_DISABLED macro in Python/patchlevel.h. The ABI tag will include the letter “t” (for “threading”).

The --disable-gil builds of CPython will still support optionally running with the GIL enabled at runtime (see PYTHONGIL Environment Variable and Py_mod_gil Slot).
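As a rough illustration, extension code can use this macro to compile differently for the free-threaded build; the sketch below is illustrative, not code from this PEP:

#include "Python.h"

#ifdef Py_GIL_DISABLED
/* Built with --disable-gil: the GIL may be absent at runtime, so this
   extension must make its own global state thread-safe. */
#else
/* Default build: the GIL serializes access to Python objects. */
#endif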

Overview of CPython Changes

Removing the global interpreter lock requires substantial changes to CPython internals, but relatively few changes to the public Python and C APIs. This section describes the required changes to the CPython implementation followed by the proposed API changes.

The implementation changes can be grouped into the following four categories:

  • Reference counting
  • Memory management
  • Container thread-safety
  • Locking and atomic APIs

Reference Counting

Removing the GIL requires changes to CPython’s reference counting implementation to make it thread-safe. Furthermore, it needs to have low execution overhead and allow for efficient scaling with multiple threads. This PEP proposes a combination of three techniques to address these constraints. The first is a switch from plain non-atomic reference counting to biased reference counting, which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting. The other two techniques are immortalization and a limited form of deferred reference counting; they address some of the multi-threaded scalability issues with reference counting by avoiding some reference count modifications.

Biased reference counting (BRC) is a technique first described in 2018 by Jiho Choi, Thomas Shull, and Josep Torrellas [3]. It is based on the observation that most objects are only accessed by a single thread, even in multi-threaded programs. Each object is associated with an owning thread (the thread that created it). Reference counting operations from the owning thread use non-atomic instructions to modify a “local” reference count. Other threads use atomic instructions to modify a “shared” reference count. This design avoids many atomic read-modify-write operations that are expensive on contemporary processors.

The implementation of BRC proposed in this PEP largely matches the original description of biased reference counting, but differs in details like the size of reference counting fields and special bits in those fields. BRC requires storing three pieces of information in each object’s header: the “local” reference count, the “shared” reference count, and the identifier of the owning thread. The BRC paper packs these three things into a single 64-bit field. This PEP proposes using three separate fields in each object’s header to avoid potential issues due to reference count overflow. Additionally, the PEP supports a faster deallocation path that avoids an atomic operation in the common case.

The proposed PyObject struct (also called struct _object) is below:

struct _object {
    _PyObject_HEAD_EXTRA
    uintptr_t ob_tid;         // owning thread id (4-8 bytes)
    uint16_t __padding;       // reserved for future use (2 bytes)
    PyMutex ob_mutex;         // per-object mutex (1 byte)
    uint8_t ob_gc_bits;       // GC fields (1 byte)
    uint32_t ob_ref_local;    // local reference count (4 bytes)
    Py_ssize_t ob_ref_shared; // shared reference count and state bits (4-8 bytes)
    PyTypeObject *ob_type;
};

The ob_tid, ob_ref_local, and ob_ref_shared fields are used by the biased reference counting implementation. The ob_gc_bits field is used to store garbage collection flags that were previously stored in PyGC_Head (see Garbage Collection (Cycle Collection)). The ob_mutex field provides a per-object lock in a single byte.

Immortalization

Some objects, such as interned strings, small integers, statically allocated PyTypeObjects, and the True, False, and None objects stay alive for the lifetime of the program. These objects are marked as immortal by setting the local reference count field (ob_ref_local) to UINT32_MAX.

The Py_INCREF and Py_DECREF macros are no-ops for immortal objects. This avoids contention on the reference count fields of these objects when multiple threads access them concurrently.
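A minimal sketch of the immortality check implied by this marking (the helper name is hypothetical, and the proposed struct _object fields above are assumed):

// Assumes the proposed struct _object above; UINT32_MAX is from <stdint.h>.
static inline int is_immortal(PyObject *op)
{
    // Immortal objects have every bit of the local count set, which is
    // why the Py_INCREF pseudo-code later in this PEP can detect them by
    // the increment wrapping around to zero.
    return op->ob_ref_local == UINT32_MAX;
}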

This proposed immortalization scheme is very similar to PEP 683, adopted in Python 3.12, but with a slightly different bit representation in the reference count fields for immortal objects in order to work with biased reference counting and deferred reference counting. See also Why Not Use PEP 683 Immortalization?.

Biased Reference Counting

Biased reference counting has a fast-path for objects “owned” by the current thread and a slow-path for other objects. Ownership is indicated by the ob_tid field. Determining the thread id requires platform specific code [5]. A value of 0 in ob_tid indicates that the object is not owned by any thread.

The ob_ref_local field stores the local reference count and two flags. The two most significant bits are used to indicate the object is immortal or uses deferred reference counting (see Deferred Reference Counting).

The ob_ref_shared field stores the shared reference count. The two least significant bits are used to store the reference counting state. The shared reference count is therefore shifted left by two. The ob_ref_shared field uses the least significant bits because the shared reference count can be temporarily negative; increfs and decrefs may not be balanced between threads.

The possible reference counting states are listed below:

  • 0b00 - default
  • 0b01 - weakrefs
  • 0b10 - queued
  • 0b11 - merged

The states form a progression: during their lifecycle, objects may transition to any numerically higher state. Objects can only be deallocated from the “default” and “merged” states. Other states must transition to the “merged” state before deallocation. Transitioning states requires an atomic compare-and-swap on the ob_ref_shared field.
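The following self-contained sketch shows what such a transition might look like, assuming the two-bit state layout described above; the helper is illustrative and not CPython API:

#include <stdatomic.h>
#include <stdint.h>

// Attempt to advance the two low state bits from old_state to new_state,
// preserving the reference count stored in the upper bits. Returns 1 on
// success, 0 if the object already progressed past old_state.
static int try_transition_state(_Atomic intptr_t *ob_ref_shared,
                                int old_state, int new_state)
{
    intptr_t old = atomic_load(ob_ref_shared);
    while ((old & 3) == old_state) {
        intptr_t desired = (old & ~(intptr_t)3) | new_state;
        if (atomic_compare_exchange_weak(ob_ref_shared, &old, desired))
            return 1;  // CAS succeeded; the state advanced
        // CAS failed: 'old' now holds the current value; loop re-checks it
    }
    return 0;
}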

Default (0b00)

Objects are initially created in the default state. This is the only state that allows for the quick deallocation code path. Otherwise, the thread must merge the local and shared reference count fields, which requires an atomic compare-and-swap.

This quick deallocation code path would not be thread-safe with concurrent dereferencing of weakrefs, so the first time a weak reference is created, the object is transitioned to the “weakrefs” state if it is currently in the “default” state.

Similarly, the quick deallocation code path would not be thread-safe with the lockless list and dictionary accesses (see Optimistically Avoiding Locking), so the first time a non-owning thread attempts to retrieve an object in the “default” state it falls back to the slower locking code path and transitions the object to the “weakrefs” state.

Weakrefs (0b01)

Objects in weakref and higher states support dereferencing weakrefs as well as the lockless list and dictionary access by non-owning threads. They require transitioning to the merged state before deallocation, which is more expensive than the quick deallocation code path supported by the “default” state.

Queued (0b10)

The queued state indicates that a non-owning thread has requested that the reference count fields be merged. This can happen when the shared reference count becomes negative (due to an imbalance between increfs and decrefs between threads). The object is inserted into the owning thread’s queue of objects to be merged. The owning thread is notified via the eval_breaker mechanism. In practice, this operation is rare. Most objects are only accessed by a single thread and those objects accessed by multiple threads rarely have negative shared reference counts.

If the owning thread has terminated, the acting thread immediately merges the local and shared reference count fields and transitions to the merged state.

Merged (0b11)

The merged state indicates that the object is not owned by any thread. The ob_tid field is zero in this state and ob_ref_local is not used. Once the shared reference count reaches zero, the object can be deallocated from the merged state.

Reference counting pseudo-code

The proposed Py_INCREF and Py_DECREF operations should behave as follows (using C-like pseudo-code):

// low two bits of "ob_ref_shared" are used for flags
#define _Py_SHARED_SHIFT 2

void Py_INCREF(PyObject *op)
{
    uint32_t new_local = op->ob_ref_local + 1;
    if (new_local == 0)
        return;  // object is immortal
    if (op->ob_tid == _Py_ThreadId())
        op->ob_ref_local = new_local;
    else
        atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
}

void Py_DECREF(PyObject *op)
{
    if (op->ob_ref_local == _Py_IMMORTAL_REFCNT) {
        return;  // object is immortal
    }
    if (op->ob_tid == _Py_ThreadId()) {
        op->ob_ref_local -= 1;
        if (op->ob_ref_local == 0) {
            _Py_MergeZeroRefcount(op);  // merge refcount
        }
    }
    else {
        _Py_DecRefShared(op);  // slow path
    }
}

void _Py_MergeZeroRefcount(PyObject *op)
{
    if (op->ob_ref_shared == 0) {
        // quick deallocation code path (common case)
        op->ob_tid = 0;
        _Py_Dealloc(op);
    }
    else {
        // slower merging path not shown
    }
}

The reference implementation [17] contains implementations of _Py_MergeZeroRefcount and _Py_DecRefShared.

Note that the above is pseudocode: in practice, the implementation should use “relaxed atomics” to access ob_tid and ob_ref_local to avoid undefined behavior in C and C++.
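For illustration, the sketch below declares the two fields as C11 atomics and reads them with relaxed ordering; the standalone struct is an assumption for the example, not the proposed object layout:

#include <stdatomic.h>
#include <stdint.h>

// Fields declared _Atomic so concurrent reads are defined behavior;
// relaxed ordering suffices because only the owning thread performs
// non-atomic updates of its local count, and no cross-field ordering
// is required for the ownership check.
typedef struct {
    _Atomic uintptr_t ob_tid;
    _Atomic uint32_t ob_ref_local;
} refcount_fields;

static int owned_by_current_thread(refcount_fields *f, uintptr_t my_tid)
{
    uintptr_t tid = atomic_load_explicit(&f->ob_tid, memory_order_relaxed);
    return tid == my_tid;
}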

Deferred Reference Counting

A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently. These objects don’t necessarily live for the lifetime of the program, so immortalization is not a good fit. This PEP proposes a limited form of deferred reference counting to avoid contention on these objects’ reference count fields in multi-threaded programs.

Typically, the interpreter modifies objects’ reference counts as they are pushed to and popped from the interpreter’s stack. The interpreter skips these reference counting operations for objects that use deferred reference counting. Objects that support deferred reference counting are marked by setting the two most significant bits in the local reference count field to one.

Because some reference counting operations are skipped, the reference count fields no longer reflect the true number of references to these objects. The true reference count is the sum of the reference count fields plus any skipped references from each thread’s interpreter stack. The true reference count can only be safely computed when all threads are paused during cyclic garbage collection. Consequently, objects that use deferred reference counting can only be deallocated during garbage collection cycles.

Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.

Garbage Collector Modifications for Deferred Reference Counting

The tracing garbage collector finds and deallocates unreferenced objects. Currently, the tracing garbage collector only finds unreferenced objects that are part of a reference cycle. With deferred reference counting, the tracing garbage collector will also find and collect some unreferenced objects that may not be part of any reference cycle, but whose collection has been delayed due to deferred reference counting. This requires that all objects that support deferred reference counting also have a corresponding type object that supports tracing garbage collection (through the Py_TPFLAGS_HAVE_GC flag). Additionally, the garbage collector will need to traverse each thread’s stack to add references to the GC reference count at the start of each collection.

Reference Counting Type Objects

Type objects (PyTypeObject) use a mix of reference counting techniques. Statically allocated type objects are immortalized because the objects already live for the lifetime of the program. Heap type objects use deferred reference counting in combination with per-thread reference counting. Deferred reference counting is not sufficient to address the multi-threaded scaling bottlenecks with heap types because most references to heap types are from object instances, not references on the interpreter stack.

To address this, heap type reference counts are partially stored in a distributed manner in per-thread arrays. Every thread stores an array of local reference counts for each heap type object. Each heap type object is assigned a unique number that determines its position in the local reference count arrays. A heap type’s true reference count is the sum of its entries in the per-thread arrays, plus the reference count on the PyTypeObject, plus any deferred references in the interpreter stack.

Threads may grow their own type reference count arrays as needed when incrementing or decrementing the local reference count of a type object.
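A self-contained sketch of this grow-on-demand scheme (the data structure and function names are illustrative, not CPython’s):

#include <stdlib.h>
#include <string.h>

typedef struct {
    long *counts;   // counts[i] = this thread's references to heap type i
    size_t size;
} PerThreadTypeRefs;

// Increment this thread's local count for the heap type with the given
// unique number, growing the array on demand. No atomics are needed
// because the array is only written by its owning thread.
static void type_incref_local(PerThreadTypeRefs *t, size_t type_index)
{
    if (type_index >= t->size) {
        size_t new_size = (type_index + 1) * 2;
        long *grown = realloc(t->counts, new_size * sizeof(long));
        if (grown == NULL)
            abort();  // sketch only: real code would propagate the error
        memset(grown + t->size, 0, (new_size - t->size) * sizeof(long));
        t->counts = grown;
        t->size = new_size;
    }
    t->counts[type_index]++;
}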

Use of the per-thread reference count arrays is limited to a few places:

  • PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems): Increments the current thread’s local reference count for type, if it is a heap type.
  • subtype_dealloc(PyObject *self): Decrements the current thread’s local reference count for self->ob_type, if the type is a heap type.
  • gcmodule.c: Adds each thread’s local reference counts to the gc_refs count for the corresponding heap type object.

Additionally, when a thread terminates, it adds any non-zero local reference counts to each type object’s own reference count field.

Memory Management

CPython currently uses an internal allocator, pymalloc, which is optimized for small object allocation. The pymalloc implementation is not thread-safe without the GIL. This PEP proposes replacing pymalloc with mimalloc, a general-purpose thread-safe allocator with good performance, including for small allocations.

Using mimalloc, with some modifications, also addresses two other issues related to removing the GIL. First, traversing the internal mimalloc structures allows the garbage collector to find all Python objects without maintaining a linked list. This is described in more detail in the garbage collection section. Second, mimalloc heaps and allocations based on size class enable collections like dict to generally avoid acquiring locks during read-only operations. This is described in more detail in the collection thread-safety section.

CPython already requires that objects that support garbage collection use the GC allocator APIs (typically indirectly by calling PyType_GenericAlloc). This PEP would add additional requirements to the use of the Python allocator APIs. First, Python objects must be allocated through object allocation APIs, such as PyType_GenericAlloc, PyObject_Malloc, or other Python APIs that wrap those calls. Python objects should not be allocated through other APIs, such as raw calls to C’s malloc or the C++ new operator. Additionally, PyObject_Malloc should be used only for allocating Python objects; it should not be used for allocating buffers, storages, or other data structures that are not PyObjects.

This PEP also imposes restrictions on the pluggable allocator API (PyMem_SetAllocator). When compiling without the GIL, allocators set using this API must eventually delegate the allocation to the corresponding underlying allocator, such as PyObject_Malloc, for Python object allocations. This allows for allocators that “wrap” underlying allocators, such as Python’s tracemalloc and debug allocator, but not for wholly replacing the allocator.
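For example, a permitted “wrapping” allocator might look roughly like the following sketch, which uses the existing PyMem_GetAllocator and PyMem_SetAllocator APIs and delegates every call to the previously installed object allocator:

#include "Python.h"

static PyMemAllocatorEx base_alloc;  // the allocator being wrapped

static void *wrap_malloc(void *ctx, size_t size)
{
    // ... e.g., record statistics here ...
    return base_alloc.malloc(base_alloc.ctx, size);  // always delegate
}

static void *wrap_calloc(void *ctx, size_t nelem, size_t elsize)
{
    return base_alloc.calloc(base_alloc.ctx, nelem, elsize);
}

static void *wrap_realloc(void *ctx, void *ptr, size_t new_size)
{
    return base_alloc.realloc(base_alloc.ctx, ptr, new_size);
}

static void wrap_free(void *ctx, void *ptr)
{
    base_alloc.free(base_alloc.ctx, ptr);
}

void install_wrapper(void)
{
    PyMem_GetAllocator(PYMEM_DOMAIN_OBJ, &base_alloc);
    PyMemAllocatorEx wrapper = {NULL, wrap_malloc, wrap_calloc,
                                wrap_realloc, wrap_free};
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &wrapper);
}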

CPython Free Lists

CPython makes use of free lists to speed up the allocation of small, frequently allocated objects like tuples and numbers. These free lists are moved to PyThreadState from per-interpreter state.

Garbage Collection (Cycle Collection)

The CPython garbage collector requires the following changes to work with this proposal:

  • Use of “stop-the-world” to provide thread-safety guarantees that were previously provided by the GIL.
  • Elimination of generational garbage collection in favor of a non-generational collector.
  • Integration with deferred reference counting and biased reference counting.

Additionally, the above changes enable removing the _gc_prev and _gc_next fields from GC objects. The GC bits that stored the tracked, finalized, and unreachable states are moved to the ob_gc_bits field in the PyObject header.

Stop-the-World

The CPython cycle garbage collector currently relies on the global interpreter lock to prevent other threads from accessing Python objects while the collector finds cycles. The GIL is never released during the cycle-finding routine, so the collector can rely on stable (i.e., unchanging) reference counts and references for the duration of that routine. However, following cycle detection, the GIL may be temporarily released while calling objects’ finalizers and clear (tp_clear) functions, allowing other threads to run in an interleaved fashion.

When running without the GIL, the implementation needs a way to ensure that reference counts remain stable during cycle detection. Threads running Python code must be paused to ensure that references and reference counts remain stable. Once the cycles are identified, other threads are resumed.

The current CPython cyclic garbage collector involves two cycle-detection passes during each garbage collection cycle. Consequently, this requires two stop-the-world pauses when running the garbage collector without the GIL. The first cycle-detection pass identifies cyclic trash. The second pass runs after finalizers to identify which objects still remain unreachable. Note that other threads are resumed before finalizers and tp_clear functions are called to avoid introducing potential deadlocks that are not present in the current CPython behavior.

Thread States

To support pausing threads for garbage collection, the PyThreadState gets a new “status” field. Like the other fields in PyThreadState, the status field is not part of the public CPython API. The status field may be in one of three states:

  • ATTACHED
  • DETACHED
  • GC

The ATTACHED and DETACHED states correspond closely to acquiring and releasing the global interpreter lock. When compiling without the GIL, functions that previously acquired the GIL instead transition the thread state to ATTACHED, and functions that previously released the GIL transition the thread state to DETACHED. Just as threads previously needed to acquire the GIL before accessing or modifying Python objects, they now must be in the ATTACHED state before accessing or modifying Python objects. Since the same public C-API functions “attach” the thread as previously acquired the GIL (e.g., PyEval_RestoreThread), the requirements for thread initialization in extensions remain the same. The substantial difference is that multiple threads can be in the attached state simultaneously, while previously only one thread could acquire the GIL at a time.

During stop-the-world pauses, the thread performing garbage collection needs to ensure that no other thread is accessing or modifying Python objects. All other threads must be in the “GC” state. The garbage collection thread can transition other threads from the DETACHED state to the GC state using an atomic compare-and-swap operation on the status field. Threads in the ATTACHED state are requested to pause themselves and set their status to “GC”, using the existing “eval breaker” mechanism. At the end of the stop-the-world pause, all threads in the “GC” state are set to DETACHED and woken up if they are paused. Threads that were previously attached (i.e., executing Python bytecode) can re-attach (set their thread states to ATTACHED) and resume executing Python code. Threads that were previously DETACHED ignore the notification.
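The sketch below illustrates this handshake; the state names mirror those above, but the function and its surrounding machinery are illustrative, not CPython’s implementation:

#include <stdatomic.h>

enum thread_status { DETACHED, ATTACHED, GC };

// Called by the garbage collection thread for each other thread:
// DETACHED threads are flipped to GC directly with a compare-and-swap;
// ATTACHED threads must be asked, via the eval breaker, to park
// themselves in the GC state.
static void request_gc_pause(_Atomic int *status)
{
    int expected = DETACHED;
    if (atomic_compare_exchange_strong(status, &expected, GC))
        return;  // thread was detached; it cannot re-attach until GC ends
    // Otherwise the thread is ATTACHED: set its eval-breaker bit here and
    // wait for it to observe the request and set its own status to GC.
}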

Generations

The existing Python garbage collector uses three generations. When compiling without the GIL, the garbage collector will only use a single generation (i.e., it will be non-generational). The primary reason for this change is to reduce the impact of the stop-the-world pauses in multithreaded applications. Frequent stop-the-world pauses for collecting the young generation would have more of an impact on multi-threaded applications than less frequent collections.

Integration With Deferred and Biased Reference Counting

To find unreferenced objects, the cyclic garbage collector computes the difference between the number of incoming references and the object’s reference count. This difference is called gc_refs and is stored in the _gc_prev field. If gc_refs is greater than zero, then the object is guaranteed to be alive (i.e., not cyclic trash). If gc_refs is zero, then the object is only alive if it is transitively referenced by another live object. When computing this difference, the collector should traverse each thread’s stack, and for every deferred reference, increment the gc_refs for the referred object. Since generator objects also have stacks with deferred references, the same procedure is applied to each generator’s stack.

Python unit tests commonly use gc.collect() to ensure that any unreferenced objects are destructed and their finalizers run. Since biased reference counting can delay the destruction of some objects that are referenced by multiple threads, it’s convenient to ensure that those objects are destructed during garbage collection, even though they may not be part of any reference cycles. While other threads are paused, the garbage collector thread should merge the reference counts for any queued objects, but not call any destructors even if the combined reference count is zero. (Calling destructors while other threads are paused risks introducing deadlocks.) Once other threads are resumed, the GC thread should call _Py_Dealloc on those objects with a zero merged reference count.

Container Thread-Safety

In CPython, the global interpreter lock protects against corruption of internal interpreter states when multiple threads concurrently access or modify Python objects. For example, if multiple threads concurrently modify the same list, the GIL ensures that the length of the list (ob_size) accurately matches the number of elements, and that the reference counts of each element accurately reflect the number of references to those elements. Without the GIL — and absent other changes — concurrent modifications would corrupt those fields and likely lead to program crashes.

The GIL does not necessarily ensure that operations are atomic or remain correct when multiple operations occur concurrently. For example, list.extend(iterable) may not appear atomic if the iterable has an iterator implemented in Python (or releases the GIL internally). Similarly, list.remove(x) can remove the wrong object if it overlaps with another operation that modifies the list, depending on the implementation of the equality operator. Still, the GIL ensures that some operations are effectively atomic. For example, the constructor list(set) atomically copies the items of the set to a new list, and some code relies on that copy being atomic (i.e., having a snapshot of the items in the set). This PEP preserves that property.

This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated lightweight lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.

Per-object locks with critical sections provide weaker protections than the GIL. Because the GIL doesn’t necessarily ensure that concurrent operations are atomic or correct, the per-object locking scheme also cannot ensure that concurrent operations are atomic or correct. Instead, per-object locking aims for similar protections as the GIL, but with mutual exclusion limited to individual objects.

Most operations on an instance of a container type require locking that object. For example:

  • list.append, list.insert, list.repeat, PyList_SetItem
  • dict.__setitem__, PyDict_SetItem
  • list.clear, dict.clear
  • list.__repr__, dict.__repr__, etc.
  • list.extend(iterable)
  • setiter_iternext

Some operations operate directly on two container objects, with knowledge about both containers’ internal structure. For example, there are internal specializations of list.extend(iterable) for specific iterable types, like set. These operations need to lock both container objects because they access the internals of both objects simultaneously. Note that the generic implementation of list.extend only needs to lock one object (the list) because the other object is accessed indirectly through the thread-safe iterator API. Operations that lock two containers are:

  • list.extend(list), list.extend(set), list.extend(dictitems), and other specializations where the implementation is specialized for the argument type.
  • list.concat(list)
  • list.__eq__(list), dict.__eq__(dict)

Some simple operations can be implemented directly with atomic accesses and do not need locks because they only access a single field. These operations include:

  • len(list), i.e., list_length(PyListObject *a)
  • len(dict)
  • len(set)

A select few operations optimistically avoid locking to improve performance. These require special implementations and cooperation from the memory allocator:

  • list[idx] (list_subscript)
  • dict[key] (dict_subscript)
  • listiter_next, dictiter_iternextkey/value/item
  • list.contains

Borrowed References

Per-object locking provides many of the important protections that the GIL provides, but there are a few cases where it’s not sufficient. For example, code that relies on upgrading a borrowed reference to an “owned” reference may be unsafe in certain circumstances:

PyObject *item = PyList_GetItem(list, idx);
Py_INCREF(item);

The GIL ensures that no other thread can modify the list in between the access and the Py_INCREF call. Without the GIL – even with per-object locking – another thread might modify the list, leading to item being freed between the access and the Py_INCREF call.

The problematic borrowed reference APIs are supplemented with functions that return “new references” but are otherwise equivalent:

  • PyList_FetchItem(list, idx) for PyList_GetItem
  • PyDict_FetchItem(dict, key) for PyDict_GetItem
  • PyWeakref_FetchObject for PyWeakref_GetObject

Note that some APIs that return borrowed references, such as PyTuple_GetItem, are not problematic because tuples are immutable. Similarly, not all uses of the above APIs are problematic. For example, PyDict_GetItem is often used for parsing keyword argument dictionaries in function calls; those keyword argument dictionaries are effectively private (not accessible by other threads).
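Usage mirrors the borrowed-reference APIs except that the caller owns the returned reference; a sketch using the proposed PyList_FetchItem:

// Sketch: the returned reference is owned by the caller, so there is no
// window in which another thread can free the item between retrieval and
// the reference count increment.
PyObject *item = PyList_FetchItem(list, idx);  // new reference, or NULL
if (item == NULL) {
    // handle the error (e.g., index out of range)
}
// ... use item ...
Py_DECREF(item);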

Python Critical Sections

Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.

This PEP proposes a scheme called “Python critical sections” to implicitly release per-object locks to avoid deadlocks. To understand the scheme, we first introduce a general approach to avoid deadlocks, and then propose a refinement of that approach with better performance.

One way to avoid deadlocks is to allow threads to hold only the lock (or locks) for a single operation at a time (typically a single lock, but some operations involve two locks as described above). When a thread begins a nested operation it should suspend the locks for any outer operation: before beginning the nested operation, the locks for the outer operation are released and when the nested operation completes, the locks for the outer operation are reacquired.

Additionally, the locks for any active operation should be suspended around potentially blocking operations, such as I/O (i.e., operations that would have released the GIL). This is because the interaction between locks and blocking operations can lead to deadlocks in the same way as the interaction between multiple locks.

To improve performance, this PEP proposes a variation of the above scheme that still avoids deadlocks. Instead of immediately suspending locks any time a nested operation begins, locks are only suspended if the thread would block (i.e., would have released the GIL). This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks.

The proposed API for Python critical sections consists of the following four macros. These are intended to be public (usable by C-API extensions), but not part of the limited API:

  • Py_BEGIN_CRITICAL_SECTION(PyObject *op);: Begins a critical section by acquiring the mutex for the referenced object. If the object is already locked, then locks for any outstanding critical sections are released before this thread waits for the referenced object to be unlocked.
  • Py_END_CRITICAL_SECTION;: Ends the most recent operation, unlocking the mutex. The next most recent previous critical section (if any) is resumed if it is currently suspended.
  • Py_BEGIN_CRITICAL_SECTION2(PyObject *a, PyObject *b);: Begins a critical section by acquiring the mutexes for two objects. To ensure consistent lock ordering, the order of acquisition is determined by memory address (i.e., the mutex with the lower memory address is acquired first). If either mutex is already locked, then locks for any outstanding critical sections are released before this thread waits for the referenced objects to be unlocked.
  • Py_END_CRITICAL_SECTION2;: Behaves the same as Py_END_CRITICAL_SECTION but unlocks two objects.

Additionally, when a thread transitions from the ATTACHED state to the DETACHED state, it should suspend any active critical sections. When transitioning from DETACHED to ATTACHED, the most recent suspended critical section, if any, should be resumed.

Note that operations that lock two containers simultaneously need to use the Py_BEGIN_CRITICAL_SECTION2 macro. It is not sufficient to nest two calls to Py_BEGIN_CRITICAL_SECTION because the inner critical section may release the locks from the outer critical section.
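The sketch below shows hypothetical extension code using the proposed macros, locking one list for a compound update and two objects for an operation that touches both:

#include "Python.h"

void append_twice(PyObject *list, PyObject *item)
{
    // Lock a single container across a compound modification so both
    // appends appear together to other threads.
    Py_BEGIN_CRITICAL_SECTION(list);
    (void)PyList_Append(list, item);
    (void)PyList_Append(list, item);
    Py_END_CRITICAL_SECTION;
}

void merge_into(PyObject *dst, PyObject *src)
{
    // Lock both containers; the macro orders acquisition by address.
    Py_BEGIN_CRITICAL_SECTION2(dst, src);
    // ... operations that read src and modify dst ...
    Py_END_CRITICAL_SECTION2;
}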

Optimistically Avoiding Locking

A few operations on dict and list optimistically avoid acquiring the per-object locks. They have a fast path operation that does not acquire locks, but may fall back to a slower operation that acquires the dictionary’s or list’s lock when another thread is concurrently modifying that container.

The operations with an optimistic fast path are:

  • PyDict_FetchItem/GetItem and dict.__getitem__
  • PyList_FetchItem/GetItem and list.__getitem__

Additionally, iterators for dict and list use the above functions so they also optimistically avoid locking when returning the next item.

There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accesses to individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).

This section describes the challenges with implementing dictionary and list accesses without locking followed by a description of this PEP’s changes to the Python interpreter required to address those challenges.

The main challenge is that retrieving an item from a list or dictionary and incrementing the reference count of that item is not an atomic operation. In between the time the item is retrieved and the reference count is incremented, another thread may modify the list or dictionary, possibly freeing the memory for the previously retrieved item.

A partial attempt at addressing this issue would be to convert the reference count increment to a conditional increment, only incrementing the reference count if it’s not zero. This change is not sufficient because when a Python object’s reference count reaches zero, the object’s destructor is called and the memory storing the object may be re-used for other data structures or returned to the operating system. Instead, this PEP proposes a technique to ensure that the reference count fields remain valid for the duration of the access, so that the conditional reference count increment is safe. This technique requires cooperation from the memory allocator (mimalloc) as well as changes to the list and dictionary objects. The proposed technique is similar to read-copy update (RCU) [6], a synchronization mechanism widely used in the Linux kernel.

The current implementation of list_item (the C function implementing list.__getitem__) is the following:

Py_INCREF(a->ob_item[i]);
return a->ob_item[i];

The proposed implementation uses the conditional increment (_Py_TRY_INCREF) and has additional checks:

PyObject **ob_item = atomic_load(&a->ob_item);
PyObject *item = atomic_load(&ob_item[i]);
if (!item || !_Py_TRY_INCREF(item))
    goto retry;
if (item != atomic_load(&ob_item[i])) {
    Py_DECREF(item);
    goto retry;
}
if (ob_item != atomic_load(&a->ob_item)) {
    Py_DECREF(item);
    goto retry;
}
return item;

The “retry” subroutine implements the locked fallback path when concurrent modifications to the list cause the above fast, non-locking path to fail:

retry:
    PyObject *item;
    Py_BEGIN_CRITICAL_SECTION(a->ob_mutex);
    item = a->ob_item[i];
    Py_INCREF(item);
    Py_END_CRITICAL_SECTION(a->ob_mutex);
    return item;

The modifications to the dict implementation are similar, because the relevant parts of both list and dictionary retrieval involve loading an item/value from an array at a known index.

The additional checks following the conditional increment are necessary because the scheme allows immediate re-use of memory, including the memory that previously held a PyObject structure or list or dict array. Without these extra checks, the function might return a Python object that was never in the list, if the memory occupied by the Python object previously held a different PyObject whose memory previously stored an item in the list.

Mimalloc Changes for Optimistic list and dict Access

The implementation requires additional constraints to the memory allocator, including some changes to the mimalloc code. Some background on mimalloc’s implementation is helpful to understand the required changes. Individual allocations from mimalloc are called “blocks.” Mimalloc “pages” contain consecutive blocks that are all the same size. A mimalloc “page” is similar to a “superblock” in other allocators; it is NOT an operating system page. A mimalloc “heap” contains pages of various size classes; each page belongs to a single heap. If none of the blocks of a page are allocated, then mimalloc may re-use the page for a different size class or different heap (i.e., it might reinitialize the page).

The list and dictionary access scheme works by partially restricting re-use of mimalloc pages so that reference count fields remain valid for the duration of the access. The restricted re-use of mimalloc pages is enforced by having separate heaps for Python objects [7]. This ensures that even if an item is freed during access and the memory reused for a new object, the new object’s reference count field is placed at the same location in memory. The reference count field remains valid (or zero) across allocations.

Python objects that support Py_TPFLAGS_MANAGED_DICT have their dictionary and weak reference fields preceding the PyObject header, so their reference count fields are at a different offset from the start of their allocations. They are stored in a separate mimalloc heap. Additionally, non-GC objects are stored in their own heap so that the GC only has to look at GC objects. There are therefore three mimalloc heaps for Python objects: one for non-GC objects, one for GC objects with managed dictionaries, and one for GC objects without managed dictionaries.

Mimalloc Page Reuse

It is beneficial to keep the restrictions on mimalloc page reuse to a short period of time to avoid increasing overall memory usage. Precisely limiting the restrictions to list and dictionary accesses would minimize memory usage, but would require expensive synchronizations. At the other extreme, keeping the restrictions until the next GC cycle would avoid introducing any extra synchronizations, but would potentially increase memory usage.

This PEP proposes a system that lies between those two extremes based on FreeBSD’s “GUS” [8]. It uses a combination of global and per-thread counters (or “sequence numbers”) to coordinate the determination of when it is safe to reuse an empty mimalloc page for a different heap or for a different size class, or to return it to the operating system:

  • There is a global write sequence number that monotonically increases.
  • When a mimalloc page is empty, it’s tagged with the current write sequence number. The thread may also atomically increment the global write sequence number.
  • Each thread has a local read sequence number that records the most recent write sequence number it has observed.
  • Threads may observe the write sequence number whenever they are not in a list or dictionary access. The reference implementation does this in mimalloc’s slow-path allocation function. This is called regularly enough to be useful, but not so frequently as to introduce significant overhead.
  • There is a global read sequence number that stores the minimum of all active threads’ read sequence numbers. A thread may update the global read sequence number by scanning each thread’s local read sequence number. The reference implementation does this before allocating a fresh mimalloc page if there are restricted pages that could possibly be reused.
  • An empty mimalloc page may be reused for a different heap or size class when the global read sequence number is larger than the page’s tag number.

The condition that the global read sequence number is larger than the page’s tag is sufficient because it ensures that any thread that had a concurrent optimistic list or dictionary access is finished with that access. In other words, there are no threads accessing the empty blocks in the freed page, so the page can be used for any other purpose or even returned to the operating system.
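A sketch of the tagging and reuse check under this scheme (the global variables and helper names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t write_seq = 1;    // global, monotonically increasing
static _Atomic uint64_t global_read_seq;  // minimum of all threads' read numbers

// Tag a newly empty page with the current write sequence number.
static void tag_empty_page(uint64_t *page_tag)
{
    *page_tag = atomic_load(&write_seq);
    atomic_fetch_add(&write_seq, 1);  // the thread may also advance the counter
}

// The page may be reused for another heap or size class (or returned to
// the OS) only once every thread has observed a newer sequence number.
static bool page_may_be_reused(uint64_t page_tag)
{
    return atomic_load(&global_read_seq) > page_tag;
}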

Optimistic dict and list Access Summary

This PEP proposes a technique for thread-safe list and dictionary accesses that typically avoids acquiring locks. This reduces execution overhead and avoids some multi-threaded scaling bottlenecks in common operations, like calling functions and methods. The scheme works by placing temporary restrictions on mimalloc page reuse to ensure that objects’ reference count fields remain valid after objects are freed so that conditional reference count increment operations are safe. The restrictions are placed on mimalloc pages instead of on individual objects to improve opportunities for memory reuse. The restrictions are lifted as soon as the system can determine that there are no outstanding accesses involving the empty mimalloc page. To determine this, the system uses a combination of lightweight per-thread sequence counters and also tags pages when they are empty. Once each thread’s local counter is larger than the page’s tag, it can be reused for any purpose or returned to the operating system. The restrictions are also lifted whenever the cyclic garbage collector runs because the stop-the-world pause ensures that threads do not have any outstanding references to empty mimalloc pages.

Specializing Interpreter

The specializing interpreter requires some changes to be thread-safe when running without the GIL:

  • Concurrent specializations are prevented by using a mutex. This prevents multiple threads from writing to the same inline cache.
  • In multi-threaded programs running without the GIL, each bytecode is only specialized once. This prevents a thread from reading a partially written inline cache.
  • Locking also ensures that cached values of tp_version_tag and keys_version are consistent with the cached descriptors and other values.
  • Modifications to inline counters use “relaxed atomics”. In other words, some counter decrements may be missed or overwritten, but that does not affect correctness.

Py_mod_gil Slot

In --disable-gil builds, when loading an extension, CPython will check for a new PEP 489-style Py_mod_gil slot. If the slot is set to Py_mod_gil_not_used, then extension loading proceeds as normal. If the slot is not set, the interpreter pauses all threads and enables the GIL before continuing. Additionally, the interpreter will issue a visible warning naming the extension, stating that the GIL was enabled (and why) and describing the steps the user can take to override it.
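For example, an extension using PEP 489 multi-phase initialization could declare support for the free-threaded build as follows (the module name is hypothetical):

#include "Python.h"

static PyModuleDef_Slot mymod_slots[] = {
#ifdef Py_GIL_DISABLED
    // Declare that this extension is safe to run without the GIL.
    {Py_mod_gil, Py_mod_gil_not_used},
#endif
    {0, NULL}
};

static struct PyModuleDef mymod_def = {
    PyModuleDef_HEAD_INIT,
    .m_name = "mymod",
    .m_size = 0,
    .m_slots = mymod_slots,
};

PyMODINIT_FUNC
PyInit_mymod(void)
{
    return PyModuleDef_Init(&mymod_def);
}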

PYTHONGIL Environment Variable

In --disable-gil builds, the user can also override the behavior at runtime by setting the PYTHONGIL environment variable. Setting PYTHONGIL=0 forces the GIL to be disabled, overriding the module slot logic. Setting PYTHONGIL=1 forces the GIL to be enabled.

The PYTHONGIL=0 override is important because extensions that are not thread-safe can still be useful in multi-threaded applications. For example, one may want to use the extension from only a single thread or guard access by locks. For context, there are already some extensions that are not thread-safe even with the GIL, and users already have to take these sorts of steps.

The PYTHONGIL=1 override is sometimes useful for debugging.

Rationale

Non-Generational Garbage Collection

This PEP proposes switching from a generational cyclic garbage collector to a non-generational collector (when CPython is built without the GIL). That is equivalent to only having one generation (the “old” generation). There are two reasons for this proposed change.

Cyclic garbage collection, even for just the young generation, requires pausing other threads in the program. The author is concerned that frequent collections of the young generation would inhibit efficient scaling in multi-threaded programs. This is a concern for young generations (but not the old generation) because the young generations are collected after a fixed number of allocations, while the collections for the older generation are scheduled in proportion to the number of live objects in the heap. Additionally, it is difficult to efficiently keep track of objects in each generation without the GIL. For example, CPython currently uses a linked list of objects in each generation. If CPython were to keep that design, those lists would need to be made thread-safe, and it's not clear how to do that efficiently.

Generational garbage collection is used to good effect in many other language runtimes. For example, many of the Java HotSpot garbage collector implementations use multiple generations[11]. In these runtimes, a young generation is frequently a throughput win: since a large percentage of the young generation is typically “dead,” the GC is able to reclaim a large amount of memory relative to the amount of work performed. For example, several Java benchmarks show over 90% of “young” objects are typically collected[12][13]. This is commonly referred to as the “weak generational hypothesis”; the observation is that most objects die young. This pattern is reversed in CPython due to the use of reference counting. Although most objects still die young, they are collected when their reference counts reach zero. Objects that survive to a garbage collection cycle are most likely to remain alive[14]. This difference means that generational collection is much less effective in CPython than in many other language runtimes[15].

Optimistically Avoiding Locking in dict and list Accesses

This proposal relies on a scheme that mostly avoids acquiring locks when accessing individual elements in lists and dictionaries. Note that this is not “lock free” in the technical sense of “lock-free” and “wait-free” algorithms that guarantee forward progress. The scheme simply avoids acquiring locks (mutexes) in the common case to improve parallelism and reduce overhead.

A much simpler alternative would be to use reader-writer locks to protect dictionary and list accesses. Reader-writer locks allow concurrent reads, but not updates, which might seem ideal for lists and dictionaries. The problem is that reader-writer locks have substantial overhead and poor scalability, particularly when the critical sections are small, as they are for single-element dictionary and list accesses[9]. The poor reader scalability stems from the fact that readers must all update the same data structure, such as the number of readers in pthread_rwlocks.

The technique described in this PEP is related to RCU (“read-copy-update”)[6] and, to a lesser extent, hazard pointers, two well-known schemes for optimizing concurrent, read-mostly data structures. RCU is widely used in the Linux kernel to protect shared data structures in a scalable manner. Both the technique in this PEP and RCU work by deferring reclamation while readers may be accessing the concurrent data structure. RCU is most commonly used to protect individual objects (like hash tables or linked lists), while this PEP proposes a scheme to protect larger blocks of memory (mimalloc “pages”)[10].

The need for this scheme is largely due to the use of reference counting in CPython. If CPython only relied on a tracing garbage collector, then this scheme would probably not be necessary because tracing garbage collectors already defer reclamation in the required manner. This would not “solve” scaling issues, but would shift many of the challenges to the garbage collector implementation.

Backwards Compatibility

This PEP poses a number of backwards compatibility issues when building CPython with the --disable-gil flag, but those issues do not occur when using the default build configuration. Nearly all the backwards compatibility concerns involve the C-API:

  • CPython builds without the GIL will not be ABI compatible with the standard CPython build or with the stable ABI due to changes to the Python object header needed to support biased reference counting. C-API extensions will need to be rebuilt specifically for this version.
  • C-API extensions that rely on the GIL to protect global state or object state in C code will need additional explicit locking to remain thread-safe when run without the GIL.
  • C-API extensions that use borrowed references in ways that are not safe without the GIL will need to use the equivalent new APIs that return non-borrowed references. Note that only some uses of borrowed references are a concern; only references to objects that might be freed by other threads pose an issue.
  • Custom memory allocators (PyMem_SetAllocator) are required to delegate the actual allocation to the previously set allocator (see the sketch after this list). For example, the Python debug allocator and tracing allocators will continue to work because they delegate the allocation to the underlying allocator. On the other hand, wholesale replacing of the allocator (e.g., with jemalloc or tcmalloc) will not work correctly.
  • Python objects must be allocated through the standard APIs, such as PyType_GenericNew or PyObject_Malloc. Non-Python objects must not be allocated through those APIs. For example, it is currently acceptable to allocate buffers (non-Python objects) through PyObject_Malloc; that will no longer be allowed and buffers should instead be allocated through PyMem_Malloc, PyMem_RawMalloc, or malloc.
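
As a sketch of the delegation requirement from the allocator bullet above: a custom allocator saves the previously installed allocator with PyMem_GetAllocator and forwards to it rather than calling malloc directly. The wrapper names are hypothetical; PyMemAllocatorEx, PyMem_GetAllocator, and PyMem_SetAllocator are existing CPython APIs:

    #include <Python.h>

    static PyMemAllocatorEx prev;   /* the previously installed allocator */

    static void *
    my_malloc(void *ctx, size_t size)
    {
        /* e.g., record statistics here, then delegate */
        return prev.malloc(prev.ctx, size);
    }

    static void *
    my_calloc(void *ctx, size_t nelem, size_t elsize)
    {
        return prev.calloc(prev.ctx, nelem, elsize);
    }

    static void *
    my_realloc(void *ctx, void *ptr, size_t new_size)
    {
        return prev.realloc(prev.ctx, ptr, new_size);
    }

    static void
    my_free(void *ctx, void *ptr)
    {
        prev.free(prev.ctx, ptr);
    }

    static void
    install_delegating_allocator(void)
    {
        PyMemAllocatorEx alloc = {NULL, my_malloc, my_calloc,
                                  my_realloc, my_free};
        PyMem_GetAllocator(PYMEM_DOMAIN_OBJ, &prev);
        PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);
    }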

There are fewer potential backwards compatibility issues for Pythoncode:

  • Destructors and weak reference callbacks for code objects and top-level function objects are delayed until the next cyclic garbage collection due to the use of deferred reference counting.
  • Destructors for some objects accessed by multiple threads may be delayed slightly due to biased reference counting. This is rare: most objects, even those accessed by multiple threads, are destroyed immediately as soon as their reference counts are zero. Two places in the Python standard library tests required gc.collect() calls to continue to pass.

Distribution

This PEP poses new challenges for distributing Python. At least for some time, there will be two versions of Python requiring separately compiled C-API extensions. It may take some time for C-API extension authors to build --disable-gil compatible packages and upload them to PyPI. Additionally, some authors may be hesitant to support the --disable-gil mode until it has wide adoption, but adoption will likely depend on the availability of Python's rich set of extensions.

To mitigate this, the author will work with Anaconda to distribute a --disable-gil version of Python together with compatible packages from conda channels. This centralizes the challenges of building extensions, and the author believes this will enable more people to use Python without the GIL sooner than they would otherwise be able to.

Performance

The changes to make CPython thread-safe without the GIL increase execution overhead for --disable-gil builds. The performance impact is different for programs that use only a single thread compared to programs that use multiple threads, so the table below reports execution overhead separately for these two types of programs.

Execution Overhead on pyperformance 1.0.6

                       Intel Skylake    AMD Zen 3
  One thread           6%               5%
  Multiple threads     8%               7%

The baseline used to measure overhead is 018be4c from PR 19474, which implements immortal objects for Python 3.12. The largest contribution to execution overhead is biased reference counting, followed by per-object locking. For thread-safety reasons, an application running with multiple threads will only specialize a given bytecode once; this is why the overhead for programs that use multiple threads is larger compared to programs that only use one thread. However, with the GIL disabled, programs that use multiple threads should also be able to more effectively use multiple CPU cores.

Note that this PEP would not affect the performance of the default (non --disable-gil) builds of CPython.

Build Bots

The stable build bots will also include --disable-gil builds.

How to Teach This

As part of implementing the --disable-gil mode, the author will write a “HOWTO” guide[18] for making packages compatible when running Python without the GIL.

Reference Implementation

There are two GitHub repositories implementing versions of CPython without the GIL:

  • https://github.com/colesbury/nogil-3.12[17]
  • https://github.com/colesbury/nogil[16]

The nogil-3.12 repository is based on Python 3.12.0a4. It is useful for evaluating single-threaded execution overhead and as a reference implementation for this PEP. It is less useful for evaluating C-API extension compatibility because many extensions are not currently compatible with Python 3.12. Due to limited time for the 3.12 port, the nogil-3.12 implementation does not skip all deferred reference counts. As a temporary workaround, the implementation immortalizes objects that use deferred reference counting in programs that spawn multiple threads.

The nogil repository is based on Python 3.9.10. It is useful for evaluating multi-threading scaling in real world applications and extension compatibility. It is more stable and well tested than the nogil-3.12 repository.

Alternatives

Python currently supports a number of ways to enable parallelism, but the existing techniques come with significant limitations.

Multiprocessing

The multiprocessing library allows Python programs to start and communicate with Python subprocesses. This allows for parallelism because each subprocess has its own Python interpreter (i.e., there's one GIL per process). Multiprocessing has a few substantial limitations. Communication between processes is limited: objects generally need to be serialized or copied to shared memory. This introduces overhead (due to serialization) and complicates building APIs on top of multiprocessing. Starting a subprocess is also more expensive than starting a thread, especially with the “spawn” implementation. Starting a thread takes ~100 µs, while spawning a subprocess takes ~50 ms (50,000 µs) due to Python re-initialization.

Finally, many C and C++ libraries support access from multiple threads but do not support access or use across multiple processes.

Releasing the GIL in C-API Extensions

C-API extensions can release the GIL around long-running functions. This allows for some degree of parallelism, since multiple threads can run concurrently when the GIL is released, but the overhead of acquiring and releasing the GIL typically prevents this from scaling efficiently beyond a few threads. Many scientific computing libraries release the GIL in computationally heavy functions, and the CPython standard library releases the GIL around blocking I/O.
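
For example, a C-API function typically brackets its long-running, GIL-independent work with the existing Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros; expensive_kernel below is a hypothetical pure-C computation:

    #include <Python.h>

    extern double expensive_kernel(void);   /* hypothetical C computation */

    static PyObject *
    heavy_compute(PyObject *self, PyObject *args)
    {
        double result;
        Py_BEGIN_ALLOW_THREADS
        /* The GIL is released here; no Python C-API calls are allowed
         * until it is re-acquired below. Other threads may run. */
        result = expensive_kernel();
        Py_END_ALLOW_THREADS
        return PyFloat_FromDouble(result);
    }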

Internal Parallelization

Functions implemented in C may use multiple threads internally. For example, Intel's NumPy distribution, PyTorch, and TensorFlow all use this technique to internally parallelize individual operations. This works well when the basic operations are large enough to be parallelized efficiently, but not when there are many small operations or when the operations depend on some Python code. Calling into Python from C requires acquiring the GIL – even short snippets of Python code can inhibit scaling.

Related Work

Per-Interpreter GIL

The recently accepted PEP 684 proposes a per-interpreter GIL to address multi-core parallelism. This would allow parallelism between interpreters in the same process, but places substantial restrictions on sharing Python data between interpreters. Both this PEP and PEP 684 address multi-core parallelism, but with different tradeoffs and techniques. It is feasible to implement both PEPs in CPython at the same time.

Gilectomy

Gilectomy[20] was a project by Larry Hastings to remove the GIL in CPython. Like the design proposed by this PEP, the Gilectomy supported multiple threads running in parallel within the same interpreter (i.e., “free-threading”) and made use of fine-grained locking. The reference implementation in this PEP improves on single-threaded performance and scalability compared to the Gilectomy.

PyParallel

PyParallel[21] was a proof-of-concept fork of Python 3.3 by Trent Nelson that supported multiple threads running simultaneously in a single Python process. The fork introduced the concept of “parallel threads” – threads that can run simultaneously while the main Python thread is suspended. Parallel threads had read-only access to objects created by the main thread. Objects created within parallel threads lived for the lifetime of the creating thread. For HTTP servers, this might correspond to the lifetime of a request.

python-safethread

The python-safethread[22] project was a patch to Python 3.0 by Adam Olsen to remove the GIL. Some aspects of the project are similar to the design proposed by this PEP. Both use fine-grained locking and optimize reference counting for cases where the object is created and accessed by the same thread.

Greg Stein’s Free-Threading Patch

In 1996, Greg Stein published a patch against Python 1.4 that removed the GIL[23]. The patch used atomic reference counting on Windows and a global reference count lock on Linux. List and dictionary accesses were protected by mutexes. Parts of the patch were adopted in CPython. In particular, the patch introduced a PyThreadState structure and correct per-thread exception handling.

Dave Beazley revisited the patch in a 2011 blog post[24].

Jython and IronPython

Some alternative Python implementations like Jython[25] and IronPython[26] do not have a global interpreter lock. However, they do not support CPython extensions. (The implementations can interface with code written in Java or C#.)

PyPy-STM

The pypy-stm[27] interpreter is a variant of PyPy that uses software transactional memory. The authors report single-threaded performance overhead in the 20%-50% range compared to PyPy. It is not compatible with CPython extensions.

Rejected Ideas

Why Not Use a Concurrent Garbage Collector?

Many recent garbage collectors are mostly concurrent – they avoid long stop-the-world pauses by allowing the garbage collector to run concurrently with the application. So why not use a concurrent collector?

Concurrent collection requires write barriers (or read barriers). The author is not aware of a way to add write barriers to CPython without substantially breaking the C-API.

Why Not Deprecate PyDict_GetItem in Favor of PyDict_FetchItem?

This PEP proposes a new API PyDict_FetchItem which behaves like PyDict_GetItem, but returns a new reference instead of a borrowed reference. As described in Borrowed References, some uses of borrowed references that were safe when running with the GIL are unsafe when running without the GIL and need to be replaced by functions like PyDict_FetchItem that return new references.
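
The contrast between the two calling conventions might look like the following sketch. PyDict_GetItem is the existing API; PyDict_FetchItem is the API proposed by this PEP, so the code is illustrative rather than compilable against current CPython:

    #include <Python.h>

    static void
    lookup_examples(PyObject *kwargs, PyObject *shared, PyObject *key)
    {
        /* Borrowed reference: no Py_DECREF needed, but only safe here
         * because `kwargs` is visible to a single thread, so no other
         * thread can free the value while it is in use. */
        PyObject *v = PyDict_GetItem(kwargs, key);
        if (v != NULL) {
            /* ... use v without owning a reference ... */
        }

        /* New reference (proposed): safe even if another thread
         * concurrently replaces or deletes the item in `shared`;
         * the caller owns the reference and must release it. */
        PyObject *w = PyDict_FetchItem(shared, key);
        if (w != NULL) {
            /* ... use w ... */
            Py_DECREF(w);
        }
    }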

This PEP does not propose deprecating PyDict_GetItem and similar functions that return borrowed references for a few reasons:

  • Many of the uses of borrowed references are safe, even when running without the GIL. For example, C API functions often use PyDict_GetItem to retrieve items from the keyword argument dictionary. These calls are safe because the keyword argument dictionary is only visible to a single thread.
  • I tried this approach early on and found that wholesale replacing of PyDict_GetItem with PyDict_FetchItem frequently introduced new reference counting bugs. In my opinion, the risk of introducing new reference counting bugs generally outweighs the risks of missing a PyDict_GetItem call that is unsafe without the GIL.

Why Not Use PEP 683 Immortalization?

Like PEP 683, this PEP proposes an immortalization scheme for Python objects, but the PEPs use different bit representations to mark immortal objects. The schemes cannot be identical because this PEP depends on biased reference counting, which has two reference count fields instead of one.

Open Issues

Improved Specialization

The Python 3.11 release introduced quickening and specialization as part of the faster CPython project, substantially improving performance. Specialization replaces slow bytecode instructions with faster variants[19]. To maintain thread-safety, applications that use multiple threads (and run without the GIL) will only specialize each bytecode once, which can lower performance on some programs. It is possible to support specializing multiple times, but that requires more investigation and is not part of this PEP.

Python Build Modes

This PEP introduces a new build mode (--disable-gil) that is not ABI compatible with the standard build mode. The additional build mode adds complexity for both Python core developers and extension developers. The author believes a worthwhile goal is to combine these build modes and have the global interpreter lock controlled at runtime, possibly disabled by default. The path to this goal remains an open issue, but a possible path might look like the following:

  1. In 2024, CPython 3.13 is released with support for a --disable-gil build time flag. There are two ABIs for CPython, one with the GIL and one without. Extension authors target both ABIs.
  2. After 2–3 releases (i.e., in 2026–2027), CPython is released with the GIL controlled by a runtime environment variable or flag. The GIL is enabled by default. There is only a single ABI.
  3. After another 2–3 releases (i.e., 2028–2030), CPython switches to the GIL being disabled by default. The GIL can still be enabled at runtime via an environment variable or command line flag.

This PEP covers the first step, with the remaining steps left as open issues. In this scenario, there would be a two- to three-year period where extension authors would target an extra CPython build per supported CPU architecture and OS.

Integration

The reference implementation changes approximately 15,000 lines of code in CPython and includes mimalloc, which is also approximately 15,000 lines of code. Most changes are not performance sensitive and can be included in both --disable-gil and the default builds. Some macros, like Py_BEGIN_CRITICAL_SECTION, will be no-ops in the default build. The author does not expect a huge number of #ifdef statements to support the --disable-gil builds.

Mitigations for Single-Threaded Performance

The changes proposed in the PEP will increase execution overhead for --disable-gil builds compared to Python builds with the GIL. In other words, it will have slower single-threaded performance. There are some possible optimizations to reduce execution overhead, especially for --disable-gil builds that only use a single thread. These may be worthwhile if a longer term goal is to have a single build mode, but the choice of optimizations and their trade-offs remain an open issue.

References

[1]
“Exploiting Parallelism Opportunities with Deep Learning Frameworks.” Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, David Brooks. 2019. https://arxiv.org/abs/1908.04705.
[2]
“Using Python for Model Inference in Deep Learning.” Zachary DeVito, Jason Ansel, Will Constable, Michael Suo, Ailing Zhang, Kim Hazelwood. 2021. https://arxiv.org/abs/2104.00254. See Figure 5.
[3]
“Biased reference counting: minimizing atomic operations in garbage collection.” Jiho Choi, Thomas Shull, and Josep Torrellas. PACT 2018. https://dl.acm.org/doi/abs/10.1145/3243176.3243195.
[4]
PEP 683 – Immortal Objects, Using a Fixed Refcount.
[5]
https://github.com/colesbury/nogil/blob/f7e45d6bfbbd48c8d5cf851c116b73b85add9fc6/Include/object.h#L428-L455.
[6] (1,2)
“What is RCU, Fundamentally?” Paul E. McKenney, Jonathan Walpole. 2017. https://lwn.net/Articles/262464/
[7]
There are two heaps for Python objects because PyObjects that support cyclic garbage collection have extra fields preceding the PyObject struct.
[8]
“Global Unbounded Sequences (GUS).” https://github.com/freebsd/freebsd-src/blob/9408f36627b74a472dc82f7a43320235c0c9055a/sys/kern/subr_smr.c#L44. See also https://people.kernel.org/joelfernandes/gus-vs-rcu.
[9]
“Is Parallel Programming Hard, And, If So, What Can You Do About It?” Paul E. McKenney. 2022. https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html.
[10]
SLAB_TYPESAFE_BY_RCU is an example in which RCU protects blocks of memory and not any individual object. See https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#analogy-with-reference-counting.
[11]
“HotSpot Virtual Machine Garbage Collection Tuning Guide.” https://docs.oracle.com/en/java/javase/12/gctuning/hotspot-virtual-machine-garbage-collection-tuning-guide.pdf. Most of the HotSpot garbage collectors are generational, with the notable exception of ZGC, although there is ongoing work to make that generational.
[12]
“The DaCapo Benchmarks: Java Benchmarking Development and Analysis.” See column “Nursery Survival” in Table 4.
[13]
“Exploiting memory usage patterns to improve garbage collections in Java.” https://dl.acm.org/doi/abs/10.1145/1852761.1852768.
[14]
“most things usually turn out to be reachable.” https://github.com/python/cpython/blob/cd6655a8589e99ae4088b3bed4a692a19ed48779/Modules/gcmodule.c#L1106.
[15]
The Go team observed something similar in Go, but due to escape analysis and pass-by-value instead of reference counting. Recent versions of Go use a non-generational garbage collector. https://go.dev/blog/ismmkeynote.
[16]
https://github.com/colesbury/nogil.
[17]
https://github.com/colesbury/nogil-3.12.
[18]
Python HOWTOs. https://docs.python.org/3/howto/index.html.
[19]
PEP 659 – Specializing Adaptive Interpreter.
[20]
Gilectomy. Larry Hastings. 2016. https://github.com/larryhastings/gilectomy/tree/gilectomy.
[21]
PyParallel. Trent Nelson. 2016. http://pyparallel.org/.
[22]
python-safethread. Adam Olsen. 2008. https://launchpad.net/python-safethread.
[23]
https://www.python.org/ftp/python/contrib-09-Dec-1999/System/threading.tar.gz.
[24]
“An Inside Look at the GIL Removal Patch of Lore.” David Beazley. 2011. https://dabeaz.blogspot.com/2011/08/inside-look-at-gil-removal-patch-of.html.
[25]
Jython. https://www.jython.org/
[26]
IronPython. https://ironpython.net/
[27]
PyPy: Software Transactional Memory. https://doc.pypy.org/en/latest/stm.html

Acknowledgments

Thanks to Hugh Leather, Łukasz Langa, and Eric Snow for providing feedback on drafts of this PEP.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

