This PEP describes how Python programs may behave in the presence of concurrent reads and writes to shared variables from multiple threads. We use a happens-before relation to define when variable accesses are ordered or concurrent. Nearly all programs should simply use locks to guard their shared variables, and this PEP highlights some of the strange things that can happen when they don't, but programmers often assume that it's ok to do "simple" things without locking, and it's somewhat unpythonic to let the language surprise them. Unfortunately, avoiding surprise often conflicts with making Python run quickly, so this PEP tries to find a good tradeoff between the two.
So far, we have 4 major Python implementations – CPython, Jython, IronPython, and PyPy – as well as lots of minor ones. Some of these already run on platforms that do aggressive optimizations. In general, these optimizations are invisible within a single thread of execution, but they can be visible to other threads executing concurrently. CPython currently uses a GIL to ensure that other threads see the results they expect, but this limits it to a single processor. Jython and IronPython run on Java's and .NET's threading systems, respectively, which allows them to take advantage of more cores but can also show surprising values to other threads.
So that threaded Python programs continue to be portable between implementations, implementers and library authors need to agree on some ground rules.
A variable is a name that refers to an object; variables are generally introduced by assignment and may be destroyed by passing them to del. Variables are fundamentally mutable, while objects may not be. There are several varieties of variables: module variables (often called "globals" when accessed from within the module), class variables, instance variables (also known as fields), and local variables. All of these can be shared between threads (the local variables if they're saved into a closure). The object in which the variables are scoped notionally has a dict whose keys are the variables' names.

Before talking about the details of data races and the surprising behaviors they produce, I'll present two simple memory models. The first is probably too strong for Python, and the second is probably too weak.
In a sequentially-consistent concurrent execution, actions appear to happen in a global total order with each read of a particular variable seeing the value written by the last write that affected that variable. The total order for actions must be consistent with the program order. A program has a data race on a given input when one of its sequentially consistent executions puts two conflicting actions next to each other.
This is the easiest memory model for humans to understand, although it doesn't eliminate all confusion, since operations can be split in odd places.
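For example, here is a minimal sketch (the function names writer and reader are purely illustrative) of a program with a data race on a shared module variable x:

    import threading

    x = 0  # shared module variable, not guarded by any lock

    def writer():
        global x
        x = 1          # write to x

    def reader():
        r1 = x         # read of x; conflicts with the write in writer()

    t1 = threading.Thread(target=writer)
    t2 = threading.Thread(target=reader)
    t1.start(); t2.start()
    t1.join(); t2.join()

In some sequentially consistent executions the read and the write land next to each other, so this program has a data race on x, even though every execution is still a simple interleaving of the two threads.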
The program contains a collection of synchronization actions, which in Python currently include lock acquires and releases and thread starts and joins. Synchronization actions happen in a global total order that is consistent with the program order (they don't have to happen in a total order, but it simplifies the description of the model). A lock release synchronizes with all later acquires of the same lock. Similarly, given t = threading.Thread(target=worker):
- t.start() synchronizes with the first statement in worker().
- The return from worker() synchronizes with the return from t.join().
- If t.start() happens before (see below) a call to t.isAlive() that returns False, the return from worker() synchronizes with that call.

We call the source of the synchronizes-with edge a release operation on the relevant variable, and we call the target an acquire operation.
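As a concrete sketch of the first two edges, the following relies only on start() and join() to order accesses to an unlocked shared list (the names log and worker are just for illustration):

    import threading

    log = []  # shared variable; no lock is used

    def worker():
        # t.start() below synchronizes with the first statement here, so this
        # thread is guaranteed to see the "before start" entry already present.
        log.append("in worker")

    log.append("before start")
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    # The return from worker() synchronizes with the return from t.join(), so
    # the main thread is guaranteed to see both entries here.
    print(log)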
The happens before order is the transitive closure of the program order with the synchronizes-with edges. That is, action A happens before action B if:

- A falls before B in the program order (which means they run in the same thread),
- A synchronizes with B, or
- you can reach B by following happens-before edges from A.
An execution of a program is happens-before consistent if each read R sees the value of a write W to the same variable such that:

- R does not happen before W, and
- there is no other write V that overwrote W before R had a chance to see it (i.e. it's not the case that W happens before V happens before R).
You have a data race if two conflicting actions aren't related by happens-before.
Let's use the rules from the happens-before model to prove that the following program prints "[7]":
    import threading

    class Queue:
        def __init__(self):
            self.l = []
            self.cond = threading.Condition()

        def get(self):
            with self.cond:
                while not self.l:
                    self.cond.wait()
                ret = self.l[0]
                self.l = self.l[1:]
                return ret

        def put(self, x):
            with self.cond:
                self.l.append(x)
                self.cond.notify()

    myqueue = Queue()

    def worker1():
        x = [7]
        myqueue.put(x)

    def worker2():
        y = myqueue.get()
        print y

    thread1 = threading.Thread(target=worker1)
    thread2 = threading.Thread(target=worker2)

    thread2.start()
    thread1.start()
- Because myqueue is initialized in the main thread before thread1 or thread2 is started, that initialization happens before worker1 and worker2 begin running, so there's no way for either to raise a NameError, and both myqueue.l and myqueue.cond are set to their final objects.
- The initialization of x in worker1 happens before it calls myqueue.put(), which happens before it calls myqueue.l.append(x), which happens before the call to myqueue.cond.release(), all because they run in the same thread.
- In worker2, myqueue.cond will be released and re-acquired until myqueue.l contains a value (x). The call to myqueue.cond.release() in worker1 happens before that last call to myqueue.cond.acquire() in worker2.
- That last call to myqueue.cond.acquire() happens before myqueue.get() reads myqueue.l, which happens before myqueue.get() returns, which happens before print y, again all because they run in the same thread.
- Therefore, the value of x in thread1 is initialized before it is printed in thread2.

Usually, we wouldn't need to look all the way into a thread-safe queue's implementation in order to prove that uses were safe. Its interface would specify that puts happen before gets, and we'd reason directly from that.
Lots of strange things can happen when code has data races. It's easy to avoid all of these problems by just protecting shared variables with locks. This is not a complete list of race hazards; it's just a collection that seems relevant to Python.
In all of these examples, variables starting with r are local variables, and other variables are shared between threads.
This example comes from the Java memory model:
Initially, p is q and p.x == 0.
Thread 1      Thread 2
r1 = p        r6 = p
r2 = r1.x     r6.x = 3
r3 = q
r4 = r3.x
r5 = r1.x

This can produce r2 == r5 == 0 but r4 == 3, proving that p.x went from 0 to 3 and back to 0.
A good compiler would like to optimize out the redundant load of p.x in initializing r5 by just re-using the value already loaded into r2. We get the strange result if thread 1 sees memory in this order:
Evaluation    Computes    Why
r1 = p
r2 = r1.x     r2 == 0
r3 = q        r3 is p
p.x = 3                   Side-effect of thread 2
r4 = r3.x     r4 == 3
r5 = r2       r5 == 0     Optimized from r5 = r1.x because r2 == r1.x
From N2177: Sequential Consistency for Atomics, and also known as Independent Read of Independent Write (IRIW).
Initially, a == b == 0.
Thread 1    Thread 2    Thread 3    Thread 4
r1 = a      r3 = b      a = 1       b = 1
r2 = b      r4 = a

We may get r1 == r3 == 1 and r2 == r4 == 0, proving both that a was written before b (thread 1's data), and that b was written before a (thread 2's data). See Special Relativity for a real-world example.
This can happen if thread 1 and thread 3 are running on processors that are close to each other, but far away from the processors that threads 2 and 4 are running on, and the writes are not being transmitted all the way across the machine before becoming visible to nearby threads.
Neither acquire/release semantics nor explicit memory barriers can help with this. Making the orders consistent without locking requires detailed knowledge of the architecture's memory model, but Java requires it for volatiles, so we could use documentation aimed at its implementers.
From the POPL paper about the Java memory model [#JMM-popl].
Initially, x == y == 0.
Thread 1        Thread 2
r1 = x          r2 = y
if r1 != 0:     if r2 != 0:
  y = 42          x = 42

Can this produce r1 == r2 == 42?
In a sequentially-consistent execution, there's no way to get an adjacent read and write to the same variable, so the program should be considered correctly synchronized (albeit fragile), and should only produce r1 == r2 == 0. However, the following execution is happens-before consistent:
Statement      Value    Thread
r1 = x         42       1
if r1 != 0:    true     1
  y = 42                1
r2 = y         42       2
if r2 != 0:    true     2
  x = 42                2
WTF, you are asking yourself. Because there were no inter-thread happens-before edges in the original program, the read of x in thread 1 can see any of the writes from thread 2, even if they only happened because the read saw them. There are data races in the happens-before model.
We don't want to allow this, so the happens-before model isn't enough for Python. One rule we could add to happens-before that would prevent this execution is:
If there are no data races in any sequentially-consistent execution of a program, the program should have sequentially consistent semantics.
Java gets this rule as a theorem, but Python may not want all of the machinery you need to prove it.
Also from the POPL paper about the Java memory model [#JMM-popl].
Initially, x == y == 0.
Thread 1    Thread 2
r1 = x      r2 = y
y = r1      x = r2

Can x == y == 42?
In a sequentially consistent execution, no. In a happens-before consistent execution, yes: the read of x in thread 1 is allowed to see the value written in thread 2 because there are no happens-before relations between the threads. This could happen if the compiler or processor transforms the code into:
Thread 1          Thread 2
y = 42            r2 = y
r1 = x            x = r2
if r1 != 42:
  y = r1
It can produce a security hole if the speculated value is a secret object, or points to the memory that an object used to occupy. Java cares a lot about such security holes, but Python may not.
From several classic double-checked locking examples.
Initially, d == None.
Thread 1               Thread 2
while not d: pass      d = [3, 4]
assert d[1] == 4

This could raise an IndexError, fail the assertion, or, without some care in the implementation, cause a crash or other undefined behavior.
Thread 2 may actually be implemented as:
    r1 = list()
    r1.append(3)
    r1.append(4)
    d = r1
Because the assignment to d and the item assignments are independent, the compiler and processor may optimize that to:
    r1 = list()
    d = r1
    r1.append(3)
    r1.append(4)
Which is obviously incorrect and explains the IndexError. If we then look deeper into the implementation of r1.append(3), we may find that it and d[1] cannot run concurrently without causing their own race conditions. In CPython (without the GIL), those race conditions would produce undefined behavior.
There's also a subtle issue on the reading side that can cause the value of d[1] to be out of date. Somewhere in the implementation of list, it stores its contents as an array in memory. This array may happen to be in thread 1's cache. If thread 1's processor reloads d from main memory without reloading the memory that ought to contain the values 3 and 4, it could see stale values instead. As far as I know, this can only actually happen on Alphas and maybe Itaniums, and we probably have to prevent it anyway to avoid crashes.
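As noted at the start of this list of hazards, guarding the shared variable with a lock avoids the problem entirely. Here is a minimal sketch (writer and reader are illustrative names) in which the release in thread 2 happens before the acquire in thread 1:

    import threading

    d = None
    d_lock = threading.Lock()

    def writer():            # thread 2
        global d
        with d_lock:
            d = [3, 4]       # the list is fully built before the lock is released

    def reader():            # thread 1
        while True:
            with d_lock:     # this acquire synchronizes with writer()'s release
                if d is not None:
                    assert d[1] == 4
                    return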
From several more double-checked locking examples.
Initially, d == dict() and initialized == False.
Thread 1                       Thread 2
while not initialized: pass    d['a'] = 3
r1 = d['a']                    initialized = True
r2 = r1 == 3
assert r2

This could raise a KeyError, fail the assertion, or, without some care in the implementation, cause a crash or other undefined behavior.
Because d and initialized are independent (except in the programmer's mind), the compiler and processor can rearrange these almost arbitrarily, except that thread 1's assertion has to stay after the loop.
This is a problem with Java final variables and the proposed data-dependency ordering in C++0x.
First execute:

    g = []

    def Init():
        g.extend([1, 2, 3])
        return [1, 2, 3]

    h = None

Then in two threads:
Thread 1                 Thread 2
while not h: pass        r1 = Init()
assert h == [1, 2, 3]    freeze(r1)
assert h == g            h = r1

If h has semantics similar to a Java final variable (except for being write-once), then even though the first assertion is guaranteed to succeed, the second could fail.
Data-dependent guarantees like those final provides only work if the access is through the final variable. It's not even safe to access the same object through a different route. Unfortunately, because of how processors work, final's guarantees are only cheap when they're weak.
The first rule is that Python interpreters can't crash due to race conditions in user code. For CPython, this means that race conditions can't make it down into C. For Jython, it means that NullPointerExceptions can't escape the interpreter.
Presumably we also want a model at least as strong as happens-before consistency because it lets us write a simple description of how concurrent queues and thread launching and joining work.
Other rules are more debatable, so I'll present each one with pros and cons.
We'd like programmers to be able to reason about their programs as if they were sequentially consistent. Since it's hard to tell whether you've written a happens-before race, we only want to require programmers to prevent sequential races. The Java model does this through a complicated definition of causality, but if we don't want to include that, we can just assert this property directly.
If the program produces a self-justifying value, it could expose access to an object that the user would rather the program not see. Again, Java's model handles this with the causality definition. We might be able to prevent these security problems by banning speculative writes to shared variables, but I don't have a proof of that, and Python may not need those security guarantees anyway.
The .NET [#CLR-msdn] and x86 [#x86-model] memory models are based on defining which reorderings compilers may allow. I think that it's easier to program to a happens-before model than to reason about all of the possible reorderings of a program, and it's easier to insert enough happens-before edges to make a program correct than to insert enough memory fences to do the same thing. So, although we could layer some reordering restrictions on top of the happens-before base, I don't think Python's memory model should be entirely reordering restrictions.
Assignments of primitive types are already atomic. If you assign 3 << 72 + 5 to a variable, no thread can see only part of the value. Jeremy Manson suggested that we extend this to all objects. This allows compilers to reorder operations to optimize them, without allowing some of the more confusing uninitialized values. The basic idea here is that when you assign a shared variable, readers can't see any changes made to the new value before the assignment, or to the old value after the assignment. So, if we have a program like:
Initially, (d.a, d.b) == (1, 2), and (e.c, e.d) == (3, 4). We also have class Obj(object): pass.
Thread 1        Thread 2
r1 = Obj()      r3 = d
r1.a = 3        r4, r5 = r3.a, r3.b
r1.b = 4        r6 = e
d = r1          r7, r8 = r6.c, r6.d
r2 = Obj()
r2.c = 6
r2.d = 7
e = r2
(r4, r5) can be (1, 2) or (3, 4) but nothing else, and (r7, r8) can be either (3, 4) or (6, 7) but nothing else. Unlike if writes were releases and reads were acquires, it's legal for thread 2 to see (e.c, e.d) == (6, 7) and (d.a, d.b) == (1, 2) (out of order).
This allows the compiler a lot of flexibility to optimize without allowing users to see some strange values. However, because it relies on data dependencies, it introduces some surprises of its own. For example, the compiler could freely optimize the above example to:
Thread 1        Thread 2
r1 = Obj()      r3 = d
r2 = Obj()      r6 = e
r1.a = 3        r4, r7 = r3.a, r6.c
r2.c = 6        r5, r8 = r3.b, r6.d
r2.d = 7
e = r2
r1.b = 4
d = r1
As long as it didn't let the initialization of e move above any of the initializations of members of r2, and similarly for d and r1.
This also helps to ground happens-before consistency. To see the problem, imagine that the user unsafely publishes a reference to an object as soon as she gets it. The model needs to constrain what values can be read through that reference. Java says that every field is initialized to 0 before anyone sees the object for the first time, but Python would have trouble defining "every field". If instead we say that assignments to shared variables have to see a value at least as up to date as when the assignment happened, then we don't run into any trouble with early publication.
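For instance, here is a hedged sketch of the early-publication situation just described (the class Widget, the variable shared, and the function reader are purely illustrative):

    shared = None  # module variable visible to other threads

    class Widget(object):
        def __init__(self):
            global shared
            shared = self    # unsafely publishes the object mid-construction
            self.x = 3       # ...before this field is assigned

    def reader():
        r = shared
        if r is not None:
            # r.x may be 3 or may raise AttributeError, depending on timing,
            # but under the rule above the reader can never see the object in
            # a state older than the moment "shared = self" ran.
            print r.x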
Most other languages with any guarantees for unlocked variables distinguish between ordinary variables and volatile/atomic variables. They provide many more guarantees for the volatile ones. Python can't easily do this because we don't declare variables. This may or may not matter, since Python locks aren't significantly more expensive than ordinary Python code. If we want to get those tiers back, we could:
- Introduce a set of atomic types whose accesses provide the stronger guarantees; unfortunately, we couldn't assign to them with a plain =.
- Extend the __slots__ mechanism [7] with a parallel __volatiles__ list, and maybe a __finals__ list.

We could just adopt sequential consistency for Python. This avoids all of the hazards mentioned above, but it prohibits lots of optimizations too. As far as I know, this is the current model of CPython, but if CPython learned to optimize out some variable reads, it would lose this property.
If we adopt this, Jython's dict implementation may no longer be able to use ConcurrentHashMap because that only promises to create appropriate happens-before edges, not to be sequentially consistent (although maybe the fact that Java volatiles are totally ordered carries over). Both Jython and IronPython would probably need to use AtomicReferenceArray or the equivalent for any __slots__ arrays.
The x86 model is:

- Loads are not reordered with other loads.
- Stores are not reordered with other stores.
- Stores are not reordered with older loads.
- Loads may be reordered with older stores to different locations, but not with older stores to the same location.
- In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
- In a multiprocessor system, stores to the same location have a total order.
- In a multiprocessor system, locked instructions have a total order.
- Loads and stores are not reordered with locked instructions.
In acquire/release terminology, this appears to say that every store is a release and every load is an acquire. This is slightly weaker than sequential consistency, in that it allows inconsistent orderings, but it disallows zombie values and the compiler optimizations that produce them. We would probably want to weaken the model somehow to explicitly allow compilers to eliminate redundant variable reads. The x86 model may also be expensive to implement on other platforms, although because x86 is so common, that may not matter much.
We can adopt an initial memory model without totally restricting future implementations. If we start with a weak model and want to get stronger later, we would only have to change the implementations, not programs. Individual implementations could also guarantee a stronger memory model than the language demands, although that could hurt interoperability. On the other hand, if we start with a strong model and want to weaken it later, we can add a from __future__ import weak_memory statement to declare that some modules are safe.
The required model is weaker than any particular implementation. This section tries to document the actual guarantees each implementation provides, and should be updated as the implementations change.
Uses the GIL to guarantee that other threads don't see funny reorderings, and does few enough optimizations that I believe it's actually sequentially consistent at the bytecode level. Threads can switch between any two bytecodes (instead of only between statements), so two threads that concurrently execute:
    i = i + 1
with i initially 0 could easily end up with i == 1 instead of the expected i == 2. If they execute:
    i += 1
instead, CPython 2.6 will always give the right answer, but it's easy to imagine another implementation in which this statement won't be atomic.
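The portable way to guarantee i == 2, consistent with this PEP's advice to guard shared variables with locks, is a sketch along these lines (the names i_lock and increment are arbitrary):

    import threading

    i = 0
    i_lock = threading.Lock()

    def increment():
        global i
        with i_lock:     # the release by one thread synchronizes with the
            i = i + 1    # next acquire, so neither increment can be lost

    threads = [threading.Thread(target=increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # i is now guaranteed to be 2 on any conforming implementation.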
Also uses a GIL, but probably does enough optimization to violate sequential consistency. I know very little about this implementation.
Provides true concurrency under the Java memory model and stores all object fields (except for those in __slots__?) in a ConcurrentHashMap, which provides fairly strong ordering guarantees. Local variables in a function may have fewer guarantees, which would become visible if they were captured into a closure that was then passed to another thread.
Provides true concurrency under the CLR memory model, which probably protects it from uninitialized values. IronPython uses a locked map to store object fields, providing at least as many guarantees as Jython.
Thanks to Jeremy Manson and Alex Martelli for detailed discussions on what this PEP should look like.
This document has been placed in the public domain.