
Efficiently Exploiting Multiple Cores with Python

Published: 21st June, 2015
Last Updated: 21st June, 2015

Both the Python reference interpreter (CPython) and the alternative interpreter that offers the fastest single-threaded performance for pure Python code (PyPy) use a Global Interpreter Lock to avoid various problems that arise when using threading models that implicitly allow concurrent access to objects from multiple threads of execution.

This approach has been the source of much debate, both online and off, so this article aims to summarise the design trade-offs involved, and give details on some of the prospects for improvement that are being investigated.

Why is using a Global Interpreter Lock (GIL) a problem?

The key issue with Python implementations that rely on a GIL (most notably CPython and PyPy) is that it makes them entirely unsuitable for cases where a developer wishes to:

  • use shared memory threading to exploit multiple cores on a single machine

  • write their entire application in Python, including CPU bound elements

  • use CPython or PyPy as their interpreter

This combination of requirements simply doesn’t work - the GIL effectively restricts bytecode execution to a single core, thus rendering pure Python threads an ineffective tool for distributing CPU bound work across multiple cores.
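
The effect is easy to observe directly. The sketch below is illustrative only (exact numbers will vary by machine and interpreter version): it runs the same CPU bound countdown twice serially and then on two threads, and on CPython the threaded version takes roughly as long as the serial one, since only one thread can execute bytecode at a time:

    import threading
    import time

    def spin(n):
        # Pure Python busy loop - it executes bytecode the whole time,
        # so it never releases the GIL except at the usual switch interval
        while n:
            n -= 1

    N = 10 ** 7

    start = time.perf_counter()
    spin(N)
    spin(N)
    print("serial:  ", time.perf_counter() - start)

    start = time.perf_counter()
    threads = [threading.Thread(target=spin, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("threaded:", time.perf_counter() - start)  # comparable to serial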

At this point, one of those requirements has to give. The developer has to either:

  • use a parallel execution technique other than shared memory threading

    The main alternative provided in the standard library for CPU bound applications is the multiprocessing module, which works well for workloads that consist of relatively small numbers of long running computational tasks, but results in excessive message passing overhead if the duration of individual operations is short

  • move parts of the application out into binary extension modules, including wrappers for existing third party libraries

    This is the path taken by the NumPy/SciPy community, Cython users and many other people using Python as a glue language to bind disparate components together

  • use a Python implementation that doesn’t rely on a GIL

    While the main purpose of Jython and IronPython is to interoperate with other JVM and CLR components, they are also free threaded thanks to the cross-platform threading primitives provided by the underlying virtual machines.

  • use a language other than Python for the entire application

    This is a popular approach for established systems where the problem domain is now well understood and the service’s scope is stable. In these kinds of situations, efficiency-of-execution considerations start to weigh more heavily than ease-of-modification considerations in the choice of development language, which tends to count heavily against languages like Python that deliberately avoid doing any kind of inter-module consistency analysis at compile time.

    This approach also works very well for applications that happen to fall entirely within the purview of more specialised languages, such as JavaScript for web service development, Go for network services and command line applications, and Julia for data analysis.

Many Python developers find this annoying - they want to use threads to take full advantage of multicore machines and they want to use Python, but they have the CPython and PyPy core developers in their way saying “Sorry, we don’t recommend that style of programming”.

What alternative approaches are available?

Assuming that a free-threaded Python implementation like Jython or IronPython isn’t suitable for a given application, then there are two main approaches to handling distribution of CPU bound Python workloads across multiple cores in the presence of a GIL. Which one will be more appropriate will depend on the specific task and developer preference.

The approach most directly supported by python-dev is the use of process-based concurrency rather than thread-based concurrency. All major threading APIs have a process-based equivalent, allowing threading to be used for concurrent synchronous IO calls, while multiple processes can be used for concurrent CPU bound calculations in Python code. The strict memory separation imposed by using multiple processes also makes it much easier to avoid many of the common traps of multi-threaded code. As another added bonus, for applications which would benefit from scaling beyond the limits of a single machine, starting with multiple processes means that any reliance on shared memory will already be gone, removing one of the major stumbling blocks to distributed processing.
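
A minimal sketch of that API symmetry (the worker function is just a stand-in for real application code): the same CPU bound callable can be driven by either threading or multiprocessing with near-identical code, but only the process-based version can occupy a second core:

    import multiprocessing
    import threading

    def worker(n):
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # Thread: shares the interpreter (and hence the GIL) with the main thread
        t = threading.Thread(target=worker, args=(10 ** 6,))
        # Process: runs in a separate interpreter with its own GIL
        p = multiprocessing.Process(target=worker, args=(10 ** 6,))
        t.start(); p.start()
        t.join(); p.join()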

The main downside of this approach is that the overhead of message serialisation and interprocess communication can significantly increase the response latency and reduce the overall throughput of an application (see this PyCon 2015 presentation from David Beazley for some example figures). Whether or not this overhead is considered acceptable in any given application will depend on the relative proportion of time that application ends up spending on interprocess communication overhead versus doing useful work.

The major alternative approach promoted by the community is best represented by Cython. Cython is a Python superset designed to be compiled down to CPython C extension modules. One of the features Cython offers (as is possible from any binary extension module) is the ability to explicitly release the GIL around a section of code. By releasing the GIL in this fashion, Cython code can fully exploit all cores on a machine for computationally intensive sections of the code, while retaining all the benefits of Python for other parts of the application.
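
For example, a minimal Cython sketch (a hypothetical fastsum.pyx, which would need to be compiled with OpenMP support enabled): cython.parallel.prange releases the GIL around the loop and spreads the iterations across multiple threads:

    # fastsum.pyx - compile with OpenMP enabled (e.g. -fopenmp)
    from cython.parallel import prange

    def parallel_sum(double[:] data):
        cdef double total = 0
        cdef Py_ssize_t i
        # nogil=True releases the GIL for the duration of the loop;
        # Cython infers that "total" is a reduction variable
        for i in prange(data.shape[0], nogil=True):
            total += data[i]
        return total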

Numba is another tool in a similar vein - it uses LLVM to convert Python code to machine code that can run with the GIL released (as well as exploiting vector operations provided by the CPU when appropriate).
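
A comparable Numba sketch (assuming Numba is installed): compiling in nopython mode with nogil=True produces machine code that ordinary Python threads can run simultaneously:

    import numba

    @numba.njit(nogil=True)
    def total(data):
        # Compiled to machine code by LLVM; runs without holding the GIL,
        # so calls from multiple threads can proceed in parallel
        s = 0.0
        for x in data:
            s += x
        return s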

This approach also works when calling out to any code written in other languages: release the GIL when handing over control to the external library, reacquire it when returning control to the Python interpreter. Many binary extension modules for Python already do this implicitly (especially those developed by the members of the Python community focused on data analysis tasks).
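
The standard library’s ctypes module follows the same convention: it releases the GIL around each foreign function call, so threads blocked in external code don’t prevent other Python threads from running. A small sketch (assumes a POSIX libc is available to load):

    import ctypes
    import ctypes.util
    import threading

    libc = ctypes.CDLL(ctypes.util.find_library("c"))

    def worker():
        # The GIL is released while this C call sleeps, so all four
        # threads can be blocked in libc at the same time
        libc.usleep(500000)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()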

Why hasn’t resolving this been a priority for the core development team?

Speaking for myself, I came to Python by way of the unittest module: I needed to write a better test suite for a C++ library that communicated with a custom DSP application, and by using SWIG and the Python unittest module I was able to do so easily. Using Python for the test suite also let me easily play audio files out of the test hardware into the DSP unit being tested. Still in the test domain, I later used Python to communicate with serial hardware (and push data through serial circuits and analyse what came back), write prototype clients to ensure new hardware systems were fully functional replacements for old ones, and write hardware simulators to allow more integration errors to be caught during software development rather than only after new releases were deployed to the test lab that had real hardware available.

Other current Python users are often in a similar situation: we’re using Python as an orchestration language, getting other pieces of hardware and software to play nice, so the Python components just need to be “fast enough”, and allow multiple external operations to occur in parallel, rather than necessarily needing to run Python bytecode operations concurrently. When our Python code isn’t the bottleneck in our overall system throughput, and we aren’t operating at a scale where even small optimisations to our software can have a significant impact on our overall CPU time and power consumption costs, then investing effort in speeding up our Python code doesn’t offer a good return on our time.

This is certainly true of the scientific community, where the heavy numeric lifting is often done in C or FORTRAN, and the Python components are there to make everything hang together in a way that humans can read relatively easily.

In the case of web development, while the speed of the application server may become a determining factor at truly massive scale, smaller applications are likely to gain more through language independent techniques, like adding a Varnish caching server in front of the overall application or a memory cache to avoid repeating calculations for common inputs, before the application code itself becomes the bottleneck.

This means for the kind of use case where Python is primarily playing an orchestration role, as well as those where the application is IO bound rather than CPU bound, being able to run across multiple cores doesn’t really provide a lot of benefit - the Python code was never the bottleneck in the first place, so focusing optimisation efforts on the Python components doesn’t make sense.

Instead, people drop out of pure Python code into an environment that is vastly easier to optimise and already supports running across multiple cores within a single process. This may be hand written C or C++ code, it may be something with Pythonic syntax but reduced dynamism like Cython or Numba, or it may be another more static language on a preexisting runtime like the JVM or the CLR, but however it is achieved, the level shift allows optimisations and parallelism to be applied at the places where they will do the most good for the overall speed of the application.

Why isn’t “just remove the GIL” the obvious answer?

Removing the GIL is the obvious answer. The problem with this phrase is the “just” part, not the “remove the GIL” part.

One of the key issues with threading models built on shared non-transactional memory is that they are a broken approach to general purpose concurrency. Armin Rigo has explained that far more eloquently than I can in the introduction to his Software Transactional Memory work for PyPy, but the general idea is that threading is to concurrency as the Python 2 Unicode model is to text handling - it works great a lot of the time, but if you make a mistake (which is inevitable in any non-trivial program) the consequences are unpredictable (and often catastrophic from an application stability point of view), and the resulting situations are frequently a nightmare to debug.

The advantages of GIL-style coarse grained locking for the CPython interpreter implementation are that it makes naively threaded code more likely to run correctly, greatly simplifies the interpreter implementation (thus increasing general reliability and ease of porting to other platforms) and has almost zero overhead when running in single-threaded mode for simple scripts or event driven applications which don’t need to interact with any synchronous APIs (as the GIL is not initialised until the threading support is imported, or initialised via the C API, the only overhead is a boolean check to see if the GIL has been created).

The CPython development team have long had an essential list of requirements that any major improvement to CPython’s parallel execution support would be expected to meet before it could be considered for incorporation into the reference interpreter:

  • must not substantially slow down single-threaded applications

  • must not substantially increase latency times in IO bound applications

  • threading support must remain optional to ease porting to platforms with no (or broken) threading primitives

  • must minimise breakage of current end user Python code that implicitly relies on the coarse-grained locking provided by the GIL (I recommend consulting Armin’s STM introduction on the challenges posed by this)

  • must remain compatible with existing third party C extensions that rely on refcounting and the GIL (I recommend consulting with the cpyext and IronClad developers both on the difficulty of meeting this requirement, and the lack of interest many parts of the community have in any Python implementation that doesn’t abide by it)

  • must achieve all of these without reducing the number of supported platforms for CPython, or substantially increasing the difficulty of porting the CPython interpreter to a new platform (I recommend consulting with the JVM and CLR developers on the difficulty of producing and maintaining high performance cross platform threading primitives).

It is important to keep in mind that CPython already has a significant user base (sufficient to see Python ranked by IEEE Spectrum in 2014 as one of the top 5 programming languages in the world), and it’s necessarily the case that these users either don’t find the GIL to be an intolerable burden for their use cases, or else find it to be a problem that is tolerably easy to work around.

Core development efforts in the concurrency and parallelism arena have thus historically focused on better serving the needs of those users by providing better primitives for easily distributing work across multiple processes and performing multiple IO operations in parallel. Examples of this approach include the initial incorporation of the multiprocessing module, which aims to make it easy to migrate from threaded code to multiprocess code, along with the addition of the concurrent.futures module in Python 3.2, which aims to make it easy to take serial code and dispatch it to multiple threads (for IO bound operations) or multiple processes (for CPU bound operations), the asyncio module in Python 3.4 (which provides full support for explicit asynchronous programming in the standard library) and the introduction of the dedicated async/await syntax for native coroutines in Python 3.5.
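
With concurrent.futures, for instance, the same dispatch pattern covers both cases; swapping ProcessPoolExecutor for ThreadPoolExecutor is the only change needed to move between CPU bound and IO bound workloads (a minimal sketch, with cpu_bound standing in for real work):

    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def cpu_bound(n):
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # Use ThreadPoolExecutor instead for IO bound callables
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(cpu_bound, [10 ** 6] * 8))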

For IO bound code (with no CPU bound threads present), or, equivalently, code that invokes external libraries to perform calculations (as is the case for most serious number crunching code, such as that using NumPy and/or Cython), the GIL does place an additional constraint on the application, but one that is acceptable in many cases: a single core must be able to handle all Python execution on the machine, with other cores either left idle (IO bound systems) or busy handling calculations (external library invocations). If that is not the case, then multiple interpreter processes will be needed, just as they are in the case of any CPU bound Python threads.

What are the key problems with fine-grained locking as an answer?

For seriously parallel problems, a free threaded interpreter that uses fine-grained locking to scale across multiple cores doesn’t help all that much, as it is desired to scale not only to multiple cores on a single machine, but to multiple machines. As soon as a second machine enters the picture, shared memory based concurrency can’t help you: you need to use a parallel execution model (such as message passing or a shared datastore) that allows information to be passed between processes, either on a single machine or on multiple machines. (Folks that have this kind of problem to solve would be well advised to investigate the viability of adopting Apache Spark as their computational platform, either directly or through the Blaze abstraction layer.)

CPython also has another problem that limits the effectiveness of removing the GIL by switching to fine-grained locking: we use a reference counting garbage collector with cycle detection. This hurts free threading in two major ways: firstly, any free threaded solution that retains the reference counting GC will still need a global lock that protects the integrity of the reference counts; secondly, switching threads in the CPython runtime will mean updating the reference counts on a whole new working set of objects, almost certainly blowing the CPU cache and losing some of the speed benefits gained from making more effective use of multiple cores.

So for a truly free-threaded interpreter, the reference counting GC would likely have to go as well, or be replaced with an allocation model that uses a separate heap per thread by default, creating yet another compatibility problem for C extensions (and one that we already know from experience with PyPy, Jython and IronPython poses significant barriers to runtime adoption).

These various factors all combine to explain why it’s unlikely we’ll ever see CPython’s coarse-grained locking model replaced by a fine-grained locking model within the scope of the CPython project itself:

  • a coarse-grained lock makes threaded code behave in a less surprising fashion

  • a coarse-grained lock makes the implementation substantially simpler

  • a coarse-grained lock imposes negligible overhead on the scripting use case

  • fine-grained locking provides no benefits to single-threaded code (such as end user scripts)

  • fine-grained locking may break end user code that implicitly relies on CPython’s use of coarse grained locking

  • fine-grained locking provides minimal benefits to event-based code that uses threads solely to provide asynchronous access to external synchronous interfaces (such as web applications using an event based framework like Twisted or gevent, or GUI applications using the GUI event loop)

  • fine-grained locking provides minimal benefits to code that uses other languages like Cython, C or Fortran for the serious number crunching (as is common in the NumPy/SciPy community)

  • fine-grained locking provides no substantial benefits to code that needs to scale to multiple machines, and thus cannot rely on shared memory for data exchange

  • a refcounting GC doesn’t really play well with fine-grained locking (primarily from the point of view of high contention on the lock that protects the integrity of the refcounts, but also the bad effects on caching when switching to different threads and writing to the refcount fields of a new working set of objects)

  • increasing the complexity of the core interpreter implementation for any reason always poses risks to maintainability, reliability and portability

It isn’t that a free threaded Python implementation that complies with the Python Language and Library References isn’t possible (Jython and IronPython prove that’s not the case), it’s that free threaded virtual machines are hard to write correctly in the first place, and are harder to maintain once implemented. For CPython specifically, any engineering effort directed towards free threading support is engineering effort that isn’t being directed somewhere else. The current core development team don’t consider that to be a good trade-off when there are other far more interesting options still to be explored.

What does the future look like for exploitation of multiple cores in Python?

For CPython, Eric Snow has started working with Dr Sarah Mount (at the University of Wolverhampton) to investigate some speculative ideas I published a few years back regarding the possibility of refining CPython’s subinterpreter support to make it a first class language feature that offered true in-process support for parallel exploitation of multiple cores in a way that didn’t break compatibility with C extension modules (at least, not any more than using subinterpreters in combination with extensions that call back into Python from C created threads already breaks it).

For PyPy, Armin Rigo and others are actively pursuing research into the use of Software Transactional Memory to allow event driven programs to be scaled transparently across multiple CPU cores. I know he has some thoughts on how the concepts he is exploring in PyPy could be translated back to CPython, but even if that doesn’t pan out, it’s very easy to envision a future where CPython is used for command line utilities (which are generally single threaded and often so short running that the PyPy JIT never gets a chance to warm up) and embedded systems, while PyPy takes over the execution of long running scripts and applications, letting them run substantially faster and span multiple cores without requiring any modifications to the Python code. Splitting the role of the two VMs in that fashion would allow each to be optimised appropriately, rather than having to make trade-offs that attempt to balance the starkly different needs of the various use cases.

I also expect we’ll continue to add APIs and features designed to make it easier to farm work out to other processes (for example, the new iteration of the pickle protocol in Python 3.4 included the ability to unpickle unbound methods by name, which allows them to be used with the multiprocessing APIs).
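
A brief sketch of what that change enables (Point is just a hypothetical example class): pickle protocol 4 can serialise methods by their dotted qualified name, where earlier protocols would raise a PicklingError:

    import pickle

    class Point:
        def __init__(self, x):
            self.x = x
        def double(self):
            return 2 * self.x

    # Protocol 4 records the qualified name "Point.double",
    # so the method round-trips by reference
    restored = pickle.loads(pickle.dumps(Point.double, protocol=4))
    assert restored is Point.double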

For data processing workloads, Python users that would prefer something simpler to deploy than Apache Spark, don’t want to compile their own C extensions with Cython, and have data which exceeds the capacity of NumPy’s in-memory calculation model on the systems they have access to, may wish to investigate the Dask project, which aims to offer the features of core components of the Scientific Python ecosystem (notably, NumPy and Pandas) in a form which is limited by the capacity of local disk storage, rather than the capacity of local memory.

Another potentially interesting project is Trent Nelson’s PyParallel work on using memory page locking to permit the creation of “shared nothing” worker threads, which would allow the use of a more Rust-style memory model within CPython without introducing a distinct subinterpreter based parallel execution model.

Alex Gaynor also pointed out some interesting research (PDF) into replacing Ruby’s Giant VM Lock (the equivalent to CPython’s GIL in CRuby, aka the Matz Ruby Interpreter) with appropriate use of Hardware Transactional Memory, which may also prove relevant to CPython as HTM capable hardware becomes more common. (However, note the difficulties that the refcounting in MRI caused the researchers - CPython is likely to have exactly the same problem, with a well established history of attempting to eliminate and then emulate the refcounting causing major compatibility problems with extension modules).
