[Python-Dev] Reworking the GIL

Antoine Pitrou  solipsis at pitrou.net
Sun Oct 25 21:22:20 CET 2009


Hello there,

The last couple of days I've been working on an experimental rewrite of
the GIL. Since the work has been turning out rather successful (or, at
least, not totally useless and crashing!) I thought I'd announce it
here.

First I want to stress this is not about removing the GIL. There still
is a Global Interpreter Lock which serializes access to most parts of
the interpreter. These protected parts haven't changed either, so Python
doesn't become really better at extracting computational parallelism out
of several cores.

Goals
-----

The new GIL (which is also the name of the sandbox area I've committed
it in, "newgil") addresses the following issues:

1) Switching by opcode counting. Counting opcodes is a very crude way of
estimating times, since the time spent executing a single opcode can
vary wildly. Literally, an opcode can be as short as a handful of
nanoseconds (think something like "... is not None") or as long as a
fraction of a second, or even longer (think calling a heavy non-GIL-releasing
C function, such as re.search()). Therefore, releasing the GIL
every 100 opcodes, regardless of their length, is a very poor policy.
The new GIL does away with this by ditching _Py_Ticker entirely and
instead using a fixed interval (by default 5 milliseconds, but settable)
after which we ask the main thread to release the GIL and let another
thread be scheduled.

2) GIL overhead and efficiency in contended situations. Apparently, some
OSes (OS X mainly) have problems with lock performance when the lock is
already taken: the system calls are heavy. This is the "Dave Beazley
effect", where he took a very trivial loop, therefore made of very short
opcodes and therefore releasing the GIL very often (probably 100000
times a second), and ran it in one or two threads on an OS with poor
lock performance (OS X).
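The time-based scheme from point 1 can be sketched in pure Python (an
illustrative toy model with hypothetical names, not the actual C code
from the newgil branch): a thread waiting for the GIL sleeps on a
condition variable for at most one switching interval, and only sets a
drop request if the holder still hasn't released by then.

```python
import threading

SWITCH_INTERVAL = 0.005  # the default 5 ms switching interval

class TimedGIL:
    """Toy model of the time-based GIL: a mere flag protected by a
    {mutex, condition} pair, plus a drop request set by waiters."""

    def __init__(self):
        self._cond = threading.Condition()   # the {mutex, condition} pair
        self._locked = False                 # the GIL itself, a flag
        self.drop_request = False            # polled by the eval loop

    def take(self):
        with self._cond:
            while self._locked:
                # Sleep at most one interval; if the holder still hasn't
                # released by then, ask it to (no opcode counting involved).
                timed_out = not self._cond.wait(timeout=SWITCH_INTERVAL)
                if timed_out and self._locked:
                    self.drop_request = True
            self._locked = True
            self.drop_request = False

    def drop(self):
        with self._cond:
            self._locked = False
            self._cond.notify()
```

In this model the eval loop would periodically check `drop_request` and
call `drop()` when it is set, instead of decrementing `_Py_Ticker` on
every opcode.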
Beazley sees a 50% increase in runtime when using
two threads rather than one, in what is admittedly a pathological case.
Even on better platforms such as Linux, eliminating the overhead of many
GIL acquires and releases (since the new GIL is released on a fixed time
basis rather than on an opcode counting basis) yields slightly better
performance (read: a smaller performance degradation :-)) when there are
several pure Python computation threads running.

3) Thread switching latency. The traditional scheme merely releases the
GIL for a couple of CPU cycles, and reacquires it immediately.
Unfortunately, this doesn't mean the OS will automatically switch to
another, GIL-awaiting thread. In many situations, the same thread will
continue running. This, with the opcode counting scheme, is the reason
why some people have been complaining about latency problems when an I/O
thread competes with a computational thread (the I/O thread wouldn't be
scheduled right away when e.g. a packet arrives; or rather, it would be
scheduled by the OS, but unscheduled immediately when trying to acquire
the GIL, and it would be scheduled again only much later).

The new GIL improves on this by combining two mechanisms:

- forced thread switching, which means that when the switching interval
is over (mentioned in 1) and the GIL is released, we will force
any of the threads waiting on the GIL to be scheduled instead of the
formerly GIL-holding thread. Which thread exactly is an OS decision,
however: the goal here is not to have our own scheduler (this could be
discussed but I wanted the design to remain simple :-) After all,
man-years of work have been invested in scheduling algorithms by kernel
programming teams).

- priority requests, which is an option for a thread requesting the GIL
to be scheduled as soon as possible, and forcibly (ahead of any other
waiting threads). This is meant to be used by GIL-releasing methods such
as read() on files and sockets.
The scheme, again, is very simple: when a
priority request is made by a thread, the GIL is released as soon as
possible by the thread holding it (including in the eval loop), and then
the thread making the priority request is forcibly scheduled (by making
all other GIL-awaiting threads wait in the meantime).

Implementation
--------------

The new GIL is implemented using a couple of mutexes and condition
variables. A {mutex, condition} pair is used to protect the GIL itself,
which is a mere variable named `gil_locked` (there are a couple of other
variables for bookkeeping). Another {mutex, condition} pair is used for
forced thread switching (described above). Finally, a separate mutex is
used for priority requests (described above).

The code is in the sandbox:

http://svn.python.org/view/sandbox/trunk/newgil/

The file of interest is Python/ceval_gil.h. Changes in other files are
very minimal, except for priority requests, which have been added at
strategic places (some methods of I/O modules). Also, the code remains
rather short, while of course being less trivial than the old one.

NB: this is a branch of py3k. There should be no real difficulty
porting it back to trunk, provided someone wants to do the job.

Platforms
---------

I've implemented the new GIL for POSIX and Windows (tested under Linux
and Windows XP (running in a VM)). Judging by what I can read in the
online MSDN docs, the Windows support should include everything from
Windows 2000, and probably recent versions of Windows CE.

Other platforms aren't implemented, because I don't have access to the
necessary hardware. Besides, I must admit I'm not very motivated to
work on niche/obsolete systems.
I've e-mailed Andrew MacIntyre in
private to ask him if he'd like to do the OS/2 support.

Supporting a new platform is not very difficult: it's a matter of
writing the 50-or-so lines of necessary platform-specific macros at the
beginning of Python/ceval_gil.h.

The reason I couldn't use the existing thread support
(Python/thread_*.h) is that these abstractions are too poor. Mainly,
they don't provide:

- events, conditions or an equivalent thereof
- the ability to acquire a resource with a timeout

Measurements
------------

Before starting this work, I wrote ccbench (*), a little benchmark
script ("ccbench" being a shorthand for "concurrency benchmark") which
measures two things:

- computation throughput with one or several concurrent threads
- latency to external events (I use a UDP socket) when there are zero,
one, or several background computation threads running

(*) http://svn.python.org/view/sandbox/trunk/ccbench/

The benchmark involves several computation workloads with different GIL
characteristics. By default there are 3 of them:

A- one pure Python workload (computation of a number of digits of pi):
that is, something which spends its time in the eval loop

B- one mostly C workload where the C implementation doesn't release the
GIL (regular expression matching)

C- one mostly C workload where the implementation does release the GIL
(bz2 compression)

In the ccbench directory you will find benchmark results, under Linux,
for two different systems I have here. The new GIL shows roughly similar
but slightly better throughput results than the old one. And it is much
better in the latency tests, especially in workload B (going down from
almost a second of average latency with the old GIL, to a couple of
milliseconds with the new GIL).
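To give a feel for what the latency measurement does, here is a
stripped-down, hypothetical version (not the actual ccbench code, and
the names `spin` and `udp_latency` are made up): a background thread
spins in pure Python, like workload A, while the main thread timestamps
UDP round-trips on the loopback interface.

```python
import socket
import threading
import time

def spin(stop):
    # Background pure-Python workload: keeps the GIL busy in the eval loop.
    x = 1.0
    while not stop.is_set():
        x = x * 3.0 % 7.0

def udp_latency(n=20):
    # Send a datagram to ourselves and measure how long until recvfrom()
    # (a GIL-releasing call) gets scheduled again to pick it up.
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv.bind(("127.0.0.1", 0))
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    addr = recv.getsockname()
    latencies = []
    for _ in range(n):
        t0 = time.monotonic()
        send.sendto(b"x", addr)
        recv.recvfrom(16)
        latencies.append(time.monotonic() - t0)
    recv.close()
    send.close()
    return sum(latencies) / len(latencies)

stop = threading.Event()
t = threading.Thread(target=spin, args=(stop,))
t.start()
avg = udp_latency()
stop.set()
t.join()
print("average latency: %.3f ms" % (avg * 1000))
```

With the old GIL, the compute thread tends to win the reacquisition
race, so the recvfrom() side can be delayed for many switching periods;
the forced-switch and priority-request mechanisms are what bring this
number down.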
The latency improvement is the combined result of using a
time-based scheme (rather than an opcode-based one) and of forced thread
switching (rather than relying on the OS to actually switch threads when
we speculatively release the GIL).

As a side note, I might mention that single-threaded performance is not
degraded at all. It is, actually, theoretically a bit better because the
old ticker check in the eval loop becomes simpler; however, this goes
mostly unnoticed.

Now what remains to be done?

Having other people test it would be fine. Even better if you have an
actual multi-threaded py3k application. But ccbench results for other
OSes would be nice too :-)

(I get good results under the Windows XP VM but I feel that a VM is not
an ideal setup for a concurrency benchmark.)

Of course, studying and reviewing the code is welcome. As for
integrating it into the mainline py3k branch, I guess we have to answer
these questions:

- is the approach interesting? (we could decide that it's just not worth
it, and that a good GIL can only be a dead (removed) GIL)
- is the patch good, mature and debugged enough?
- how do we deal with the unsupported platforms (POSIX and Windows
support should cover most bases, but the fate of OS/2 support depends on
Andrew)?

Regards

Antoine.


More information about the Python-Dev mailing list
