Python Enhancement Proposals

PEP 659 – Specializing Adaptive Interpreter

Author:
Mark Shannon <mark at hotpy.org>
Status:
Final
Type:
Informational
Created:
13-Apr-2021
Post-History:
11-May-2021


Important

This PEP is a historical document. The up-to-date, canonical documentation can now be found at Specializing Adaptive Interpreter.


See PEP 1 for how to propose changes.

Abstract

In order to perform well, virtual machines for dynamic languages must specialize the code that they execute to the types and values in the program being run. This specialization is often associated with “JIT” compilers, but is beneficial even without machine code generation.

A specializing, adaptive interpreter is one that speculatively specializes on the types or values it is currently operating on, and adapts to changes in those types and values.

Specialization gives us improved performance, and adaptation allows the interpreter to rapidly change when the pattern of usage in a program alters, limiting the amount of additional work caused by mis-specialization.

This PEP proposes using a specializing, adaptive interpreter that specializes code aggressively, but over a very small region, and is able to adjust to mis-specialization rapidly and at low cost.

Adding a specializing, adaptive interpreter to CPython will bring significant performance improvements. It is hard to come up with meaningful numbers, as it depends very much on the benchmarks and on work that has not yet happened. Extensive experimentation suggests speedups of up to 50%. Even if the speedup were only 25%, this would still be a worthwhile enhancement.

Motivation

Python is widely acknowledged as slow. Whilst Python will never attain the performance of low-level languages like C, Fortran, or even Java, we would like it to be competitive with fast implementations of scripting languages, like V8 for JavaScript or LuaJIT for Lua. Specifically, we want to achieve these performance goals with CPython to benefit all users of Python, including those unable to use PyPy or other alternative virtual machines.

Achieving these performance goals is a long way off, and will require a lot of engineering effort, but we can make a significant step towards those goals by speeding up the interpreter. Both academic research and practical implementations have shown that a fast interpreter is a key part of a fast virtual machine.

Typical optimizations for virtual machines are expensive, so a long “warm up” time is required to gain confidence that the cost of optimization is justified. In order to get speed-ups rapidly, without noticeable warmup times, the VM should speculate that specialization is justified even after a few executions of a function. To do that effectively, the interpreter must be able to optimize and de-optimize continually and very cheaply.

By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.

Rationale

There are many practical ways to speed up a virtual machine for a dynamic language. However, specialization is the most important, both in itself and as an enabler of other optimizations. Therefore it makes sense to focus our efforts on specialization first, if we want to improve the performance of CPython.

Specialization is typically done in the context of a JIT compiler, but research shows specialization in an interpreter can boost performance significantly, even outperforming a naive compiler [1].

Several ways of doing this have been proposed in the academic literature, but most attempt to optimize regions larger than a single bytecode [1] [2]. Using larger regions than a single instruction requires code to handle de-optimization in the middle of a region. Specialization at the level of individual bytecodes makes de-optimization trivial, as it cannot occur in the middle of a region.

By speculatively specializing individual bytecodes, we can gain significant performance improvements without anything but the most local, and trivial to implement, de-optimizations.

The closest approach to this PEP in the literature is “Inline Caching meets Quickening” [3]. This PEP has the advantages of inline caching, but adds the ability to quickly de-optimize, making the performance more robust in cases where specialization fails or is not stable.

Performance

The speedup from specialization is hard to determine, as many specializationsdepend on other optimizations. Speedups seem to be in the range 10% - 60%.

  • Most of the speedup comes directly from specialization. The largest contributors are speedups to attribute lookup, global variables, and calls.
  • A small, but useful, fraction is from improved dispatch such as super-instructions and other optimizations enabled by quickening.

Implementation

Overview

Any instruction that would benefit from specialization will be replaced by an “adaptive” form of that instruction. When executed, the adaptive instructions will specialize themselves in response to the types and values that they see. This process is known as “quickening”.

Once an instruction in a code object has executed enough times, that instruction will be “specialized” by replacing it with a new instruction that is expected to execute faster for that operation.
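
As a rough sketch of this lifecycle (the opcode names come from this PEP, but the mutable-list model of a code object is an invented simplification, not CPython's C implementation):

    # Bytecode of a code object, modeled as a mutable list of opcode names.
    code = ["LOAD_FAST", "LOAD_ATTR", "RETURN_VALUE"]

    # 1. Quickening: rewrite each specializable instruction to its adaptive form.
    code[1] = "LOAD_ATTR_ADAPTIVE"

    # 2. After enough executions, the adaptive instruction examines its inputs
    #    and replaces itself in place with a specialized variant.
    code[1] = "LOAD_ATTR_INSTANCE_VALUE"

    # 3. If the specialization later keeps failing, the instruction de-optimizes
    #    by replacing its opcode with the adaptive version again.
    code[1] = "LOAD_ATTR_ADAPTIVE"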

Quickening

Quickening is the process of replacing slow instructions with faster variants.

Quickened code has a number of advantages over immutable bytecode:

  • It can be changed at runtime.
  • It can use super-instructions that span lines and take multiple operands.
  • It does not need to handle tracing as it can fall back to the original bytecode for that.

In order that tracing can be supported, the quickened instruction format should match the immutable, user-visible bytecode format: 16-bit instructions of 8-bit opcode followed by 8-bit operand.
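
To make the format concrete, here is a minimal Python sketch of packing and unpacking one such 16-bit instruction word (the helper names are illustrative, not CPython's actual macros, and the byte order within the word is an assumption of the sketch):

    def pack(opcode: int, oparg: int) -> int:
        # 8-bit opcode in one byte, 8-bit operand in the other.
        assert 0 <= opcode < 256 and 0 <= oparg < 256
        return (opcode << 8) | oparg

    def unpack(word: int) -> tuple[int, int]:
        return word >> 8, word & 0xFF

    word = pack(106, 3)          # e.g. an instruction with opcode 106, operand 3
    assert unpack(word) == (106, 3)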

Adaptive instructions

Each instruction that would benefit from specialization is replaced by an adaptive version during quickening. For example, the LOAD_ATTR instruction would be replaced with LOAD_ATTR_ADAPTIVE.

Each adaptive instruction periodically attempts to specialize itself.

Specialization

CPython bytecode contains many instructions that represent high-level operations, and would benefit from specialization. Examples include CALL, LOAD_ATTR, LOAD_GLOBAL and BINARY_ADD.

Introducing a “family” of specialized instructions for each of these instructions allows effective specialization, since each new instruction is specialized to a single task. Each family will include an “adaptive” instruction that maintains a counter and attempts to specialize itself when that counter reaches zero.

Each family will also include one or more specialized instructions that perform the equivalent of the generic operation much faster, provided their inputs are as expected. Each specialized instruction will maintain a saturating counter which will be incremented whenever the inputs are as expected. Should the inputs not be as expected, the counter will be decremented and the generic operation will be performed. If the counter reaches the minimum value, the instruction is de-optimized by simply replacing its opcode with the adaptive version.
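
The counter behaviour described above can be sketched as follows. This is a toy Python simulation of one instruction from a hypothetical BINARY_ADD family, with invented thresholds; it is not CPython's C implementation:

    ADAPTIVE, SPECIALIZED = "BINARY_ADD_ADAPTIVE", "BINARY_ADD_INT"

    class Instruction:
        """One quickened BINARY_ADD instruction. Thresholds are invented."""

        def __init__(self):
            self.opcode = ADAPTIVE   # quickening installed the adaptive form
            self.counter = 8         # executions remaining before trying to specialize

        def execute(self, left, right):
            if self.opcode == ADAPTIVE:
                self.counter -= 1
                if self.counter <= 0 and type(left) is int and type(right) is int:
                    self.opcode = SPECIALIZED   # specialize in place
                    self.counter = 4            # saturating counter, range 0..4
                return left + right             # generic operation
            # Specialized path: fast when the inputs are as expected.
            if type(left) is int and type(right) is int:
                self.counter = min(self.counter + 1, 4)  # increment, saturating
                return left + right                      # stands in for the fast int path
            self.counter -= 1                            # unexpected inputs: decrement
            if self.counter <= 0:
                self.opcode = ADAPTIVE                   # de-optimize in place
                self.counter = 8
            return left + right                          # fall back to the generic operation

    inst = Instruction()
    for _ in range(10):
        inst.execute(1, 2)
    assert inst.opcode == SPECIALIZED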

Ancillary data

Most families of specialized instructions will require more information than can fit in an 8-bit operand. To do this, a number of 16-bit entries immediately following the instruction are used to store this data. This is a form of inline cache, an “inline data cache”. Unspecialized, or adaptive, instructions will use the first entry of this cache as a counter, and simply skip over the others.
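
As an illustration of this layout (the entry meanings and the two-entry count are invented for the example; the real caches are arrays of 16-bit values embedded in the bytecode):

    # A fragment of quickened code: each specializable instruction is followed
    # by its inline cache entries, which the interpreter must skip over.
    code = [
        ("LOAD_ATTR_ADAPTIVE", 0),  # the instruction itself (opcode, oparg)
        53,                         # cache entry 0: counter while adaptive,
                                    #                e.g. a version tag once specialized
        0,                          # cache entry 1: unused while adaptive,
                                    #                e.g. a value index once specialized
        ("RETURN_VALUE", 0),        # the next instruction follows the cache
    ]

    CACHE_ENTRIES = 2  # every member of this family reserves the same number

    def next_offset(offset):
        # Skip the instruction and its inline cache to reach the next instruction.
        return offset + 1 + CACHE_ENTRIES

    assert code[next_offset(0)] == ("RETURN_VALUE", 0)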

Example families of instructions

LOAD_ATTR

The LOAD_ATTR instruction loads the named attribute of the object on top of the stack, then replaces the object on top of the stack with the attribute.

This is an obvious candidate for specialization. Attributes might belong to a normal instance, a class, a module, or one of many other special cases.

LOAD_ATTR would initially be quickened to LOAD_ATTR_ADAPTIVE, which would track how often it is executed, and call the _Py_Specialize_LoadAttr internal function when executed enough times, or jump to the original LOAD_ATTR instruction to perform the load. When optimizing, the kind of the attribute would be examined, and if a suitable specialized instruction was found, it would replace LOAD_ATTR_ADAPTIVE in place.
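
Schematically, the decision inside the specialization function might look like the sketch below. The predicates here are simplified stand-ins: the real _Py_Specialize_LoadAttr performs much more careful checks (descriptors, dict layout, type versions):

    import types

    def specialize_load_attr(owner, name):
        # Simplified: pick a specialized opcode for this attribute access,
        # or return None to leave the instruction adaptive.
        cls = type(owner)
        if isinstance(owner, types.ModuleType):
            return "LOAD_ATTR_MODULE"
        if isinstance(vars(cls).get(name), types.MemberDescriptorType):
            return "LOAD_ATTR_SLOT"            # attribute stored in a __slots__ slot
        if hasattr(owner, "__dict__") and name in vars(owner):
            return "LOAD_ATTR_INSTANCE_VALUE"  # plain instance attribute
        return None                            # no suitable specialization

    class Point:
        __slots__ = ("x", "y")

    p = Point(); p.x = 1
    assert specialize_load_attr(p, "x") == "LOAD_ATTR_SLOT"
    assert specialize_load_attr(types, "ModuleType") == "LOAD_ATTR_MODULE"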

Specialization for LOAD_ATTR might include:

  • LOAD_ATTR_INSTANCE_VALUE: a common case where the attribute is stored in the object’s value array, and not shadowed by an overriding descriptor.
  • LOAD_ATTR_MODULE: load an attribute from a module.
  • LOAD_ATTR_SLOT: load an attribute from an object whose class defines __slots__.

Note how this allows optimizations that complement other optimizations. LOAD_ATTR_INSTANCE_VALUE works well with the “lazy dictionary” used for many objects.

LOAD_GLOBAL

The LOAD_GLOBAL instruction looks up a name in the global namespace and then, if it is not present there, looks it up in the builtins namespace. In 3.9 the C code for LOAD_GLOBAL includes code to check whether the whole code object should be modified to add a cache, code to look up the value in a cache for either the global or builtins namespace, and fallback code. This makes it complicated and bulky. It also performs many redundant operations even when supposedly optimized.

Using a family of instructions makes the code more maintainable and faster, as each instruction only needs to handle one concern.

Specializations would include:

  • LOAD_GLOBAL_ADAPTIVE would operate like LOAD_ATTR_ADAPTIVE above.
  • LOAD_GLOBAL_MODULE can be specialized for the case where the value is in the globals namespace. After checking that the keys of the namespace have not changed, it can load the value from the stored index.
  • LOAD_GLOBAL_BUILTIN can be specialized for the case where the value is in the builtins namespace. It needs to check that the keys of the global namespace have not been added to, and that the builtins namespace has not changed. Note that we don’t care if the values of the global namespace have changed, just the keys. A sketch of these keys-version checks follows this list.
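
The keys-version check mentioned in the last two items can be simulated as below. CPython tracks this with an internal dict-keys version tag and caches an index into the dict's entries; this toy Python model uses an explicit version counter and looks the name up directly, which is a deliberate simplification:

    class Deopt(Exception):
        """Raised when a specialized instruction's assumptions no longer hold."""

    class Namespace:
        """Toy dict that bumps a version tag whenever its set of keys changes."""
        def __init__(self):
            self.values = {}
            self.keys_version = 0
        def __setitem__(self, key, value):
            if key not in self.values:
                self.keys_version += 1   # a new key changes the layout
            self.values[key] = value     # overwriting a value keeps the version

    class LoadGlobalModule:
        def __init__(self, ns, name):
            # Captured at specialization time.
            self.ns, self.name = ns, name
            self.version = ns.keys_version
        def execute(self):
            if self.ns.keys_version != self.version:
                raise Deopt              # keys changed: revert to the adaptive form
            return self.ns.values[self.name]

    g = Namespace()
    g["x"] = 1
    inst = LoadGlobalModule(g, "x")
    assert inst.execute() == 1
    g["x"] = 2                           # value changed, keys unchanged: still valid
    assert inst.execute() == 2
    g["y"] = 3                           # new key bumps the version
    try:
        inst.execute()
    except Deopt:
        pass                             # would re-quicken to LOAD_GLOBAL_ADAPTIVE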

See [4] for a full implementation.

Note

This PEP outlines the mechanisms for managing specialization, and does not specify the particular optimizations to be applied. It is likely that details, or even the entire implementation, may change as the code is further developed.

Compatibility

There will be no change to the language, library or API.

The only way that users will be able to detect the presence of the new interpreter is through timing execution, the use of debugging tools, or measuring memory use.

Costs

Memory use

An obvious concern with any scheme that performs any sort of caching is “how much more memory does it use?”. The short answer is “not that much”.

Comparing memory use to 3.10

CPython 3.10 uses 2 bytes per instruction until the execution count reaches ~2000, at which point it allocates another byte per instruction and 32 bytes per instruction with a cache (LOAD_GLOBAL and LOAD_ATTR).

The following table shows the additional bytes per instruction needed to support the 3.10 opcache or the proposed adaptive interpreter, on a 64-bit machine.

Version        3.10 cold   3.10 hot   3.11
Specialised    0%          ~15%       ~25%
code           2           2          2
opcache_map    0           1          0
opcache/data   0           4.8        4
Total          2           7.8        6

3.10 cold is before the code has reached the ~2000 limit. 3.10 hot shows the cache use once the threshold is reached.

The relative memory use depends on how much code is “hot” enough to trigger creation of the cache in 3.10. The break-even point, where the memory used by 3.10 is the same as for 3.11, is ~70%.
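
To see where the ~70% figure comes from, let h be the fraction of instructions that are hot: per instruction, 3.10 uses 2(1 − h) + 7.8h bytes against a constant 6 bytes for 3.11, and solving 2 + 5.8h = 6 gives h = 4/5.8 ≈ 0.69.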

It is also worth noting that the actual bytecode is only part of a code object. Code objects also include names, constants and quite a lot of debugging information.

In summary, for most applications where many of the functions are relatively unused, 3.11 will consume more memory than 3.10, but not by much.

Security Implications

None

Rejected Ideas

By implementing a specializing adaptive interpreter with inline data caches, we are implicitly rejecting many alternative ways to optimize CPython. However, it is worth emphasizing that some ideas, such as just-in-time compilation, have not been rejected, merely deferred.

Storing data caches before the bytecode.

An earlier implementation of this PEP for 3.11 alpha used a different caching scheme, as described below:

Quickened instructions will be stored in an array (it is neither necessary nor desirable to store them in a Python object) with the same format as the original bytecode. Ancillary data will be stored in a separate array.

Each instruction will use 0 or more data entries. Each instruction within a family must have the same amount of data allocated, although some instructions may not use all of it. Instructions that cannot be specialized, e.g. POP_TOP, do not need any entries. Experiments show that 25% to 30% of instructions can be usefully specialized. Different families will need different amounts of data, but most need 2 entries (16 bytes on a 64-bit machine).

In order to support functions larger than 256 instructions, we compute the offset of the first data entry for an instruction as (instruction offset)//2 + (quickened operand).
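
For example (a direct transcription of the formula above, with illustrative numbers):

    def first_data_entry(instruction_offset: int, quickened_operand: int) -> int:
        return instruction_offset // 2 + quickened_operand

    # An instruction at offset 40 with quickened operand 3 finds its first
    # ancillary data entry at index 23 of the data array.
    assert first_data_entry(40, 3) == 23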

Compared to the opcache in Python 3.10, this design:

  • is faster; it requires no memory reads to compute the offset. 3.10 requires two reads, which are dependent.
  • uses much less memory, as the data can be different sizes for different instruction families, and doesn’t need an additional array of offsets.
  • can support much larger functions, up to about 5000 instructions per function. 3.10 can support about 1000.

We rejected this scheme as the inline cache approach is both faster and simpler.

References

[1] (1, 2)
The construction of high-performance virtual machines for dynamic languages, Mark Shannon, 2011. https://theses.gla.ac.uk/2975/1/2011shannonphd.pdf
[2]
Dynamic Interpretation for Dynamic Scripting Languages. https://www.scss.tcd.ie/publications/tech-reports/reports.09/TCD-CS-2009-37.pdf
[3]
Inline Caching meets Quickening. https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view
[4]
The adaptive and specialized instructions are implemented in https://github.com/python/cpython/blob/main/Python/ceval.c

The optimizations are implemented in https://github.com/python/cpython/blob/main/Python/specialize.c

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.


Source: https://github.com/python/peps/blob/main/peps/pep-0659.rst

Last modified: 2024-10-29 10:45:35 GMT

