
Python Enhancement Proposals

PEP 709 – Inlined comprehensions

Author:
Carl Meyer <carl at oddbird.net>
Sponsor:
Guido van Rossum <guido at python.org>
Discussions-To:
Discourse thread
Status:
Final
Type:
Standards Track
Created:
24-Feb-2023
Python-Version:
3.12
Post-History:
25-Feb-2023
Resolution:
Discourse message


Abstract

Comprehensions are currently compiled as nested functions, which provides isolation of the comprehension’s iteration variable, but is inefficient at runtime. This PEP proposes to inline list, dictionary, and set comprehensions into the code where they are defined, and provide the expected isolation by pushing/popping clashing locals on the stack. This change makes comprehensions much faster: up to 2x faster for a microbenchmark of a comprehension alone, translating to an 11% speedup for one sample benchmark derived from real-world code that makes heavy use of comprehensions in the context of doing actual work.

Motivation

Comprehensions are a popular and widely-used feature of the Python language. The nested-function compilation of comprehensions optimizes for compiler simplicity at the expense of performance of user code. It is possible to provide near-identical semantics (see Backwards Compatibility) with much better runtime performance for all users of comprehensions, with only a small increase in compiler complexity.

Rationale

Inlining is a common compiler optimization in many languages. Generalized inlining of function calls at compile time in Python is near-impossible, since call targets may be patched at runtime. Comprehensions are a special case, where we have a call target known statically in the compiler that can neither be patched (barring undocumented and unsupported fiddling with bytecode directly) nor escape.

Inlining also permits other compiler optimizations of bytecode to be more effective, because they can now “see through” the comprehension bytecode, instead of it being an opaque call.

Normally a performance improvement would not require a PEP. In this case, the simplest and most efficient implementation results in some user-visible effects, so this is not just a performance improvement, it is a (small) change to the language.

Specification

Given a simple comprehension:

def f(lst):
    return [x for x in lst]

The compiler currently emits the following bytecode for the function f:

1           0 RESUME                   0

2           2 LOAD_CONST               1 (<code object <listcomp> at 0x...)
            4 MAKE_FUNCTION            0
            6 LOAD_FAST                0 (lst)
            8 GET_ITER
           10 CALL                     0
           20 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x...>:

2           0 RESUME                   0
            2 BUILD_LIST               0
            4 LOAD_FAST                0 (.0)
      >>    6 FOR_ITER                 4 (to 18)
           10 STORE_FAST               1 (x)
           12 LOAD_FAST                1 (x)
           14 LIST_APPEND              2
           16 JUMP_BACKWARD            6 (to 6)
      >>   18 END_FOR
           20 RETURN_VALUE

The bytecode for the comprehension is in a separate code object. Each time f() is called, a new single-use function object is allocated (by MAKE_FUNCTION), called (allocating and then destroying a new frame on the Python stack), and then immediately thrown away.

Under this PEP, the compiler will emit the following bytecode for f() instead:

1           0 RESUME                   0

2           2 LOAD_FAST                0 (lst)
            4 GET_ITER
            6 LOAD_FAST_AND_CLEAR      1 (x)
            8 SWAP                     2
           10 BUILD_LIST               0
           12 SWAP                     2
      >>   14 FOR_ITER                 4 (to 26)
           18 STORE_FAST               1 (x)
           20 LOAD_FAST                1 (x)
           22 LIST_APPEND              2
           24 JUMP_BACKWARD            6 (to 14)
      >>   26 END_FOR
           28 SWAP                     2
           30 STORE_FAST               1 (x)
           32 RETURN_VALUE

There is no longer a separate code object, nor creation of a single-use function object, nor any need to create and destroy a Python frame.
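
Both disassemblies above can be reproduced with the standard dis module, which recursively disassembles nested code objects; a minimal sketch (the exact output depends on the Python version, and with inlined comprehensions there is no nested <listcomp> code object to show):

import dis

def f(lst):
    return [x for x in lst]

# On versions using the nested-function compilation, this also prints the
# separate disassembly of the <listcomp> code object; with inlining it does not.
dis.dis(f)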

Isolation of the x iteration variable is achieved by the combination of the new LOAD_FAST_AND_CLEAR opcode at offset 6, which saves any outer value of x on the stack before running the comprehension, and 30 STORE_FAST, which restores the outer value of x (if any) after running the comprehension.
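
For example, this save/restore preserves the long-standing semantics that the iteration variable neither leaks into nor clobbers a same-named variable in the enclosing scope; a minimal illustration:

def f(lst):
    x = "outer"                    # pre-existing local with the same name
    doubled = [x * 2 for x in lst]
    # LOAD_FAST_AND_CLEAR stashed "outer" on the stack before the loop and
    # STORE_FAST restored it afterwards, so the outer x is unchanged.
    return x, doubled

print(f([1, 2, 3]))                # ('outer', [2, 4, 6])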

If the comprehension accesses variables from the outer scope, inlining avoids the need to place these variables in a cell, allowing the comprehension (and all other code in the outer function) to access them as normal fast locals instead. This provides further performance gains.
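
As an illustration, consider a function whose comprehension reads an enclosing local. Under the nested-function compilation that local must become a cell variable so the hidden <listcomp> function can close over it, whereas with inlining it can remain an ordinary fast local. A sketch (the exact co_cellvars contents are version-dependent):

def f(lst, n):
    # n is read inside the comprehension
    return [x * n for x in lst]

# Nested-function compilation: ('n',)  -- n lives in a cell.
# Inlined comprehension:       ()      -- n stays a plain fast local.
print(f.__code__.co_cellvars)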

In some cases, the comprehension iteration variable may be a global or cellvar or freevar, rather than a simple function local, in the outer scope. In these cases, the compiler also internally pushes and pops the scope information for the variable when entering/leaving the comprehension, so that semantics are maintained. For example, if the variable is a global outside the comprehension, LOAD_GLOBAL will still be used where it is referenced outside the comprehension, but LOAD_FAST / STORE_FAST will be used within the comprehension. If it is a cellvar/freevar outside the comprehension, the LOAD_FAST_AND_CLEAR / STORE_FAST used to save/restore it do not change (there is no LOAD_DEREF_AND_CLEAR), meaning that the entire cell (not just the value within it) is saved/restored, so the comprehension does not write to the outer cell.
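
For instance, when the iteration variable name is a global in the outer scope, the comprehension still leaves the global untouched, matching the observable behavior of the nested-function compilation; a small sketch:

x = "global"

def f(lst):
    before = x                     # refers to the global x
    result = [x + 1 for x in lst]  # this x is an isolated fast local
    return before, x, result       # the global x is unchanged

print(f([1, 2]))                   # ('global', 'global', [2, 3])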

Comprehensions occurring in module or class scope are also inlined. In this case, the comprehension will introduce usage of fast-locals (LOAD_FAST / STORE_FAST) for the comprehension iteration variable within the comprehension only, in a scope where otherwise only LOAD_NAME / STORE_NAME would be used, maintaining isolation.
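
For example, a comprehension in class scope leaves a class attribute of the same name untouched, just as before; a brief sketch:

class C:
    x = "attribute"
    squares = [x * x for x in range(3)]   # this x is isolated

print(C.x)        # 'attribute'
print(C.squares)  # [0, 1, 4]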

In effect, comprehensions introduce a sub-scope where local variables are fully isolated, but without the performance cost or stack frame entry of a call.

Generator expressions are currently not inlined in the reference implementation of this PEP. In the future, some generator expressions may be inlined, where the returned generator object does not leak.

Asynchronous comprehensions are inlined the same as synchronous ones; no special handling is needed.
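
For completeness, an asynchronous comprehension looks and behaves the same to user code either way; a small sketch:

import asyncio

async def agen():
    for i in range(3):
        yield i

async def main():
    # Compiled with the same inlining strategy as the synchronous form.
    return [x * x async for x in agen()]

print(asyncio.run(main()))   # [0, 1, 4]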

Backwards Compatibility

Comprehension inlining will cause the following visible behavior changes. No changes in the standard library or test suite were necessary to adapt to these changes in the implementation, suggesting the impact in user code is likely to be minimal.

Specialized tools depending on undocumented details of compiler bytecode output may of course be affected in ways beyond the below, but these tools already must adapt to bytecode changes in each Python version.

locals() includes outer variables

Calling locals() within a comprehension will include all locals of the function containing the comprehension. E.g. given the following function:

def f(lst):
    return [locals() for x in lst]

Calling f([1]) in current Python will return:

[{'.0': <list_iterator object at 0x7f8d37170460>, 'x': 1}]

where .0 is an internal implementation detail: the synthetic sole argument to the comprehension “function”.

Under this PEP, it will instead return:

[{'lst': [1], 'x': 1}]

This now includes the outer lst variable as a local, and eliminates the synthetic .0.

No comprehension frame in tracebacks

Under this PEP, a comprehension will no longer have its own dedicated frame in a stack trace. For example, given this function:

def g():
    raise RuntimeError("boom")

def f():
    return [g() for x in [1]]

Currently, calling f() results in the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in f
  File "<stdin>", line 5, in <listcomp>
  File "<stdin>", line 2, in g
RuntimeError: boom

Note the dedicated frame for <listcomp>.

Under this PEP, the traceback looks like this instead:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in f
  File "<stdin>", line 2, in g
RuntimeError: boom

There is no longer an extra frame for the list comprehension. The frame for the f function has the correct line number for the comprehension, however, so this simply makes the traceback more compact without losing any useful information.

It is theoretically possible that code using warnings with the stacklevel argument could observe a behavior change due to the frame stack change. In practice, however, this seems unlikely. It would require a warning raised in library code that is always called through a comprehension in that same library, where the warning is using a stacklevel of 3+ to bypass the comprehension and its containing function and point to a calling frame outside the library. In such a scenario it would usually be simpler and more reliable to raise the warning closer to the calling code and bypass fewer frames.
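
A hypothetical sketch of that pattern (the helper names below are invented for illustration): a library that warns from a helper which is always invoked inside one of its own comprehensions, using stacklevel to skip both frames, would see the warning attributed one frame further out once the comprehension frame disappears.

import warnings

def _check(value):
    if value < 0:
        # stacklevel=3 was tuned to skip _check and the <listcomp> frame,
        # pointing the warning at the library's caller; with the comprehension
        # frame gone, the same stacklevel points one frame further out.
        warnings.warn("negative value", stacklevel=3)
    return value

def normalize(values):
    return [_check(v) for v in values]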

Tracing/profiling will no longer show a call/return for the comprehension

Naturally, since list/dict/set comprehensions will no longer be implemented as a call to a nested function, tracing/profiling using sys.settrace or sys.setprofile will also no longer reflect that a call and return have occurred.
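
A minimal sketch of how this could be observed with sys.settrace (the exact event sequence is version-dependent): the trace function below records the name of each code object entered while f runs.

import sys

calls = []

def tracer(frame, event, arg):
    if event == "call":
        calls.append(frame.f_code.co_name)
    return None   # no per-line tracing needed for this sketch

def f(lst):
    return [x for x in lst]

sys.settrace(tracer)
f([1, 2, 3])
sys.settrace(None)

# With nested-function comprehensions: ['f', '<listcomp>']
# With inlined comprehensions:         ['f']
print(calls)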

Impact on other Python implementations

Per comments from representatives of GraalPython and PyPy, they would likely feel the need to adapt to the observable behavior changes here, given the likelihood that someone, at some point, will depend on them. Thus, all else equal, fewer observable changes would be less work. But these changes (at least in the case of GraalPython) should be manageable “without much headache”.

How to Teach This

It is not intuitively obvious that comprehension syntax will or should result in creation and call of a nested function. For new users not already accustomed to the prior behavior, I suspect the new behavior in this PEP will be more intuitive and require less explanation. (“Why is there a <listcomp> line in my traceback when I didn’t define any such function? What is this .0 variable I see in locals()?”)

Security Implications

None known.

Reference Implementation

This PEP has a reference implementation in the form of a PR against the CPython main branch which passes all tests.

The reference implementation performs the micro-benchmark ./python -m pyperf timeit -s 'l = [1]' '[x for x in l]' 1.96x faster than the main branch (in a build compiled with --enable-optimizations).

The reference implementation performs the comprehensions benchmark in the pyperformance benchmark suite (which is not a micro-benchmark of comprehensions alone, but tests real-world-derived code doing realistic work using comprehensions) 11% faster than the main branch (again in optimized builds). Other benchmarks in pyperformance (none of which use comprehensions heavily) don’t show any impact outside the noise.

The implementation has no impact on non-comprehension code.

Rejected Ideas

More efficient comprehension calling, without inlining

An alternate approach introduces a new opcode for “calling” a comprehension in streamlined fashion without the need to create a throwaway function object, but still creating a new Python frame. This avoids all of the visible effects listed under Backwards Compatibility, and provides roughly half of the performance benefit (1.5x improvement on the microbenchmark, 4% improvement on the comprehensions benchmark in pyperformance). It also requires adding a new pointer to the _PyInterpreterFrame struct and a new Py_INCREF on each frame construction, meaning (unlike this PEP) it has a (very small) performance cost for all code. It also provides less scope for future optimizations.

This PEP takes the position that full inlining offers sufficient additional performance to more than justify the behavior changes.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.


Source: https://github.com/python/peps/blob/main/peps/pep-0709.rst

Last modified: 2023-12-15 15:06:12 GMT

