This PEP introduces generator expressions as a high performance, memory efficient generalization of list comprehensions (PEP 202) and generators (PEP 255).
Experience with list comprehensions has shown their widespread utility throughout Python. However, many of the use cases do not need to have a full list created in memory. Instead, they only need to iterate over the elements one at a time.
For instance, the following summation code will build a full list of squares in memory, iterate over those values, and, when the reference is no longer needed, delete the list:
    sum([x*x for x in range(10)])
Memory is conserved by using a generator expression instead:
    sum(x*x for x in range(10))
Similar benefits are conferred on constructors for container objects:
    s = set(word for line in page for word in line.split())
    d = dict((k, func(k)) for k in keylist)
Generator expressions are especially useful with functions like sum(), min(), and max() that reduce an iterable input to a single value:
    max(len(line) for line in file if line.strip())
Generator expressions also address some examples of functionals coded with lambda:
    reduce(lambda s, a: s + a.myattr, data, 0)
    reduce(lambda s, a: s + a[3], data, 0)
These simplify to:
    sum(a.myattr for a in data)
    sum(a[3] for a in data)
List comprehensions greatly reduced the need for filter() and map(). Likewise, generator expressions are expected to minimize the need for itertools.ifilter() and itertools.imap(). In contrast, the utility of other itertools will be enhanced by generator expressions:
    dotproduct = sum(x*y for x, y in itertools.izip(x_vector, y_vector))
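For comparison, here is an illustrative Py2.x sketch (not from the PEP; the data and lambdas are made up) of the itertools.imap()/itertools.ifilter() forms that generator expressions are expected to displace:

    import itertools

    data = range(10)

    # itertools style ...
    squares = itertools.imap(lambda x: x*x, data)
    evens = itertools.ifilter(lambda x: x % 2 == 0, data)

    # ... and the equivalent generator expressions
    squares = (x*x for x in data)
    evens = (x for x in data if x % 2 == 0)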
Having a syntax similar to list comprehensions also makes it easy to convert existing code into a generator expression when scaling up an application.
Early timings showed that generators had a significant performance advantage over list comprehensions. However, the latter were highly optimized for Py2.4 and now the performance is roughly comparable for small to mid-sized data sets. As the data volumes grow larger, generator expressions tend to perform better because they do not exhaust cache memory and they allow Python to re-use objects between iterations.
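As a rough illustration (not part of the PEP's own measurements), the comparison can be reproduced with the timeit module; the data size and repeat counts below are arbitrary and the results are machine-dependent:

    import timeit

    listcomp = timeit.Timer('sum([x*x for x in range(N)])', 'N = 100000')
    genexp = timeit.Timer('sum(x*x for x in range(N))', 'N = 100000')

    # best of three runs of ten calls each
    print min(listcomp.repeat(3, 10)), min(genexp.repeat(3, 10))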
This PEP is ACCEPTED for Py2.4.
(None of this is exact enough in the eye of a reader from Mars, but I hope the examples convey the intention well enough for a discussion in c.l.py. The Python Reference Manual should contain a 100% exact semantic and syntactic specification.)
    g = (x**2 for x in range(10))
    print g.next()
is equivalent to:
    def __gen(exp):
        for x in exp:
            yield x**2
    g = __gen(iter(range(10)))
    print g.next()
Only the outermost for-expression is evaluated immediately, the other expressions are deferred until the generator is run:
    g = (tgtexp for var1 in exp1 if exp2 for var2 in exp3 if exp4)
is equivalent to:
    def __gen(bound_exp):
        for var1 in bound_exp:
            if exp2:
                for var2 in exp3:
                    if exp4:
                        yield tgtexp
    g = __gen(iter(exp1))
    del __gen
    atom: '(' [testlist] ')'
changes to:
    atom: '(' [testlist_gexp] ')'
where testlist_gexp is almost the same as listmaker, but only allows a single test after 'for' ... 'in':
    testlist_gexp: test ( gen_for | (',' test)* [','] )
This means that you can write:
    sum(x**2 for x in range(10))
but you would have to write:
    reduce(operator.add, (x**2 for x in range(10)))
and also:
    g = (x**2 for x in range(10))
i.e. if a function call has a single positional argument, it can be a generator expression without extra parentheses, but in all other cases you have to parenthesize it.
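To illustrate the rule (this example is not from the PEP), a generator expression passed alongside another argument must carry its own parentheses; omitting them is a syntax error:

    sum((x**2 for x in range(10)), 5)    # OK: genexp explicitly parenthesized
    # sum(x**2 for x in range(10), 5)    # SyntaxError: genexp is not the sole argument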
The exact details were checked in to Grammar/Grammar version 1.49.
For example, the loop variable of a generator expression is not exposed to the enclosing scope:
    x = "hello"
    y = list(x for x in "abc")
    print x    # prints "hello", not "c"
List comprehensions will remain unchanged:

    [x for x in S]      # This is a list comprehension.
    [(x for x in S)]    # This is a list containing one generator expression.
Unfortunately, there is currently a slight syntactic difference. The expression:
    [x for x in 1, 2, 3]
is legal, meaning:
    [x for x in (1, 2, 3)]
But generator expressions will not allow the former version:
    (x for x in 1, 2, 3)
is illegal.
The former list comprehension syntax will become illegal in Python 3.0, and should be deprecated in Python 2.4 and beyond.
List comprehensions also "leak" their loop variable into the surrounding scope. This will also change in Python 3.0, so that the semantic definition of a list comprehension in Python 3.0 will be equivalent to list(<generator expression>). Python 2.4 and beyond should issue a deprecation warning if a list comprehension's loop variable has the same name as a variable used in the immediately surrounding scope.
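For illustration (not from the PEP), the Python 2.x leak looks like this, in contrast to the generator expression example above:

    x = "hello"
    y = [x for x in "abc"]
    print x    # prints "c" -- the list comprehension rebound x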
After much discussion, it was decided that the first (outermost) for-expression should be evaluated immediately and that the remaining expressions be evaluated when the generator is executed.
Asked to summarize the reasoning for binding the first expression, Guido offered [1]:
Consider sum(x for x in foo()). Now suppose there's a bug in foo() that raises an exception, and a bug in sum() that raises an exception before it starts iterating over its argument. Which exception would you expect to see? I'd be surprised if the one in sum() was raised rather the one in foo(), since the call to foo() is part of the argument to sum(), and I expect arguments to be processed before the function is called.

OTOH, in sum(bar(x) for x in foo()), where sum() and foo() are bugfree, but bar() raises an exception, we have no choice but to delay the call to bar() until sum() starts iterating -- that's part of the contract of generators. (They do nothing until their next() method is first called.)
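The difference can be seen directly in an illustrative sketch (not from the PEP; the function names are made up):

    def foo():
        raise RuntimeError("bug in foo")

    def bar(x):
        raise RuntimeError("bug in bar")

    try:
        g = (x for x in foo())           # foo() is called immediately
    except RuntimeError:
        print "foo() raised while the genexp was being defined"

    g = (bar(x) for x in [1, 2, 3])      # bar() has not been called yet
    try:
        sum(g)                           # bar() is first called here
    except RuntimeError:
        print "bar() raised only when sum() began iterating"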
Various use cases were proposed for binding all free variables when the generator is defined. And some proponents felt that the resulting expressions would be easier to understand and debug if bound immediately.
However, Python takes a late binding approach to lambda expressions and has no precedent for automatic, early binding. It was felt that introducing a new paradigm would unnecessarily introduce complexity.
After exploring many possibilities, a consensus emerged that binding issues were hard to understand and that users should be strongly encouraged to use generator expressions inside functions that consume their arguments immediately. For more complex applications, full generator definitions are always superior in terms of being obvious about scope, lifetime, and binding [2].
The utility of generator expressions is greatly enhanced when combined with reduction functions like sum(), min(), and max(). The heapq module in Python 2.4 includes two new reduction functions: nlargest() and nsmallest(). Both work well with generator expressions and keep no more than n items in memory at one time.
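For example (an illustrative sketch, not from the PEP; the random data is arbitrary), a generator expression can feed nlargest() without ever materializing the full input:

    import heapq, random

    scores = (random.random() for i in xrange(1000000))
    top_ten = heapq.nlargest(10, scores)    # heap never holds more than 10 items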
This document has been placed in the public domain.