Important
This PEP is a historical document. The up-to-date, canonical documentation can now be found at the Full Grammar specification.
See PEP 1 for how to propose changes.
This PEP proposes replacing the current LL(1)-based parser of CPython with a new PEG-based parser. This new parser would allow the elimination of multiple “hacks” that exist in the current grammar to circumvent the LL(1) limitation. It would substantially reduce the maintenance costs in some areas related to the compiling pipeline such as the grammar, the parser and the AST generation. The new PEG parser will also lift the LL(1) restriction on the current Python grammar.
The current Python grammar is an LL(1)-based grammar. A grammar can be said to be LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a top-down parser that parses the input from left to right, performing leftmost derivation of the sentence, with just one token of lookahead. The traditional approach to constructing or generating an LL(1) parser is to produce a parse table which encodes the possible transitions between all possible states of the parser. These tables are normally constructed from the first sets and the follow sets of the grammar. For instance, given the rule:
rule: A | B
if only A can start with the terminal a and only B can start with the terminal b and the parser sees the token b when parsing this rule, it knows that it needs to follow the non-terminal B. Similarly, given the rule:
rule: A 'b'
if the parser has the token b and the non-terminal A can only start with the token a, then the parser can tell that this is an invalid program. But if A could expand to the empty string (called an ε-production), then the parser would recognise a valid empty A, since the next token b is in the follow set of A.
The current Python grammar does not contain ε-productions, so the follow sets are not needed when creating the parse tables. Currently, in CPython, a parser generator program reads the grammar and produces a parsing table representing a set of deterministic finite automata (DFA) that can be included in a C program, the parser. The parser is a pushdown automaton that uses this data to produce a Concrete Syntax Tree (CST) sometimes known directly as a “parse tree”. In this process, the first sets are used indirectly when generating the DFAs.
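To make this concrete, the following toy sketch (hypothetical, and much simpler than CPython's actual table-driven machinery) shows how a parse table built from the first sets of the rule above lets a single token of lookahead select the production:

# Toy LL(1) parse table for:  rule: A | B    A: 'a'    B: 'b'
# Keys are (non-terminal, lookahead token); values are the production to expand.
parse_table = {
    ("rule", "a"): ["A"],
    ("rule", "b"): ["B"],
    ("A", "a"): ["'a'"],
    ("B", "b"): ["'b'"],
}

def predict(nonterminal, lookahead):
    # One token of lookahead is enough to pick the production (or reject).
    try:
        return parse_table[(nonterminal, lookahead)]
    except KeyError:
        raise SyntaxError(f"unexpected token {lookahead!r} while parsing {nonterminal}")

print(predict("rule", "b"))  # ['B']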
LL(1) parsers and grammars are usually efficient and simple to implement and generate. However, it is not possible, under the LL(1) restriction, to express certain common constructs in a way natural to the language designer and the reader. This includes some constructs in the Python language.
As LL(1) parsers can only look one token ahead to distinguish possibilities, some rules in the grammar may be ambiguous. For instance, the rule:
rule: A | B
is ambiguous if the first sets of both A and B have some elements in common. When the parser sees a token in the input program that both A and B can start with, it is impossible for it to deduce which option to expand, as no further token of the program can be examined to disambiguate. The rule may be transformed to equivalent LL(1) rules, but then it may be harder for a human reader to grasp its meaning. Examples later in this document show that the current LL(1)-based grammar suffers a lot from this scenario.
Another broad class of rules precluded by LL(1) is left-recursive rules. A rule is left-recursive if it can derive to a sentential form with itself as the leftmost symbol. For instance this rule:
rule: rule 'a'
is left-recursive because the rule can be expanded to an expression that starts with itself. As will be described later, left-recursion is the natural way to express certain desired language properties directly in the grammar.
A PEG (Parsing Expression Grammar) grammar differs from a context-free grammar (like the current one) in that the way it is written more closely reflects how the parser will operate when parsing it. The fundamental technical difference is that the choice operator is ordered. This means that when writing:
rule: A | B | C
a context-free-grammar parser (like an LL(1) parser) will generate constructions that, given an input string, will deduce which alternative (A, B or C) must be expanded, while a PEG parser will check whether the first alternative succeeds and only if it fails will it continue with the second or the third one, in the order in which they are written. This makes the choice operator not commutative.
Unlike LL(1) parsers, PEG-based parsers cannot be ambiguous: if a string parses, it has exactly one valid parse tree. This means that a PEG-based parser cannot suffer from the ambiguity problems described in the previous section.
PEG parsers are usually constructed as a recursive descent parser in which every rule in the grammar corresponds to a function in the program implementing the parser, and the parsing expression (the “expansion” or “definition” of the rule) represents the “code” in said function. Each parsing function conceptually takes an input string as its argument and yields one of the following results: a “success” result, indicating that the expression matched and possibly consuming some of the input, or a “failure” result, in which case no input is consumed.
Notice that “failure” results do not imply that the program is incorrect or a parsing failure, because as the choice operator is ordered, a “failure” result merely indicates “try the following option”. A direct implementation of a PEG parser as a recursive descent parser will present exponential time performance in the worst case as compared with LL(1) parsers, because PEG parsers have infinite lookahead (this means that they can consider an arbitrary number of tokens before deciding for a rule). Usually, PEG parsers avoid this exponential time complexity with a technique called “packrat parsing” [1], which not only loads the entire program in memory before parsing it but also allows the parser to backtrack arbitrarily. This is made efficient by memoizing the rules already matched for each position. The cost of the memoization cache is that the parser will naturally use more memory than a simple LL(1) parser, which is normally table-based. We will explain later in this document why we consider this cost acceptable.
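As an illustration only (this is a minimal sketch, not CPython's implementation), a packrat parser for the toy grammar expr: term '+' expr | term, with term matching a number, can memoize each (rule, position) pair like this:

import re

class PackratParser:
    def __init__(self, text):
        self.text = text
        self.memo = {}  # (rule name, position) -> (node, new position)

    def memoized(self, rule, pos):
        # Each (rule, position) pair is computed at most once, which keeps
        # arbitrary backtracking linear-time at the cost of extra memory.
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(pos)
        return self.memo[key]

    def expr(self, pos):
        # Ordered choice: try `term '+' expr` first, fall back to `term`.
        left, p = self.memoized(self.term, pos)
        if left is not None and self.text[p:p + 1] == "+":
            right, q = self.memoized(self.expr, p + 1)
            if right is not None:
                return ("+", left, right), q
        return self.memoized(self.term, pos)

    def term(self, pos):
        m = re.match(r"\d+", self.text[pos:])
        return (int(m.group()), pos + m.end()) if m else (None, pos)

print(PackratParser("1+2+3").expr(0)[0])  # ('+', 1, ('+', 2, 3))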
In this section, we describe a list of problems that are present in the current parser machinery in CPython that motivate the need for a new parser.
Although the Python grammar is technically an LL(1) grammar (because it is parsed by an LL(1) parser), several rules are not LL(1) and several workarounds are implemented in the grammar and in other parts of CPython to deal with this. For example, consider the rule for assignment expressions:
namedexpr_test: [NAME ':='] test
This simple rule is not compatible with the Python grammar as NAME is among the elements of the first set of the rule test. To work around this limitation, the actual rule that appears in the current grammar is:
namedexpr_test: test [':=' test]
This is a much broader rule than the previous one, allowing constructs like [x for x in y] := [1, 2, 3]. The way the rule is limited to its desired form is by disallowing these unwanted constructions when transforming the parse tree to the abstract syntax tree. This is not only inelegant but a considerable maintenance burden, as it forces the AST creation routines and the compiler into a situation in which they need to know how to separate valid programs from invalid programs, which should be a responsibility solely of the parser. This also leads to the actual grammar file not reflecting correctly what the actual grammar is (that is, the collection of all valid Python programs).
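The following sketch shows the kind of post-hoc check this forces; the helper is hypothetical, standing in for logic that in CPython lives in the AST construction routines:

import ast

def check_namedexpr_target(node):
    # Hypothetical validation: the parser accepted `test [':=' test]`,
    # so invalid assignment-expression targets must be rejected afterwards.
    if not isinstance(node, ast.Name):
        raise SyntaxError("cannot use assignment expressions with this target")

check_namedexpr_target(ast.parse("x", mode="eval").body)  # accepted
# check_namedexpr_target(ast.parse("[x for x in y]", mode="eval").body)  # raises SyntaxError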
Similar workarounds appear in multiple other rules of the current grammar. Sometimes this problem is unsolvable. For instance, bpo-12782 (“Multiple context expressions do not support parentheses for continuation across lines”) shows that making an LL(1) rule that supports writing:
with (open("a_really_long_foo") as foo,
      open("a_really_long_baz") as baz,
      open("a_really_long_bar") as bar):
    ...
is not possible since the first sets of the grammar items that can appear as context managers include the open parenthesis, making the rule ambiguous. This rule is not only consistent with other parts of the language (like the rule for multiple imports), but is also very useful to auto-formatting tools, as parenthesized groups are normally used to group elements to be formatted together (in the same way the tools operate on the contents of lists, sets…).
Another problem of the current parser is that there is a huge coupling between the AST generation routines and the particular shape of the produced parse trees. This makes the code for generating the AST especially complicated as many actions and choices are implicit. For instance, the AST generation code knows what alternatives of a certain rule are produced based on the number of child nodes present in a given parse node. This makes the code difficult to follow, as this property is not directly related to the grammar file and is influenced by implementation details. As a result of this, a considerable amount of the AST generation code needs to deal with inspecting and reasoning about the particular shape of the parse trees that it receives.
As described previously, a limitation of LL(1) grammars is that they cannot allow left-recursion. This makes writing some rules very unnatural and far from how programmers normally think about the program. For instance this construct (a simpler variation of several rules present in the current grammar):
expr: expr '+' term | term
cannot be parsed by an LL(1) parser. The traditional remedy is to rewrite the grammar to circumvent the problem:
expr: term ('+' term)*
The problem that appears with this form is that the parse tree is forced to have a very unnatural shape. This is because with this rule, for the input program a + b + c the parse tree will be flattened (['a', '+', 'b', '+', 'c']) and must be post-processed to construct a left-recursive parse tree ([['a', '+', 'b'], '+', 'c']). Being forced to write the second rule not only leads to the parse tree not correctly reflecting the desired associativity, but also imposes further pressure on later compilation stages to detect and post-process these cases.
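For instance, a later stage has to fold the flattened sequence back into the left-associative shape, along these lines:

from functools import reduce

# Result of parsing "a + b + c" with `expr: term ('+' term)*`:
flat = ['a', '+', 'b', '+', 'c']

# Fold the flat sequence into a left-associative tree after the fact.
tree = reduce(lambda left, right: [left, '+', right], flat[2::2], flat[0])
print(tree)  # [['a', '+', 'b'], '+', 'c']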
The last problem present in the current parser is the intermediate creation of a parse tree or Concrete Syntax Tree that is later transformed to an Abstract Syntax Tree. Although the construction of a CST is very common in parser and compiler pipelines, in CPython this intermediate CST is not used by anything else (it is only indirectly exposed by the parser module, and a surprisingly small part of the code in the CST production is reused in the module). What is worse, the whole tree is kept in memory, keeping many branches that consist of chains of nodes with a single child. This has been shown to consume a considerable amount of memory (for instance in bpo-26415: “Excessive peak memory consumption by the Python parser”).
Having to produce an intermediate result between the grammar and the AST is not only undesirable but also makes the AST generation step much more complicated, considerably raising the maintenance burden.
The new proposed PEG parser contains the following pieces: a parser generator that can read a grammar file and produce a PEG parser written in Python or C that can parse said grammar; a PEG meta-grammar that automatically generates a Python parser that is used for the parser generator itself (this means that there are no manually-written parsers); and a generated parser (in C) of the new Python grammar.
PEG parsers normally do not support left recursion, but we have implemented a technique similar to the one described in Medeiros et al. [2] but using the memoization cache instead of static variables. This approach is closer to the one described in Warth et al. [3]. This allows us to write not only simple left-recursive rules but also more complicated rules that involve indirect left-recursion like:
rule1: rule2 | 'a'
rule2: rule3 | 'b'
rule3: rule1 | 'c'
and “hidden left-recursion” like:
rule: 'optional'? rule '@' some_other_rule
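A minimal sketch of this seed-growing idea, in the spirit of Warth et al. (hypothetical and heavily simplified compared to the real implementation), for the directly left-recursive rule expr: expr '+' NUM | NUM:

import re

class Parser:
    def __init__(self, text):
        self.text = text
        self.memo = {}

    def expr(self, pos):
        key = ("expr", pos)
        if key in self.memo:
            return self.memo[key]
        # Plant a failure "seed" so the left-recursive call fails at first...
        self.memo[key] = (None, pos)
        lastpos = -1
        while True:
            node, endpos = self.expr_body(pos)
            if node is None or endpos <= lastpos:
                break  # the parse stopped growing
            # ...then grow the seed by re-parsing with the improved result.
            self.memo[key] = (node, endpos)
            lastpos = endpos
        return self.memo[key]

    def expr_body(self, pos):
        left, p = self.expr(pos)  # expr '+' NUM
        if left is not None and self.text[p:p + 1] == "+":
            num, q = self.num(p + 1)
            if num is not None:
                return ("+", left, num), q
        return self.num(pos)      # | NUM

    def num(self, pos):
        m = re.match(r"\d+", self.text[pos:])
        return (int(m.group()), pos + m.end()) if m else (None, pos)

print(Parser("1+2+3").expr(0)[0])  # ('+', ('+', 1, 2), 3): left-associative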
The grammar consists of a sequence of rules of the form:
rule_name: expression
Optionally, a type can be included right after the rule name, which specifies the return type of the C or Python function corresponding to the rule:
rule_name[return_type]: expression
If the return type is omitted, then a void* is returned in C and an Any in Python.
# comment
Python-style comments.

e1 e2
Match e1, then match e2.

rule_name: first_rule second_rule

e1 | e2
Match e1 or e2.
The first alternative can also appear on the line after the rule name for formatting purposes. In that case, a | must be used before the first alternative, like so:
rule_name[return_type]:
    | first_alt
    | second_alt
(e)
Match e.

rule_name: (e)
A slightly more complex and useful example includes using the grouping operator together with the repeat operators:
rule_name: (e1 e2)*
[e] or e?
Optionally match e.

rule_name: [e]
A more useful example includes defining that a trailing comma is optional:
rule_name: e (',' e)* [',']
e*
Match zero or more occurrences of e.

rule_name: (e1 e2)*
e+
Match one or more occurrences of e.

rule_name: (e1 e2)+
s.e+
Match one or more occurrences of e, separated by s. The generated parse tree does not include the separator. This is otherwise identical to (e (s e)*).

rule_name: ','.e+
&e
Succeed if e can be parsed, without consuming any input.

!e
Fail if e can be parsed, without consuming any input.
An example taken from the proposed Python grammar specifies that a primary consists of an atom, which is not followed by a '.', a '(' or a '[':
primary: atom !'.' !'(' !'['
~
Commit to the current alternative, even if it fails to parse.

rule_name: '(' ~ some_rule ')' | some_alt
In this example, if a left parenthesis is parsed, then the other alternative won't be considered, even if some_rule or ')' fail to be parsed.
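A runnable sketch of the parsing function a generator might emit for this rule (expect, some_rule and some_alt are hypothetical stand-ins, not the real generated API):

import re

class Toy:
    # Minimal stand-ins for generated helpers (hypothetical, for illustration).
    def __init__(self, text):
        self.text = text

    def expect(self, literal, pos):
        if self.text.startswith(literal, pos):
            return literal, pos + len(literal)
        return None, pos

    def some_rule(self, pos):
        m = re.match(r"\d+", self.text[pos:])
        return (m.group(), pos + m.end()) if m else (None, pos)

    def some_alt(self, pos):
        return self.expect("pass", pos)

    def rule_name(self, pos):
        # rule_name: '(' ~ some_rule ')' | some_alt
        node, p = self.expect("(", pos)
        if node is not None:
            # The cut (~) has been crossed: commit to this alternative.
            inner, p = self.some_rule(p)
            if inner is not None:
                close, p = self.expect(")", p)
                if close is not None:
                    return inner, p
            return None, pos  # fail the whole rule; some_alt is NOT tried
        return self.some_alt(pos)  # only reached when '(' did not match

print(Toy("(42)").rule_name(0))   # ('42', 4)
print(Toy("(oops").rule_name(0))  # (None, 0): committed, so no fallback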
A subexpression can be named by preceding it with an identifier and an = sign. The name can then be used in the action (see below), like this:
rule_name[return_type]: '(' a=some_other_rule ')' { a }
To avoid the intermediate steps that obscure the relationship between the grammar and the AST generation, the proposed PEG parser allows directly generating AST nodes for a rule via grammar actions. Grammar actions are language-specific expressions that are evaluated when a grammar rule is successfully parsed. These expressions can be written in Python or C depending on the desired output of the parser generator. This means that if one would want to generate a parser in Python and another in C, two grammar files should be written, each one with a different set of actions, keeping everything else apart from said actions identical in both files. As an example of a grammar with Python actions, the piece of the parser generator that parses grammar files is bootstrapped from a meta-grammar file with Python actions that generate the grammar tree as a result of the parsing.
In the specific case of the new proposed PEG grammar for Python, having actions allows directly describing how the AST is composed in the grammar itself, making it clearer and more maintainable. This AST generation process is supported by the use of some helper functions that factor out common AST object manipulations and some other required operations that are not directly related to the grammar.
To indicate these actions, each alternative can be followed by the action code inside curly braces, which specifies the return value of the alternative:
rule_name[return_type]:
    | first_alt1 first_alt2 { first_alt1 }
    | second_alt1 second_alt2 { second_alt1 }
If the action is omitted and C code is being generated, then there are two different possibilities: if there is a single name in the alternative, that name gets returned; otherwise, a dummy name object gets returned (this case should be avoided).
If the action is omitted and Python code is being generated, then a list with all the parsed expressions gets returned (this is meant for debugging).
The full meta-grammar for the grammars supported by the PEG generator is:
start[Grammar]: grammar ENDMARKER { grammar }

grammar[Grammar]:
    | metas rules { Grammar(rules, metas) }
    | rules { Grammar(rules, []) }

metas[MetaList]:
    | meta metas { [meta] + metas }
    | meta { [meta] }

meta[MetaTuple]:
    | "@" NAME NEWLINE { (name.string, None) }
    | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) }
    | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) }

rules[RuleList]:
    | rule rules { [rule] + rules }
    | rule { [rule] }

rule[Rule]:
    | rulename ":" alts NEWLINE INDENT more_alts DEDENT {
          Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) }
    | rulename ":" NEWLINE INDENT more_alts DEDENT {
          Rule(rulename[0], rulename[1], more_alts) }
    | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) }

rulename[RuleName]:
    | NAME '[' type=NAME '*' ']' { (name.string, type.string + "*") }
    | NAME '[' type=NAME ']' { (name.string, type.string) }
    | NAME { (name.string, None) }

alts[Rhs]:
    | alt "|" alts { Rhs([alt] + alts.alts) }
    | alt { Rhs([alt]) }

more_alts[Rhs]:
    | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) }
    | "|" alts NEWLINE { Rhs(alts.alts) }

alt[Alt]:
    | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) }
    | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) }
    | items action { Alt(items, action=action) }
    | items { Alt(items, action=None) }

items[NamedItemList]:
    | named_item items { [named_item] + items }
    | named_item { [named_item] }

named_item[NamedItem]:
    | NAME '=' ~ item { NamedItem(name.string, item) }
    | item { NamedItem(None, item) }
    | it=lookahead { NamedItem(None, it) }

lookahead[LookaheadOrCut]:
    | '&' ~ atom { PositiveLookahead(atom) }
    | '!' ~ atom { NegativeLookahead(atom) }
    | '~' { Cut() }

item[Item]:
    | '[' ~ alts ']' { Opt(alts) }
    | atom '?' { Opt(atom) }
    | atom '*' { Repeat0(atom) }
    | atom '+' { Repeat1(atom) }
    | sep=atom '.' node=atom '+' { Gather(sep, node) }
    | atom { atom }

atom[Plain]:
    | '(' ~ alts ')' { Group(alts) }
    | NAME { NameLeaf(name.string) }
    | STRING { StringLeaf(string.string) }

# Mini-grammar for the actions

action[str]: "{" ~ target_atoms "}" { target_atoms }

target_atoms[str]:
    | target_atom target_atoms { target_atom + " " + target_atoms }
    | target_atom { target_atom }

target_atom[str]:
    | "{" ~ target_atoms "}" { "{" + target_atoms + "}" }
    | NAME { name.string }
    | NUMBER { number.string }
    | STRING { string.string }
    | "?" { "?" }
    | ":" { ":" }
As an illustrative example, this simple grammar file allows directly generating a full parser that can parse simple arithmetic expressions and that returns a valid C-based Python AST:
start[mod_ty]: a=expr_stmt* $ { Module(a, NULL, p->arena) }

expr_stmt[stmt_ty]: a=expr NEWLINE { _Py_Expr(a, EXTRA) }

expr[expr_ty]:
    | l=expr '+' r=term { _Py_BinOp(l, Add, r, EXTRA) }
    | l=expr '-' r=term { _Py_BinOp(l, Sub, r, EXTRA) }
    | t=term { t }

term[expr_ty]:
    | l=term '*' r=factor { _Py_BinOp(l, Mult, r, EXTRA) }
    | l=term '/' r=factor { _Py_BinOp(l, Div, r, EXTRA) }
    | f=factor { f }

factor[expr_ty]:
    | '(' e=expr ')' { e }
    | a=atom { a }

atom[expr_ty]:
    | n=NAME { n }
    | n=NUMBER { n }
    | s=STRING { s }
Here EXTRA is a macro that expands to start_lineno, start_col_offset, end_lineno, end_col_offset, p->arena, those being variables automatically injected by the parser; p points to an object that holds on to all state for the parser.
A similar grammar written to target Python AST objects:
start: expr NEWLINE? ENDMARKER { ast.Expression(expr) }

expr:
    | expr '+' term { ast.BinOp(expr, ast.Add(), term) }
    | expr '-' term { ast.BinOp(expr, ast.Sub(), term) }
    | term { term }

term:
    | l=term '*' r=factor { ast.BinOp(l, ast.Mult(), r) }
    | term '/' factor { ast.BinOp(term, ast.Div(), factor) }
    | factor { factor }

factor:
    | '(' expr ')' { expr }
    | atom { atom }

atom:
    | NAME { ast.Name(id=name.string, ctx=ast.Load()) }
    | NUMBER { ast.Constant(value=ast.literal_eval(number.string)) }
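For instance, for the input 2 + 3 * 4 the actions above would assemble roughly the following ast.Expression; this is a hand-built illustration (not output captured from the generated parser), and it can be compiled and evaluated like any other AST:

import ast

# AST the grammar actions above would assemble for "2 + 3 * 4".
tree = ast.Expression(
    ast.BinOp(
        ast.Constant(2),
        ast.Add(),
        ast.BinOp(ast.Constant(3), ast.Mult(), ast.Constant(4)),
    )
)
ast.fix_missing_locations(tree)  # the actions above elide line/column info
print(eval(compile(tree, "<expr>", "eval")))  # 14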
This section describes the migration plan when porting to the new PEG-based parser if this PEP is accepted. The migration will be executed in a series of steps that initially allow falling back to the previous parser if needed:
ast.parse and compile will use the parser set by the flags or the environment variable, and the default parser will be the new PEG-based parser. We have done extensive timing and validation of the new parser, and this gives us confidence that the new parser is of high enough quality to replace the current parser.
To start with validation, we regularly compile the entire Python 3.8 stdlib and compare every aspect of the resulting AST with that produced by the standard compiler. (In the process we found a few bugs in the standard parser's treatment of line and column numbers, which we have all fixed upstream via a series of issues and PRs.)
We have also occasionally compiled a much larger codebase (the approx. 3800 most popular packages on PyPI) and this has helped us find a (very) few additional bugs in the new parser.
(One area we have not explored extensively is rejection of all wrong programs. We have unit tests that check for a certain number of explicit rejections, but more work could be done, e.g. by using a fuzzer that inserts random subtle bugs into existing code. We're open to help in this area.)
We have tuned the performance of the new parser to come within 10% of the current parser both in speed and memory consumption. While the PEG/packrat parsing algorithm inherently consumes more memory than the current LL(1) parser, we have an advantage because we don't construct an intermediate CST.
Below are some benchmarks. These are focused on compiling source code to bytecode, because this is the most realistic situation. Returning an AST to Python code is not as representative, because the process to convert the internal AST (only accessible to C code) to an external AST (an instance of ast.AST) takes more time than the parser itself.
All measurements reported here are done on a recent MacBook Pro, taking the median of three runs. No particular care was taken to stop other applications running on the same machine.
The first timings are for our canonical test file, which has 100,000 lines endlessly repeating the following three lines:
1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))
2*3 + 4*5*6
12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)
Parsing and converting to ast.AST takes 6.34 seconds, with a max RSS of 1029 MiB. For this particular test file, the new parser is faster and uses less memory than the current parser.
We also did timings with a more realistic payload, the entire Python 3.8 stdlib. This payload consists of 1,641 files, 749,570 lines, 27,622,497 bytes. (Though 11 files can't be compiled by any Python 3 parser due to encoding issues, sometimes intentional.)
Comparing the two parsers on this payload we find that the new parser is slightly faster but uses slightly (about 10%) more memory. We believe this is acceptable. (Also, there are probably some more tweaks we can make to reduce memory usage.)
We did not seriously consider alternative ways to implement the new parser, but here's a brief discussion of LALR(1).
Thirty years ago the first author decided to go his own way with Python's parser rather than using LALR(1), which was the industry standard at the time (e.g. Bison and Yacc). The reasons were primarily emotional (gut feelings, intuition), based on past experience using Yacc in other projects, where grammar development took more effort than anticipated (in part due to shift-reduce conflicts). A specific criticism of Bison and Yacc that still holds is that their meta-grammar (the notation used to feed the grammar into the parser generator) does not support EBNF conveniences like [optional_clause] or (repeated_clause)*. Using a custom parser generator, a syntax tree matching the structure of the grammar could be generated automatically, and with EBNF that tree could match the “human-friendly” structure of the grammar.
Other variants of LR were not considered, nor was LL (e.g. ANTLR). PEG was selected because it was easy to understand given a basic understanding of recursive-descent parsing.
[1] Ford, Bryan: Packrat Parsing, https://pdos.csail.mit.edu/~baford/packrat/thesis/
[2] Medeiros et al.: Left Recursion in Parsing Expression Grammars, https://arxiv.org/pdf/1207.0443.pdf
[3] Warth et al.: Packrat Parsers Can Support Left Recursion, http://web.cs.ucla.edu/~todd/research/pub/PEPM08.pdf
[4] Guido's series on PEG parsing, https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60
This document has been placed in the public domain.