Movatterモバイル変換


[0]ホーム

URL:


Following system colour schemeSelected dark colour schemeSelected light colour scheme

Python Enhancement Proposals

PEP 617 – New PEG parser for CPython

Author:
Guido van Rossum <guido at python.org>,Pablo Galindo Salgado <pablogsal at python.org>,Lysandros Nikolaou <lisandrosnik at gmail.com>
Discussions-To:
Python-Dev list
Status:
Final
Type:
Standards Track
Created:
24-Mar-2020
Python-Version:
3.9
Post-History:
02-Apr-2020

Table of Contents

Important

This PEP is a historical document. The up-to-date, canonical documentation can now be found atFull Grammar specification.

×

SeePEP 1 for how to propose changes.

Overview

This PEP proposes replacing the current LL(1)-based parser of CPythonwith a new PEG-based parser. This new parser would allow the elimination of multiple“hacks” that exist in the current grammar to circumvent the LL(1)-limitation.It would substantially reduce the maintenance costs in some areas related to thecompiling pipeline such as the grammar, the parser and the AST generation. The new PEGparser will also lift the LL(1) restriction on the current Python grammar.

Background on LL(1) parsers

The current Python grammar is an LL(1)-based grammar. A grammar can be said to beLL(1) if it can be parsed by an LL(1) parser, which in turn is defined as atop-down parser that parses the input from left to right, performing leftmostderivation of the sentence, with just one token of lookahead.The traditional approach to constructing or generating an LL(1) parser is toproduce aparse table which encodes the possible transitions between all possiblestates of the parser. These tables are normally constructed from thefirst setsand thefollow sets of the grammar:

  • Given a rule, thefirst set is the collection of all terminals that can occurfirst in a full derivation of that rule. Intuitively, this helps the parser decideamong the alternatives in a rule. Forinstance, given the rule:
    rule:A|B

    if onlyA can start with the terminala and onlyB can start with theterminalb and the parser sees the tokenb when parsing this rule, it knowsthat it needs to follow the non-terminalB.

  • An extension to this simple idea is needed when a rule may expand to the empty string.Given a rule, thefollow set is the collection of terminals that can appearimmediately to the right of that rule in a partial derivation. Intuitively, thissolves the problem of the empty alternative. For instance,given this rule:
    rule:A'b'

    if the parser has the tokenb and the non-terminalA can only startwith the tokena, then the parser can tell that this is an invalid program.But ifA could expand to the empty string (called an ε-production),then the parser would recognise a valid emptyA,since the next tokenb is in thefollow set ofA.

The current Python grammar does not contain ε-productions, so thefollow sets are notneeded when creating the parse tables. Currently, in CPython, a parser generatorprogram reads the grammar and produces a parsing table representing a set ofdeterministic finite automata (DFA) that can be included in a C program, theparser. The parser is a pushdown automaton that uses this data to produce a ConcreteSyntax Tree (CST) sometimes known directly as a “parse tree”. In this process, thefirst sets are used indirectly when generating the DFAs.

LL(1) parsers and grammars are usually efficient and simple to implementand generate. However, it is not possible, under the LL(1) restriction,to express certain common constructs in a way natural to the languagedesigner and the reader. This includes some in the Python language.

As LL(1) parsers can only look one token ahead to distinguishpossibilities, some rules in the grammar may be ambiguous. For instance the rule:

rule:A|B

is ambiguous if thefirst sets of bothA andB have some elements incommon. When the parser sees a token in the inputprogram that bothA andB can start with, it is impossible for it to deducewhich option to expand, as no further token of the program can be examined todisambiguate.The rule may be transformed to equivalent LL(1) rules, but then it maybe harder for a human reader to grasp its meaning.Examples later in this document show that the current LL(1)-basedgrammar suffers a lot from this scenario.

Another broad class of rules precluded by LL(1) is left-recursive rules.A rule is left-recursive if it can derive to asentential form with itself as the leftmost symbol. For instance this rule:

rule:rule'a'

is left-recursive because the rule can be expanded to an expression that startswith itself. As will be described later, left-recursion is the natural way toexpress certain desired language properties directly in the grammar.

Background on PEG parsers

A PEG (Parsing Expression Grammar) grammar differs from a context-free grammar(like the current one) in the fact that the way it is written more closelyreflects how the parser will operate when parsing it. The fundamental technicaldifference is that the choice operator is ordered. This means that when writing:

rule:A|B|C

a context-free-grammar parser (like an LL(1) parser) will generate constructionsthat given an input string willdeduce which alternative (A,B orC)must be expanded, while a PEG parser will check if the first alternative succeedsand only if it fails, will it continue with the second or the third one in theorder in which they are written. This makes the choice operator not commutative.

Unlike LL(1) parsers, PEG-based parsers cannot be ambiguous: if a string parses,it has exactly one valid parse tree. This means that a PEG-based parser cannotsuffer from the ambiguity problems described in the previous section.

PEG parsers are usually constructed as a recursive descent parser in which everyrule in the grammar corresponds to a function in the program implementing theparser and the parsing expression (the “expansion” or “definition” of the rule)represents the “code” in said function. Each parsing function conceptually takesan input string as its argument, and yields one of the following results:

  • A “success” result. This result indicates that the expression can be parsed bythat rule and the function may optionally move forward or consume one or morecharacters of the input string supplied to it.
  • A “failure” result, in which case no input is consumed.

Notice that “failure” results do not imply that the program is incorrect or aparsing failure because as the choice operator is ordered, a “failure” resultmerely indicates “try the following option”. A direct implementation of a PEGparser as a recursive descent parser will present exponential time performance inthe worst case as compared with LL(1) parsers, because PEG parsers have infinite lookahead(this means that they can consider an arbitrary number of tokens before decidingfor a rule). Usually, PEG parsers avoid this exponential time complexity with atechnique called “packrat parsing”[1] which not only loads the entireprogram in memory before parsing it but also allows the parser to backtrackarbitrarily. This is made efficient by memoizing the rules already matched foreach position. The cost of the memoization cache is that the parser will naturallyuse more memory than a simple LL(1) parser, which normally are table-based. Wewill explain later in this document why we consider this cost acceptable.

Rationale

In this section, we describe a list of problems that are present in the current parsermachinery in CPython that motivates the need for a new parser.

Some rules are not actually LL(1)

Although the Python grammar is technically an LL(1) grammar (because it is parsed byan LL(1) parser) several rules are not LL(1) and several workarounds areimplemented in the grammar and in other parts of CPython to deal with this. Forexample, consider the rule for assignment expressions:

namedexpr_test:[NAME ':=']test

This simple rule is not compatible with the Python grammar asNAME is among theelements of thefirst set of the ruletest. To work around this limitation theactual rule that appears in the current grammar is:

namedexpr_test:test[':=' test]

Which is a much broader rule than the previous one allowing constructs like[xforxiny]:=[1,2,3]. The way the rule is limited to its desired form is bydisallowing these unwanted constructions when transforming the parse tree to theabstract syntax tree. This is not only inelegant but a considerable maintenanceburden as it forces the AST creation routines and the compiler into a situation inwhich they need to know how to separate valid programs from invalid programs,which should be a responsibility solely of the parser. This also leads to theactual grammar file not reflecting correctly what theactual grammar is (thatis, the collection of all valid Python programs).

Similar workarounds appear in multiple other rules of the current grammar.Sometimes this problem is unsolvable. For instance,bpo-12782: Multiple contextexpressions do not support parentheses for continuation across lines shows how making an LL(1) rule that supportswriting:

with(open("a_really_long_foo")asfoo,open("a_really_long_baz")asbaz,open("a_really_long_bar")asbar):...

is not possible since the first sets of the grammar items that canappear as context managers include the open parenthesis, making the ruleambiguous. This rule is not only consistent with other parts of the language (likethe rule for multiple imports), but is also very useful to auto-formatting tools,as parenthesized groups are normally used to group elements to beformatted together (in the same way the tools operate on the contents of lists,sets…).

Complicated AST parsing

Another problem of the current parser is that there is a huge coupling between theAST generation routines and the particular shape of the produced parse trees. Thismakes the code for generating the AST especially complicated as many actions andchoices are implicit. For instance, the AST generation code knows whatalternatives of a certain rule are produced based on the number of child nodespresent in a given parse node. This makes the code difficult to follow as thisproperty is not directly related to the grammar file and is influenced byimplementation details. As a result of this, a considerable amount of the ASTgeneration code needs to deal with inspecting and reasoning about the particularshape of the parse trees that it receives.

Lack of left recursion

As described previously, a limitation of LL(1) grammars is that they cannot allowleft-recursion. This makes writing some rules very unnatural and far from howprogrammers normally think about the program. For instance this construct (a simplervariation of several rules present in the current grammar):

expr:expr'+'term|term

cannot be parsed by an LL(1) parser. The traditional remedy is to rewrite thegrammar to circumvent the problem:

expr:term('+'term)*

The problem that appears with this form is that the parse tree is forced to have avery unnatural shape. This is because with this rule, for the input programa+b+c the parse tree will be flattened (['a','+','b','+','c']) and mustbe post-processed to construct a left-recursive parse tree ([['a','+','b'],'+','c']). Being forced to write the second rule not only leads to the parsetree not correctly reflecting the desired associativity, but also imposes furtherpressure on later compilation stages to detect and post-process these cases.

Intermediate parse tree

The last problem present in the current parser is the intermediate creation of aparse tree or Concrete Syntax Tree that is later transformed to an Abstract SyntaxTree. Although the construction of a CST is very common in parser and compilerpipelines, in CPython this intermediate CST is not used by anything else (it isonly indirectly exposed by theparser module and a surprisingly small part ofthe code in the CST production is reused in the module). Which is worse: the wholetree is kept in memory, keeping many branches that consist of chains of nodes witha single child. This has been shown to consume a considerable amount of memory (forinstance inbpo-26415: Excessive peak memory consumption by the Pythonparser).

Having to produce an intermediate result between the grammar and the AST is not onlyundesirable but also makes the AST generation step much more complicated, raisingconsiderably the maintenance burden.

The new proposed PEG parser

The new proposed PEG parser contains the following pieces:

  • A parser generator that can read a grammar file and produce a PEG parserwritten in Python or C that can parse said grammar.
  • A PEG meta-grammar that automatically generates a Python parser that is usedfor the parser generator itself (this means that there are no manually-writtenparsers).
  • A generated parser (using the parser generator) that can directly produce C andPython AST objects.

Left recursion

PEG parsers normally do not support left recursion but we have implemented atechnique similar to the one described in Medeiros et al.[2] but using thememoization cache instead of static variables. This approach is closer to the onedescribed in Warth et al.[3]. This allows us to write not only simple left-recursiverules but also more complicated rules that involve indirect left-recursion like:

rule1:rule2|'a'rule2:rule3|'b'rule3:rule1|'c'

and “hidden left-recursion” like:

rule:'optional'?rule'@'some_other_rule

Syntax

The grammar consists of a sequence of rules of the form:

rule_name:expression

Optionally, a type can be included right after the rule name, whichspecifies the return type of the C or Python function corresponding tothe rule:

rule_name[return_type]:expression

If the return type is omitted, then avoid* is returned in C and anAny in Python.

Grammar Expressions

#comment

Python-style comments.

e1e2

Match e1, then match e2.

rule_name:first_rulesecond_rule
e1|e2

Match e1 or e2.

The first alternative can also appear on the line after the rule namefor formatting purposes. In that case, a | must be used before thefirst alternative, like so:

rule_name[return_type]:|first_alt|second_alt
(e)

Match e.

rule_name:(e)

A slightly more complex and useful example includes using the groupingoperator together with the repeat operators:

rule_name:(e1e2)*
[e]ore?

Optionally match e.

rule_name:[e]

A more useful example includes defining that a trailing comma isoptional:

rule_name:e(','e)*[',']
e*

Match zero or more occurrences of e.

rule_name:(e1e2)*
e+

Match one or more occurrences of e.

rule_name:(e1e2)+
s.e+

Match one or more occurrences of e, separated by s. The generated parsetree does not include the separator. This is otherwise identical to(e(se)*).

rule_name:','.e+
&e

Succeed if e can be parsed, without consuming any input.

!e

Fail if e can be parsed, without consuming any input.

An example taken from the proposed Python grammar specifies that a primaryconsists of an atom, which is not followed by a. or a( or a[:

primary:atom!'.'!'('!'['
~

Commit to the current alternative, even if it fails to parse.

rule_name:'('~some_rule')'|some_alt

In this example, if a left parenthesis is parsed, then the otheralternative won’t be considered, even if some_rule or ‘)’ fail to beparsed.

Variables in the Grammar

A subexpression can be named by preceding it with an identifier and an= sign. The name can then be used in the action (see below), like this:

rule_name[return_type]:'('a=some_other_rule')'{a}

Grammar actions

To avoid the intermediate steps that obscure the relationship between thegrammar and the AST generation the proposed PEG parser allows directlygenerating AST nodes for a rule via grammar actions. Grammar actions arelanguage-specific expressions that are evaluated when a grammar rule issuccessfully parsed. These expressions can be written in Python or Cdepending on the desired output of the parser generator. This means that ifone would want to generate a parser in Python and another in C, two grammarfiles should be written, each one with a different set of actions, keepingeverything else apart from said actions identical in both files. As anexample of a grammar with Python actions, the piece of the parser generatorthat parses grammar files is bootstrapped from a meta-grammar file withPython actions that generate the grammar tree as a result of the parsing.

In the specific case of the new proposed PEG grammar for Python, havingactions allows directly describing how the AST is composed in the grammaritself, making it more clear and maintainable. This AST generation process issupported by the use of some helper functions that factor out common ASTobject manipulations and some other required operations that are not directlyrelated to the grammar.

To indicate these actions each alternative can be followed by the action codeinside curly-braces, which specifies the return value of the alternative:

rule_name[return_type]:|first_alt1first_alt2{first_alt1}|second_alt1second_alt2{second_alt1}

If the action is omitted and C code is being generated, then there are twodifferent possibilities:

  1. If there’s a single name in the alternative, this gets returned.
  2. If not, a dummy name object gets returned (this case should be avoided).

If the action is omitted and Python code is being generated, then a listwith all the parsed expressions gets returned (this is meant for debugging).

The full meta-grammar for the grammars supported by the PEG generator is:

start[Grammar]:grammarENDMARKER{grammar}grammar[Grammar]:|metasrules{Grammar(rules,metas)}|rules{Grammar(rules,[])}metas[MetaList]:|metametas{[meta]+metas}|meta{[meta]}meta[MetaTuple]:|"@"NAMENEWLINE{(name.string,None)}|"@"a=NAMEb=NAMENEWLINE{(a.string,b.string)}|"@"NAMESTRINGNEWLINE{(name.string,literal_eval(string.string))}rules[RuleList]:|rulerules{[rule]+rules}|rule{[rule]}rule[Rule]:|rulename":"altsNEWLINEINDENTmore_altsDEDENT{Rule(rulename[0],rulename[1],Rhs(alts.alts+more_alts.alts))}|rulename":"NEWLINEINDENTmore_altsDEDENT{Rule(rulename[0],rulename[1],more_alts)}|rulename":"altsNEWLINE{Rule(rulename[0],rulename[1],alts)}rulename[RuleName]:|NAME'['type=NAME'*'']'{(name.string,type.string+"*")}|NAME'['type=NAME']'{(name.string,type.string)}|NAME{(name.string,None)}alts[Rhs]:|alt"|"alts{Rhs([alt]+alts.alts)}|alt{Rhs([alt])}more_alts[Rhs]:|"|"altsNEWLINEmore_alts{Rhs(alts.alts+more_alts.alts)}|"|"altsNEWLINE{Rhs(alts.alts)}alt[Alt]:|items'$'action{Alt(items+[NamedItem(None, NameLeaf('ENDMARKER'))],action=action)}|items'$'{Alt(items+[NamedItem(None, NameLeaf('ENDMARKER'))],action=None)}|itemsaction{Alt(items,action=action)}|items{Alt(items,action=None)}items[NamedItemList]:|named_itemitems{[named_item]+items}|named_item{[named_item]}named_item[NamedItem]:|NAME'='~item{NamedItem(name.string,item)}|item{NamedItem(None,item)}|it=lookahead{NamedItem(None,it)}lookahead[LookaheadOrCut]:|'&'~atom{PositiveLookahead(atom)}|'!'~atom{NegativeLookahead(atom)}|'~'{Cut()}item[Item]:|'['~alts']'{Opt(alts)}|atom'?'{Opt(atom)}|atom'*'{Repeat0(atom)}|atom'+'{Repeat1(atom)}|sep=atom'.'node=atom'+'{Gather(sep,node)}|atom{atom}atom[Plain]:|'('~alts')'{Group(alts)}|NAME{NameLeaf(name.string)}|STRING{StringLeaf(string.string)}# Mini-grammar for the actionsaction[str]:"{"~target_atoms"}"{target_atoms}target_atoms[str]:|target_atomtarget_atoms{target_atom+" "+target_atoms}|target_atom{target_atom}target_atom[str]:|"{"~target_atoms"}"{"{"+target_atoms+"}"}|NAME{name.string}|NUMBER{number.string}|STRING{string.string}|"?"{"?"}|":"{":"}

As an illustrative example this simple grammar file allows directlygenerating a full parser that can parse simple arithmetic expressions and thatreturns a valid C-based Python AST:

start[mod_ty]:a=expr_stmt*${Module(a,NULL,p->arena)}expr_stmt[stmt_ty]:a=exprNEWLINE{_Py_Expr(a,EXTRA)}expr[expr_ty]:|l=expr'+'r=term{_Py_BinOp(l,Add,r,EXTRA)}|l=expr'-'r=term{_Py_BinOp(l,Sub,r,EXTRA)}|t=term{t}term[expr_ty]:|l=term'*'r=factor{_Py_BinOp(l,Mult,r,EXTRA)}|l=term'/'r=factor{_Py_BinOp(l,Div,r,EXTRA)}|f=factor{f}factor[expr_ty]:|'('e=expr')'{e}|a=atom{a}atom[expr_ty]:|n=NAME{n}|n=NUMBER{n}|s=STRING{s}

HereEXTRA is a macro that expands tostart_lineno,start_col_offset,end_lineno,end_col_offset,p->arena, those being variables automaticallyinjected by the parser;p points to an object that holds on to all statefor the parser.

A similar grammar written to target Python AST objects:

start:exprNEWLINE?ENDMARKER{ast.Expression(expr)}expr:|expr'+'term{ast.BinOp(expr,ast.Add(),term)}|expr'-'term{ast.BinOp(expr,ast.Sub(),term)}|term{term}term:|l=term'*'r=factor{ast.BinOp(l,ast.Mult(),r)}|term'/'factor{ast.BinOp(term,ast.Div(),factor)}|factor{factor}factor:|'('expr')'{expr}|atom{atom}atom:|NAME{ast.Name(id=name.string,ctx=ast.Load())}|NUMBER{ast.Constant(value=ast.literal_eval(number.string))}

Migration plan

This section describes the migration plan when porting to the new PEG-based parserif this PEP is accepted. The migration will be executed in a series of steps that allowinitially to fallback to the previous parser if needed:

  1. Starting with Python 3.9 alpha 6, include the new PEG-based parser machinery in CPythonwith a command-line flag and environment variable that allows switching betweenthe new and the old parsers together with explicit APIs that allow invoking thenew and the old parsers independently. At this step, all Python APIs likeast.parseandcompile will use the parser set by the flags or the environment variable andthe default parser will be the new PEG-based parser.
  2. Between Python 3.9 and Python 3.10, the old parser and related code (like the“parser” module) will be kept until a new Python release happens (Python 3.10). Inthe meanwhile and until the old parser is removed,no new Python Grammaraddition will be added that requires the PEG parser. This means that the grammarwill be kept LL(1) until the old parser is removed.
  3. In Python 3.10, remove the old parser, the command-line flag, the environmentvariable and the “parser” module and related code.

Performance and validation

We have done extensive timing and validation of the new parser, andthis gives us confidence that the new parser is of high enough qualityto replace the current parser.

Validation

To start with validation, we regularly compile the entire Python 3.8stdlib and compare every aspect of the resulting AST with thatproduced by the standard compiler. (In the process we found a few bugsin the standard parser’s treatment of line and column numbers, whichwe have all fixed upstream via a series of issues and PRs.)

We have also occasionally compiled a much larger codebase (the approx.3800 most popular packages on PyPI) and this has helped us find a (very)few additional bugs in the new parser.

(One area we have not explored extensively is rejection of all wrongprograms. We have unit tests that check for a certain number ofexplicit rejections, but more work could be done, e.g. by using afuzzer that inserts random subtle bugs into existing code. We’re opento help in this area.)

Performance

We have tuned the performance of the new parser to come within 10% ofthe current parser both in speed and memory consumption. While thePEG/packrat parsing algorithm inherently consumes more memory than thecurrent LL(1) parser, we have an advantage because we don’t constructan intermediate CST.

Below are some benchmarks. These are focused on compiling source codeto bytecode, because this is the most realistic situation. Returningan AST to Python code is not as representative, because the process toconvert theinternal AST (only accessible to C code) to anexternal AST (an instance ofast.AST) takes more time than theparser itself.

All measurements reported here are done on a recent MacBook Pro,taking the median of three runs. No particular care was taken to stopother applications running on the same machine.

The first timings are for our canonical test file, which has 100,000lines endlessly repeating the following three lines:

1+2+4+5+6+7+8+9+10+((((((11*12*13*14*15+16*17+18*19*20))))))2*3+4*5*612+(2*3*4*5+6+7*8)
  • Just parsing and throwing away the internal AST takes 1.16 secondswith a max RSS of 681 MiB.
  • Parsing and converting toast.AST takes 6.34 seconds, max RSS1029 MiB.
  • Parsing and compiling to bytecode takes 1.28 seconds, max RSS 681MiB.
  • With the current parser, parsing and compiling takes 1.44 seconds,max RSS 836 MiB.

For this particular test file, the new parser is faster and uses lessmemory than the current parser (compare the last two bullets).

We also did timings with a more realistic payload, the entire Python3.8 stdlib. This payload consists of 1,641 files, 749,570 lines,27,622,497 bytes. (Though 11 files can’t be compiled by any Python 3parser due to encoding issues, sometimes intentional.)

  • Compiling and throwing away the internal AST took 2.141 seconds.That’s 350,040 lines/sec, or 12,899,367 bytes/sec. The max RSS was74 MiB (the largest file in the stdlib is much smaller than ourcanonical test file).
  • Compiling to bytecode took 3.290 seconds. That’s 227,861 lines/sec,or 8,396,942 bytes/sec. Max RSS 77 MiB.
  • Compiling to bytecode using the current parser took 3.367 seconds.That’s 222,620 lines/sec, or 8,203,780 bytes/sec. Max RSS 70 MiB.

Comparing the last two bullets we find that the new parser is slightlyfaster but uses slightly (about 10%) more memory. We believe this isacceptable. (Also, there are probably some more tweaks we can make toreduce memory usage.)

Rejected Alternatives

We did not seriously consider alternative ways to implement the newparser, but here’s a brief discussion of LALR(1).

Thirty years ago the first author decided to go his own way withPython’s parser rather than using LALR(1), which was the industrystandard at the time (e.g. Bison and Yacc). The reasons wereprimarily emotional (gut feelings, intuition), based on past experienceusing Yacc in other projects, where grammar development took moreeffort than anticipated (in part due to shift-reduce conflicts). Aspecific criticism of Bison and Yacc that still holds is that theirmeta-grammar (the notation used to feed the grammar into the parsergenerator) does not support EBNF conveniences like[optional_clause] or(repeated_clause)*. Using a customparser generator, a syntax tree matching the structure of the grammarcould be generated automatically, and with EBNF that tree could matchthe “human-friendly” structure of the grammar.

Other variants of LR were not considered, nor was LL (e.g. ANTLR).PEG was selected because it was easy to understand given a basicunderstanding of recursive-descent parsing.

References

[1]
Ford, Bryanhttps://pdos.csail.mit.edu/~baford/packrat/thesis/
[2]
Medeiros et al.https://arxiv.org/pdf/1207.0443.pdf
[3]
Warth et al.https://web.cs.ucla.edu/~todd/research/pepm08.pdf

[4] Guido’s series on PEG parsinghttps://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60

Copyright

This document has been placed in the public domain.


Source:https://github.com/python/peps/blob/main/peps/pep-0617.rst

Last modified:2025-11-07 04:32:09 GMT


[8]ページ先頭

©2009-2025 Movatter.jp