NotificationsYou must be signed in to change notification settings
Fork298
Star2.4k

Design Notes

Paul McGuire edited this pageDec 12, 2022 ·2 revisions

What is Group for?

When building up a pyparsing parser, the resulting strings are returned as aParseResults object, with the parsed strings in a flat sequence. This is the case evenif you build up the parser from sub-expressions, and add them into a larger expressionusing '+' and '|' operators.

Imagine we are writing a parser for a simplified form of a Pythondef statement. Here isa BNF:

def_stmt ::= 'def' identifier '(' arg_spec, ... ')' -> return_type_spec ':'arg_spec ::= identifierreturn_type_spec ::= identifieridentifier ::= <normal Python identifier>

This translates almost directly topyparsing (working from bottom to top):

import pyparsing as ppppc = pp.common# define punctuation and keywordsLPAR, RPAR, COLON = map(pp.Suppress, "():")ARROW = pp.Suppress("->")DEF = pp.Keyword("def").suppress()identifier = ppc.identifierarg_spec = identifier()return_type_spec = identifier()def_stmt = (DEF + identifier            + LPAR + pp.Opt(pp.delimitedList(arg_spec)) + RPAR            + ARROW + return_type_spec + COLON)

We can test this parser using therun_tests() method:

def_stmt.run_tests("""\    def sin() -> float:    def abs(x) -> float:    def pad(s, min, max) -> str:    """)

And we get our parsed results:

def sin() -> float:['sin', 'float']def abs(x) -> float:['abs', 'x', 'float']def pad(s, min, max) -> str:['pad', 's', 'min', 'max', 'str']

This isokay, we can certainly parcel this out into function name, arguments, and return type.But now imagine we add support for optional types for the arguments as well:

arg_spec ::= identfier [':' type_spec]type_spec ::= identfier

So thatpads definition could be expanded todef pad(s: str, min: int, max:int) -> str:.We accomplish this just be changing the definition of arg_spec to:

type_spec = identifier()arg_spec = identifier() + pp.Opt(COLON + type_spec)

And parsing the type annotated definition of pad now gives:

def pad(s: str, min: int, max: int) -> str:['pad', 's', 'str', 'min', 'int', 'max', 'int', 'str']

SUCCESS! Well, sort of. We can see that we've parsed all the individual parts. But theresults are still just a sequence of the found strings. We could remove the suppressing ofall the punctuation, and get the separating()'s and,'s, but then we will have done littlemore than tokenize the statement, and now we need logic to walk all the tokens.

Also, since the types are optional, we can't reliably walk this list of strings, assumingthat after the function name will come the name/type pairs for each argument - some argumentsmight have omitted the type.

This is whereGroup comes in. If we enclosearg_spec in aGroup, this will define name/typestructures for thedef statement. Changingarg_spec to:

arg_spec = pp.Group(identifier() + pp.Opt(COLON + type_spec))

With this change, the parser gives us aParseResults containing:

['pad', ['s', 'str'], ['min', 'int'], ['max', 'int'], 'str']

Now that we can group the results using Group, we can go further and gather allthe parsed argument specs into a sub-structure, so we don't have to unpack theresults by grabbing leading name and trailing return type, and assuming the rest arethe arguments.

So next we change thedef statement itself to group the optional arguments:

def_stmt = (DEF + identifier            + LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec))) + RPAR            + ARROW + return_type_spec + COLON)

Giving us this structured result:

['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']

If we run this parser against our original list of function definitions, we willsee another benefit:every function always returns exactly 3 elements. Even thosethat have no defined arguments, such as the one forsin, will include an emptylist for the second element.

And lastly, to make it easier to access the individual parts of the results,we'll add results names to the grammar definition:

def_stmt = (DEF + identifier("function_name")            + LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec)))("args") + RPAR            + ARROW + return_type_spec("returns") + COLON)

And we get:

['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']- args: [['s', 'str'], ['min', 'int'], ['max', 'int']]  [0]:    ['s', 'str']  [1]:    ['min', 'int']  [2]:    ['max', 'int']- function_name: 'pad'- returns: 'str'

You can still access the contents of theParseResults as if they were items in a list,but you can also get to them by name:

result = def_stmt.parse_string("def pad(s: str, min: int, max: int) -> str:")print(result[0])             # prints 'pad'print(result.function_name)  # also prints 'pad'

We should do the same thing for the parts of each argument spec:

arg_spec = pp.Group(identifier("name") + pp.Opt(COLON + type_spec("type")))

To get:

['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']- args: [['s', 'str'], ['min', 'int'], ['max', 'int']]  [0]:    ['s', 'str']    - name: 's'    - type: 'str'  [1]:    ['min', 'int']    - name: 'min'    - type: 'int'  [2]:    ['max', 'int']    - name: 'max'    - type: 'int'- function_name: 'pad'- returns: 'str'

So we could then write:

for arg in result.args:    print(arg.name, arg.type)

Group has done one more thing for us. Note that all the grouped argument specs have maintainedtheir separate argument names and types. If we remove theGroup() enclosingarg_spec, this is the kindof result we would get:

['pad', ['s', 'str', 'min', 'int', 'max', 'int'], 'str']- args: ['s', 'str', 'min', 'int', 'max', 'int']  - name: 'max'  - type: 'int'- function_name: 'pad'- returns: 'str'

Saving results names into aParseResults is very similar to assigning key-values to a Pythondict: if duplicated keys are used, only the last assigned value is stored. When the arg_specsare ungrouped, the name and type only have the values for the finalmax argument.So by grouping the argument specs, we maintain the separate names and types for each argumentin the list of arguments.

In summary,Group serves two purposes:

it adds structure
it scopes results names

By usingGroup and results names, your parser will generate results that are much easierfor your post-parsing code to process.

Here are some ideas for other enhancements to this parser:

add an optional section after the return type to list possible exceptions that couldbe raised (similar to "throws" clause in Java method definitions)
expandarg_type to handle multiple types, such asdef abs(x: int|float) -> float:

Here is the complete program for this parser:

import pyparsing as ppppc = pp.common# define punctuation and keywords (they get suppressed because they are only# needed at parse time, they don't contain useful information for post-parse work)LPAR, RPAR, COLON = map(pp.Suppress, "():")ARROW = pp.Suppress("->")DEF = pp.Keyword("def").suppress()# define the sub-expressionsidentifier = ppc.identifiertype_spec = identifier()arg_spec = pp.Group(identifier("name") + pp.Opt(COLON + type_spec("type")))return_type_spec = type_spec()# define the full def statement expressiondef_stmt = (DEF + identifier("function_name")            + LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec)))("args") + RPAR            + ARROW + return_type_spec("returns") + COLON)# try it out!result = def_stmt.parse_string("def pad(s: str, min: int, max: int) -> str:")print(result[0])             # prints 'pad'print(result.function_name)  # also prints 'pad'for arg in result.args:    print(arg.name, arg.type)def_stmt.run_tests("""\    def sin() -> float:    def abs(x) -> float:    def pad(s, min, max) -> str:    def abs(x: float) -> float:    def pad(s: str, min: int, max: int) -> str:    """)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Design Notes

What is Group for?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally