Movatterモバイル変換


[0]ホーム

URL:


Navigation

parser — Access Python parse trees

Theparser module provides an interface to Python’s internal parser andbyte-code compiler. The primary purpose for this interface is to allow Pythoncode to edit the parse tree of a Python expression and create executable codefrom this. This is better than trying to parse and modify an arbitrary Pythoncode fragment as a string because parsing is performed in a manner identical tothe code forming the application. It is also faster.

Note

From Python 2.5 onward, it’s much more convenient to cut in at the AbstractSyntax Tree (AST) generation and compilation stage, using theastmodule.

Theparser module exports the names documented here also with “st”replaced by “ast”; this is a legacy from the time when there was no otherAST and has nothing to do with the AST found in Python 2.5. This is also thereason for the functions’ keyword arguments being calledast, notst.The “ast” functions will be removed in Python 3.0.

There are a few things to note about this module which are important to makinguse of the data structures created. This is not a tutorial on editing the parsetrees for Python code, but some examples of using theparser module arepresented.

Most importantly, a good understanding of the Python grammar processed by theinternal parser is required. For full information on the language syntax, refertoThe Python Language Reference. The parseritself is created from a grammar specification defined in the fileGrammar/Grammar in the standard Python distribution. The parse treesstored in the ST objects created by this module are the actual output from theinternal parser when created by theexpr() orsuite() functions,described below. The ST objects created bysequence2st() faithfullysimulate those structures. Be aware that the values of the sequences which areconsidered “correct” will vary from one version of Python to another as theformal grammar for the language is revised. However, transporting code from onePython version to another as source text will always allow correct parse treesto be created in the target version, with the only restriction being thatmigrating to an older version of the interpreter will not support more recentlanguage constructs. The parse trees are not typically compatible from oneversion to another, whereas source code has always been forward-compatible.

Each element of the sequences returned byst2list() orst2tuple()has a simple form. Sequences representing non-terminal elements in the grammaralways have a length greater than one. The first element is an integer whichidentifies a production in the grammar. These integers are given symbolic namesin the C header fileInclude/graminit.h and the Python modulesymbol. Each additional element of the sequence represents a componentof the production as recognized in the input string: these are always sequenceswhich have the same form as the parent. An important aspect of this structurewhich should be noted is that keywords used to identify the parent node type,such as the keywordif in anif_stmt, are included in thenode tree without any special treatment. For example, theif keywordis represented by the tuple(1,'if'), where1 is the numeric valueassociated with allNAME tokens, including variable and function namesdefined by the user. In an alternate form returned when line number informationis requested, the same token might be represented as(1,'if',12), wherethe12 represents the line number at which the terminal symbol was found.

Terminal elements are represented in much the same way, but without any childelements and the addition of the source text which was identified. The exampleof theif keyword above is representative. The various types ofterminal symbols are defined in the C header fileInclude/token.h andthe Python moduletoken.

The ST objects are not required to support the functionality of this module,but are provided for three purposes: to allow an application to amortize thecost of processing complex parse trees, to provide a parse tree representationwhich conserves memory space when compared to the Python list or tuplerepresentation, and to ease the creation of additional modules in C whichmanipulate parse trees. A simple “wrapper” class may be created in Python tohide the use of ST objects.

Theparser module defines functions for a few distinct purposes. Themost important purposes are to create ST objects and to convert ST objects toother representations such as parse trees and compiled code objects, but thereare also functions which serve to query the type of parse tree represented by anST object.

See also

Modulesymbol
Useful constants representing internal nodes of the parse tree.
Moduletoken
Useful constants representing leaf nodes of the parse tree and functions fortesting node values.

Creating ST Objects

ST objects may be created from source code or from a parse tree. When creatingan ST object from source, different functions are used to create the'eval'and'exec' forms.

parser.expr(source)
Theexpr() function parses the parametersource as if it were an inputtocompile(source,'file.py','eval'). If the parse succeeds, an ST objectis created to hold the internal parse tree representation, otherwise anappropriate exception is thrown.
parser.suite(source)
Thesuite() function parses the parametersource as if it were an inputtocompile(source,'file.py','exec'). If the parse succeeds, an ST objectis created to hold the internal parse tree representation, otherwise anappropriate exception is thrown.
parser.sequence2st(sequence)

This function accepts a parse tree represented as a sequence and builds aninternal representation if possible. If it can validate that the tree conformsto the Python grammar and all nodes are valid node types in the host version ofPython, an ST object is created from the internal representation and returnedto the called. If there is a problem creating the internal representation, orif the tree cannot be validated, aParserError exception is thrown. AnST object created this way should not be assumed to compile correctly; normalexceptions thrown by compilation may still be initiated when the ST object ispassed tocompilest(). This may indicate problems not related to syntax(such as aMemoryError exception), but may also be due to constructs suchas the result of parsingdelf(0), which escapes the Python parser but ischecked by the bytecode compiler.

Sequences representing terminal tokens may be represented as either two-elementlists of the form(1,'name') or as three-element lists of the form(1,'name',56). If the third element is present, it is assumed to be a validline number. The line number may be specified for any subset of the terminalsymbols in the input tree.

parser.tuple2st(sequence)
This is the same function assequence2st(). This entry point ismaintained for backward compatibility.

Converting ST Objects

ST objects, regardless of the input used to create them, may be converted toparse trees represented as list- or tuple- trees, or may be compiled intoexecutable code objects. Parse trees may be extracted with or without linenumbering information.

parser.st2list(ast[,line_info])

This function accepts an ST object from the caller inast and returns aPython list representing the equivalent parse tree. The resulting listrepresentation can be used for inspection or the creation of a new parse tree inlist form. This function does not fail so long as memory is available to buildthe list representation. If the parse tree will only be used for inspection,st2tuple() should be used instead to reduce memory consumption andfragmentation. When the list representation is required, this function issignificantly faster than retrieving a tuple representation and converting thatto nested lists.

Ifline_info is true, line number information will be included for allterminal tokens as a third element of the list representing the token. Notethat the line number provided specifies the line on which the tokenends.This information is omitted if the flag is false or omitted.

parser.st2tuple(ast[,line_info])

This function accepts an ST object from the caller inast and returns aPython tuple representing the equivalent parse tree. Other than returning atuple instead of a list, this function is identical tost2list().

Ifline_info is true, line number information will be included for allterminal tokens as a third element of the list representing the token. Thisinformation is omitted if the flag is false or omitted.

parser.compilest(ast[,filename='<syntax-tree>'])

The Python byte compiler can be invoked on an ST object to produce code objectswhich can be used as part of anexec statement or a call to thebuilt-ineval() function. This function provides the interface to thecompiler, passing the internal parse tree fromast to the parser, using thesource file name specified by thefilename parameter. The default valuesupplied forfilename indicates that the source was an ST object.

Compiling an ST object may result in exceptions related to compilation; anexample would be aSyntaxError caused by the parse tree fordelf(0):this statement is considered legal within the formal grammar for Python but isnot a legal language construct. TheSyntaxError raised for thiscondition is actually generated by the Python byte-compiler normally, which iswhy it can be raised at this point by theparser module. Most causes ofcompilation failure can be diagnosed programmatically by inspection of the parsetree.

Queries on ST Objects

Two functions are provided which allow an application to determine if an ST wascreated as an expression or a suite. Neither of these functions can be used todetermine if an ST was created from source code viaexpr() orsuite() or from a parse tree viasequence2st().

parser.isexpr(ast)

Whenast represents an'eval' form, this function returns true, otherwiseit returns false. This is useful, since code objects normally cannot be queriedfor this information using existing built-in functions. Note that the codeobjects created bycompilest() cannot be queried like this either, andare identical to those created by the built-incompile() function.

parser.issuite(ast)
This function mirrorsisexpr() in that it reports whether an ST objectrepresents an'exec' form, commonly known as a “suite.” It is not safe toassume that this function is equivalent tonotisexpr(ast), as additionalsyntactic fragments may be supported in the future.

Exceptions and Error Handling

The parser module defines a single exception, but may also pass other built-inexceptions from other portions of the Python runtime environment. See eachfunction for information about the exceptions it can raise.

exceptionparser.ParserError
Exception raised when a failure occurs within the parser module. This isgenerally produced for validation failures rather than the built inSyntaxError thrown during normal parsing. The exception argument iseither a string describing the reason of the failure or a tuple containing asequence causing the failure from a parse tree passed tosequence2st()and an explanatory string. Calls tosequence2st() need to be able tohandle either type of exception, while calls to other functions in the modulewill only need to be aware of the simple string values.

Note that the functionscompilest(),expr(), andsuite() maythrow exceptions which are normally thrown by the parsing and compilationprocess. These include the built in exceptionsMemoryError,OverflowError,SyntaxError, andSystemError. In thesecases, these exceptions carry all the meaning normally associated with them.Refer to the descriptions of each function for detailed information.

ST Objects

Ordered and equality comparisons are supported between ST objects. Pickling ofST objects (using thepickle module) is also supported.

parser.STType
The type of the objects returned byexpr(),suite() andsequence2st().

ST objects have the following methods:

ST.compile([filename])
Same ascompilest(st,filename).
ST.isexpr()
Same asisexpr(st).
ST.issuite()
Same asissuite(st).
ST.tolist([line_info])
Same asst2list(st,line_info).
ST.totuple([line_info])
Same asst2tuple(st,line_info).

Examples

The parser modules allows operations to be performed on the parse tree of Pythonsource code before thebytecode is generated, and provides for inspection of theparse tree for information gathering purposes. Two examples are presented. Thesimple example demonstrates emulation of thecompile() built-in functionand the complex example shows the use of a parse tree for information discovery.

Emulation ofcompile()

While many useful operations may take place between parsing and bytecodegeneration, the simplest operation is to do nothing. For this purpose, usingtheparser module to produce an intermediate data structure is equivalentto the code

>>>code=compile('a + 5','file.py','eval')>>>a=5>>>eval(code)10

The equivalent operation using theparser module is somewhat longer, andallows the intermediate internal parse tree to be retained as an ST object:

>>>importparser>>>st=parser.expr('a + 5')>>>code=st.compile('file.py')>>>a=5>>>eval(code)10

An application which needs both ST and code objects can package this code intoreadily available functions:

importparserdefload_suite(source_string):st=parser.suite(source_string)returnst,st.compile()defload_expression(source_string):st=parser.expr(source_string)returnst,st.compile()

Information Discovery

Some applications benefit from direct access to the parse tree. The remainderof this section demonstrates how the parse tree provides access to moduledocumentation defined in docstrings without requiring that the code beingexamined be loaded into a running interpreter viaimport. This canbe very useful for performing analyses of untrusted code.

Generally, the example will demonstrate how the parse tree may be traversed todistill interesting information. Two functions and a set of classes aredeveloped which provide programmatic access to high level function and classdefinitions provided by a module. The classes extract information from theparse tree and provide access to the information at a useful semantic level, onefunction provides a simple low-level pattern matching capability, and the otherfunction defines a high-level interface to the classes by handling fileoperations on behalf of the caller. All source files mentioned here which arenot part of the Python installation are located in theDemo/parser/directory of the distribution.

The dynamic nature of Python allows the programmer a great deal of flexibility,but most modules need only a limited measure of this when defining classes,functions, and methods. In this example, the only definitions that will beconsidered are those which are defined in the top level of their context, e.g.,a function defined by adef statement at column zero of a module, butnot a function defined within a branch of anif ...elseconstruct, though there are some good reasons for doing so in some situations.Nesting of definitions will be handled by the code developed in the example.

To construct the upper-level extraction methods, we need to know what the parsetree structure looks like and how much of it we actually need to be concernedabout. Python uses a moderately deep parse tree so there are a large number ofintermediate nodes. It is important to read and understand the formal grammarused by Python. This is specified in the fileGrammar/Grammar in thedistribution. Consider the simplest case of interest when searching fordocstrings: a module consisting of a docstring and nothing else. (See filedocstring.py.)

"""Some documentation."""

Using the interpreter to take a look at the parse tree, we find a bewilderingmass of numbers and parentheses, with the documentation buried deep in nestedtuples.

>>>importparser>>>importpprint>>>st=parser.suite(open('docstring.py').read())>>>tup=st.totuple()>>>pprint.pprint(tup)(257, (264,  (265,   (266,    (267,     (307,      (287,       (288,        (289,         (290,          (292,           (293,            (294,             (295,              (296,               (297,                (298,                 (299,                  (300, (3, '"""Some documentation.\n"""'))))))))))))))))),   (4, ''))), (4, ''), (0, ''))

The numbers at the first element of each node in the tree are the node types;they map directly to terminal and non-terminal symbols in the grammar.Unfortunately, they are represented as integers in the internal representation,and the Python structures generated do not change that. However, thesymbol andtoken modules provide symbolic names for the node typesand dictionaries which map from the integers to the symbolic names for the nodetypes.

In the output presented above, the outermost tuple contains four elements: theinteger257 and three additional tuples. Node type257 has the symbolicnamefile_input. Each of these inner tuples contains an integer as thefirst element; these integers,264,4, and0, represent the nodetypesstmt,NEWLINE, andENDMARKER, respectively.Note that these values may change depending on the version of Python you areusing; consultsymbol.py andtoken.py for details of themapping. It should be fairly clear that the outermost node is related primarilyto the input source rather than the contents of the file, and may be disregardedfor the moment. Thestmt node is much more interesting. Inparticular, all docstrings are found in subtrees which are formed exactly asthis node is formed, with the only difference being the string itself. Theassociation between the docstring in a similar tree and the defined entity(class, function, or module) which it describes is given by the position of thedocstring subtree within the tree defining the described structure.

By replacing the actual docstring with something to signify a variable componentof the tree, we allow a simple pattern matching approach to check any givensubtree for equivalence to the general pattern for docstrings. Since theexample demonstrates information extraction, we can safely require that the treebe in tuple form rather than list form, allowing a simple variablerepresentation to be['variable_name']. A simple recursive function canimplement the pattern matching, returning a Boolean and a dictionary of variablename to value mappings. (See fileexample.py.)

fromtypesimportListType,TupleTypedefmatch(pattern,data,vars=None):ifvarsisNone:vars={}iftype(pattern)isListType:vars[pattern[0]]=datareturn1,varsiftype(pattern)isnotTupleType:return(pattern==data),varsiflen(data)!=len(pattern):return0,varsforpattern,datainmap(None,pattern,data):same,vars=match(pattern,data,vars)ifnotsame:breakreturnsame,vars

Using this simple representation for syntactic variables and the symbolic nodetypes, the pattern for the candidate docstring subtrees becomes fairly readable.(See fileexample.py.)

importsymbolimporttokenDOCSTRING_STMT_PATTERN=(symbol.stmt,(symbol.simple_stmt,(symbol.small_stmt,(symbol.expr_stmt,(symbol.testlist,(symbol.test,(symbol.and_test,(symbol.not_test,(symbol.comparison,(symbol.expr,(symbol.xor_expr,(symbol.and_expr,(symbol.shift_expr,(symbol.arith_expr,(symbol.term,(symbol.factor,(symbol.power,(symbol.atom,(token.STRING,['docstring']))))))))))))))))),(token.NEWLINE,'')))

Using thematch() function with this pattern, extracting the moduledocstring from the parse tree created previously is easy:

>>>found,vars=match(DOCSTRING_STMT_PATTERN,tup[1])>>>found1>>>vars{'docstring': '"""Some documentation.\n"""'}

Once specific data can be extracted from a location where it is expected, thequestion of where information can be expected needs to be answered. Whendealing with docstrings, the answer is fairly simple: the docstring is the firststmt node in a code block (file_input orsuite nodetypes). A module consists of a singlefile_input node, and class andfunction definitions each contain exactly onesuite node. Classes andfunctions are readily identified as subtrees of code block nodes which startwith(stmt,(compound_stmt,(classdef,... or(stmt,(compound_stmt,(funcdef,.... Note that these subtrees cannot be matched bymatch()since it does not support multiple sibling nodes to match without regard tonumber. A more elaborate matching function could be used to overcome thislimitation, but this is sufficient for the example.

Given the ability to determine whether a statement might be a docstring andextract the actual string from the statement, some work needs to be performed towalk the parse tree for an entire module and extract information about the namesdefined in each context of the module and associate any docstrings with thenames. The code to perform this work is not complicated, but bears someexplanation.

The public interface to the classes is straightforward and should probably besomewhat more flexible. Each “major” block of the module is described by anobject providing several methods for inquiry and a constructor which accepts atleast the subtree of the complete parse tree which it represents. TheModuleInfo constructor accepts an optionalname parameter since itcannot otherwise determine the name of the module.

The public classes includeClassInfo,FunctionInfo, andModuleInfo. All objects provide the methodsget_name(),get_docstring(),get_class_names(), andget_class_info(). TheClassInfo objects supportget_method_names() andget_method_info() while the other classes provideget_function_names() andget_function_info().

Within each of the forms of code block that the public classes represent, mostof the required information is in the same form and is accessed in the same way,with classes having the distinction that functions defined at the top level arereferred to as “methods.” Since the difference in nomenclature reflects a realsemantic distinction from functions defined outside of a class, theimplementation needs to maintain the distinction. Hence, most of thefunctionality of the public classes can be implemented in a common base class,SuiteInfoBase, with the accessors for function and method informationprovided elsewhere. Note that there is only one class which represents functionand method information; this parallels the use of thedef statementto define both types of elements.

Most of the accessor functions are declared inSuiteInfoBase and do notneed to be overridden by subclasses. More importantly, the extraction of mostinformation from a parse tree is handled through a method called by theSuiteInfoBase constructor. The example code for most of the classes isclear when read alongside the formal grammar, but the method which recursivelycreates new information objects requires further examination. Here is therelevant part of theSuiteInfoBase definition fromexample.py:

classSuiteInfoBase:_docstring=''_name=''def__init__(self,tree=None):self._class_info={}self._function_info={}iftree:self._extract_info(tree)def_extract_info(self,tree):# extract docstringiflen(tree)==2:found,vars=match(DOCSTRING_STMT_PATTERN[1],tree[1])else:found,vars=match(DOCSTRING_STMT_PATTERN,tree[3])iffound:self._docstring=eval(vars['docstring'])# discover inner definitionsfornodeintree[1:]:found,vars=match(COMPOUND_STMT_PATTERN,node)iffound:cstmt=vars['compound']ifcstmt[0]==symbol.funcdef:name=cstmt[2][1]self._function_info[name]=FunctionInfo(cstmt)elifcstmt[0]==symbol.classdef:name=cstmt[2][1]self._class_info[name]=ClassInfo(cstmt)

After initializing some internal state, the constructor calls the_extract_info() method. This method performs the bulk of the informationextraction which takes place in the entire example. The extraction has twodistinct phases: the location of the docstring for the parse tree passed in, andthe discovery of additional definitions within the code block represented by theparse tree.

The initialif test determines whether the nested suite is of the“short form” or the “long form.” The short form is used when the code block ison the same line as the definition of the code block, as in

defsquare(x):"Square an argument.";returnx**2

while the long form uses an indented block and allows nested definitions:

defmake_power(exp):"Make a function that raises an argument to the exponent `exp'."defraiser(x,y=exp):returnx**yreturnraiser

When the short form is used, the code block may contain a docstring as thefirst, and possibly only,small_stmt element. The extraction of such adocstring is slightly different and requires only a portion of the completepattern used in the more common case. As implemented, the docstring will onlybe found if there is only onesmall_stmt node in thesimple_stmt node. Since most functions and methods which use the shortform do not provide a docstring, this may be considered sufficient. Theextraction of the docstring proceeds using thematch() function asdescribed above, and the value of the docstring is stored as an attribute of theSuiteInfoBase object.

After docstring extraction, a simple definition discovery algorithm operates onthestmt nodes of thesuite node. The special case of theshort form is not tested; since there are nostmt nodes in the shortform, the algorithm will silently skip the singlesimple_stmt node andcorrectly not discover any nested definitions.

Each statement in the code block is categorized as a class definition, functionor method definition, or something else. For the definition statements, thename of the element defined is extracted and a representation object appropriateto the definition is created with the defining subtree passed as an argument tothe constructor. The representation objects are stored in instance variablesand may be retrieved by name using the appropriate accessor methods.

The public classes provide any accessors required which are more specific thanthose provided by theSuiteInfoBase class, but the real extractionalgorithm remains common to all forms of code blocks. A high-level function canbe used to extract the complete set of information from a source file. (Seefileexample.py.)

defget_docs(fileName):importosimportparsersource=open(fileName).read()basename=os.path.basename(os.path.splitext(fileName)[0])st=parser.suite(source)returnModuleInfo(st.totuple(),basename)

This provides an easy-to-use interface to the documentation of a module. Ifinformation is required which is not extracted by the code of this example, thecode may be extended at clearly defined points to provide additionalcapabilities.

Table Of Contents

Previous topic

Python Language Services

Next topic

Abstract Syntax Trees

This Page

Quick search

Navigation

©Copyright 1990-2008, Python Software Foundation. Last updated on Oct 02, 2008. Created usingSphinx 0.5.

[8]ページ先頭

©2009-2026 Movatter.jp