XPath and XSLT with lxml

lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards-compliant way.

The usual setup procedure:

>>> from io import StringIO
>>> from lxml import etree

XPath

lxml.etree supports the simple path syntax of the find, findall and findtext methods on ElementTree and Element, as known from the original ElementTree library (ElementPath). As an lxml-specific extension, these classes also provide an xpath() method that supports expressions in the complete XPath syntax, as well as custom extension functions.

There are also specialized XPath evaluator classes that are more efficient for frequent evaluation: XPath and XPathEvaluator. See the performance comparison to learn when to use which. Their semantics when used on Elements and ElementTrees are the same as for the xpath() method described here.

Note

The .find*() methods are usually faster than the full-blown XPath support. They also support incremental tree processing through the .iterfind() method, whereas XPath always collects all results before returning them. They are therefore recommended over XPath for both speed and memory reasons, whenever there is no need for highly selective XPath queries.
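As a rough illustration of this note, the following sketch contrasts .iterfind(), which yields matches lazily, with .xpath(), which returns a fully built list. The document content and variable names are made up for the example.

```python
from lxml import etree

# A small, made-up document for illustration.
root = etree.XML("<root><item>1</item><item>2</item><item>3</item></root>")

# iterfind() returns an iterator, so processing can stop after the
# first match without collecting the remaining ones.
first_item = next(root.iterfind("item"))

# xpath() always builds the complete result list before returning.
all_items = root.xpath("item")

print(first_item.text)
print(len(all_items))
```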

The `xpath()` method

For ElementTree, the xpath method performs a global XPath query against the document (if absolute) or against the root node (if relative):

>>> f = StringIO('<foo><bar></bar></foo>')
>>> tree = etree.parse(f)
>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'

When xpath() is used on an Element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute):

>>> root = tree.getroot()
>>> r = root.xpath('bar')
>>> r[0].tag
'bar'
>>> bar = root[0]
>>> r = bar.xpath('/foo/bar')
>>> r[0].tag
'bar'
>>> tree = bar.getroottree()
>>> r = tree.xpath('/foo/bar')
>>> r[0].tag
'bar'

The xpath() method has support for XPath variables:

>>> expr = "//*[local-name() = $name]"
>>> print(root.xpath(expr, name="foo")[0].tag)
foo
>>> print(root.xpath(expr, name="bar")[0].tag)
bar
>>> print(root.xpath("$text", text="Hello World!"))
Hello World!
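One practical use of XPath variables is passing values that would be awkward to quote inside the expression text. The sketch below assumes a made-up document and a hypothetical variable name `who`.

```python
from lxml import etree

root = etree.XML('<people><p name="O\'Brien"/><p name="Smith"/></people>')

# The value contains a single quote. Passing it as an XPath variable
# avoids having to quote it inside the expression string.
matches = root.xpath("//p[@name = $who]", who="O'Brien")

print(len(matches))
```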

Namespaces and prefixes

If your XPath expression uses namespace prefixes, you must define them in a prefix mapping. To this end, pass a dictionary to the namespaces keyword argument that maps the namespace prefixes used in the XPath expression to namespace URIs:

>>> f = StringIO('''\
... <a:foo xmlns:a="http://codespeak.net/ns/test1"
...        xmlns:b="http://codespeak.net/ns/test2">
...    <b:bar>Text</b:bar>
... </a:foo>
... ''')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/x:foo/b:bar',
...               namespaces={'x': 'http://codespeak.net/ns/test1',
...                           'b': 'http://codespeak.net/ns/test2'})
>>> len(r)
1
>>> r[0].tag
'{http://codespeak.net/ns/test2}bar'
>>> r[0].text
'Text'

The prefixes you choose here are not linked to the prefixes used inside the XML document. The document may define whatever prefixes it likes, including the empty prefix, without breaking the above code.

Note that XPath does not have a notion of a default namespace. The empty prefix is therefore undefined for XPath and cannot be used in namespace prefix mappings.
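A common consequence is that a document using a default namespace must still be queried through an explicit prefix. The following sketch uses a made-up namespace URI and an arbitrary prefix `n`.

```python
from lxml import etree

# The document declares a default namespace (no prefix).
root = etree.XML('<feed xmlns="http://example.com/ns"><entry/><entry/></feed>')

# An unprefixed name in XPath only matches elements in no namespace,
# so this finds nothing.
unprefixed = root.xpath("//entry")

# Binding the namespace URI to an arbitrary prefix makes the query work.
prefixed = root.xpath("//n:entry", namespaces={"n": "http://example.com/ns"})

print(len(unprefixed), len(prefixed))
```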

There is also an optional extensions argument which is used to define custom extension functions in Python that are local to this evaluation. The namespace prefixes that they use in the XPath expression must also be defined in the namespace prefix mapping.
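A minimal sketch of a local extension function might look as follows. The function name `shout` and the namespace URI are hypothetical. The mapping format (a dictionary keyed by namespace URI and function name) follows the lxml documentation on extension functions.

```python
from lxml import etree

def shout(context, text):
    # Extension functions receive an evaluation context as their first
    # argument, followed by the arguments from the XPath call.
    return text.upper()

root = etree.XML("<root><a>hello</a></root>")
ns_uri = "http://example.com/ext"   # hypothetical namespace URI

result = root.xpath(
    "e:shout(string(a))",
    namespaces={"e": ns_uri},                 # prefix used in the expression
    extensions={(ns_uri, "shout"): shout},    # (namespace URI, name) -> function
)

print(result)
```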

XPath return values

The return value types of XPath evaluations vary, depending on the XPath expression used:

- True or False, when the XPath expression has a boolean result
- a float, when the XPath expression has a numeric result (integer or float)
- a 'smart' string (as described below), when the XPath expression has a string result.
- a list of items, when the XPath expression has a list as result. The items may include Elements (also comments and processing instructions), strings and tuples. Text nodes and attributes in the result are returned as 'smart' string values. Namespace declarations are returned as tuples of strings: (prefix, URI).
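The result types listed above can be observed with a few one-line queries. The document and variable names in this sketch are made up for illustration.

```python
from lxml import etree

root = etree.XML('<root><a id="x">TEXT</a><a/></root>')

is_present = root.xpath("boolean(//a)")   # bool
how_many = root.xpath("count(//a)")       # float, even for whole numbers
as_string = root.xpath("string(//a)")     # 'smart' string (a str subclass)
elements = root.xpath("//a")              # list of Elements
attributes = root.xpath("//a/@id")        # list of 'smart' strings

for value in (is_present, how_many, as_string, elements, attributes):
    print(type(value).__name__, value)
```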

XPath string results are 'smart' in that they provide a getparent() method that knows their origin:

- for attribute values, result.getparent() returns the Element that carries them. An example is //foo/@attribute, where the parent would be a foo Element.
- for the text() function (as in //text()), it returns the Element that contains the text or tail that was returned.

You can distinguish between different text origins with the boolean properties is_text, is_tail and is_attribute.

Note that getparent() may not always return an Element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, getparent() will return None.
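The different string origins can be told apart as sketched below. The document is made up for the example.

```python
from lxml import etree

root = etree.XML('<root><a id="x">TEXT</a>TAIL</root>')

# Two text nodes in document order: the text of <a>, then its tail.
text, tail = root.xpath("//text()")
attr = root.xpath("//a/@id")[0]

# A string constructed by an XPath function has no origin in the tree.
built = root.xpath("concat('a', 'b')")

print(text.is_text, tail.is_tail, attr.is_attribute)
print(attr.getparent().tag)
print(built.getparent())
```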

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

>>> root = etree.XML("<root><a>TEXT</a></root>")
>>> find_text = etree.XPath("//text()")
>>> text = find_text(root)[0]
>>> print(text)
TEXT
>>> print(text.getparent().text)
TEXT
>>> find_text = etree.XPath("//text()", smart_strings=False)
>>> text = find_text(root)[0]
>>> print(text)
TEXT
>>> hasattr(text, 'getparent')
False

Generating XPath expressions

ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element:

>>> a = etree.Element("a")
>>> b = etree.SubElement(a, "b")
>>> c = etree.SubElement(a, "c")
>>> d1 = etree.SubElement(c, "d")
>>> d2 = etree.SubElement(c, "d")
>>> tree = etree.ElementTree(c)
>>> print(tree.getpath(d2))
/c/d[2]
>>> tree.xpath(tree.getpath(d2)) == [d2]
True

The `XPath` class

The XPath class compiles an XPath expression into a callable function:

>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> find = etree.XPath("//b")
>>> print(find(root)[0].tag)
b

The compilation takes as much time as in the xpath() method, but it is done only once per class instantiation. This makes it especially efficient for repeated evaluation of the same XPath expression.

Just like the xpath() method, the XPath class supports XPath variables:

>>> count_elements = etree.XPath("count(//*[local-name() = $name])")
>>> print(count_elements(root, name="a"))
1.0
>>> print(count_elements(root, name="b"))
2.0

This supports very efficient evaluation of modified versions of an XPath expression, as compilation is still only required once.
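The compile-once pattern might be used as in the following sketch, which applies one compiled expression to several made-up documents with different variable values.

```python
from lxml import etree

# Compile the expression once...
count_tag = etree.XPath("count(//*[local-name() = $name])")

docs = [
    etree.XML("<root><a/><a/></root>"),
    etree.XML("<root><a/><b/><b/><b/></root>"),
]

# ...then evaluate it many times, against different trees and with
# different values for the $name variable.
counts = [(count_tag(doc, name="a"), count_tag(doc, name="b")) for doc in docs]

print(counts)
```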

Prefix-to-namespace mappings can be passed as second parameter:

>>> root = etree.XML("<root xmlns='NS'><a><b/></a><b/></root>")
>>> find = etree.XPath("//n:b", namespaces={'n': 'NS'})
>>> print(find(root)[0].tag)
{NS}b

Regular expressions in XPath

By default, XPath supports regular expressions in the EXSLT namespace:

>>> regexpNS = "http://exslt.org/regular-expressions"
>>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
...                    namespaces={'re': regexpNS})
>>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
>>> print(find(root)[0].text)
aBc

You can disable this with the boolean keyword argument regexp which defaults to True.

The `XPathEvaluator` classes

lxml.etree provides two other efficient XPath evaluators that work on ElementTrees or Elements respectively: XPathDocumentEvaluator and XPathElementEvaluator. They are automatically selected if you use the XPathEvaluator helper for instantiation:

>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> xpatheval = etree.XPathEvaluator(root)
>>> print(isinstance(xpatheval, etree.XPathElementEvaluator))
True
>>> print(xpatheval("//b")[0].tag)
b

This class provides efficient support for evaluating different XPath expressions on the same Element or ElementTree.
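As a sketch of that use case, the evaluator below is created once for an ElementTree and then called with several unrelated expressions. The expressions and variable names are chosen only for illustration.

```python
from lxml import etree

root = etree.XML("<root><a><b/></a><b/></root>")
tree = etree.ElementTree(root)

# Passing an ElementTree selects the document evaluator.
evaluate = etree.XPathEvaluator(tree)
is_document_evaluator = isinstance(evaluate, etree.XPathDocumentEvaluator)

# One evaluator, several different expressions on the same tree.
results = {
    expr: evaluate(expr)
    for expr in ("count(//b)", "name(/*)", "boolean(//c)")
}

print(is_document_evaluator)
print(results)
```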

`ETXPath`

ElementTree supports a language named ElementPath in its find*() methods. One of the main differences between XPath and ElementPath is that the XPath language requires an indirection through prefixes for namespace support, whereas ElementTree uses the Clark notation ({ns}name) to avoid prefixes completely. The other major difference regards the capabilities of both path languages. Where XPath supports various sophisticated ways of restricting the result set through functions and boolean expressions, ElementPath only supports pure path traversal without nesting or further conditions. So, while the ElementPath syntax is self-contained and therefore easier to write and handle, XPath is much more powerful and expressive.

lxml.etree bridges this gap through the class ETXPath, which accepts XPath expressions with namespaces in Clark notation. It is identical to the XPath class, except for the namespace notation. Normally, you would write:

>>> root = etree.XML("<root xmlns='ns'><a><b/></a><b/></root>")
>>> find = etree.XPath("//p:b", namespaces={'p': 'ns'})
>>> print(find(root)[0].tag)
{ns}b

ETXPath allows you to change this to:

>>> find = etree.ETXPath("//{ns}b")
>>> print(find(root)[0].tag)
{ns}b

Error handling

lxml.etree raises exceptions when errors occur while parsing or evaluating an XPath expression:

>>> find = etree.XPath("\\")
Traceback (most recent call last):
  ...
lxml.etree.XPathSyntaxError: Invalid expression

lxml will also try to give you a hint what went wrong, so if you pass a more complex expression, you may get a somewhat more specific error:

>>> find = etree.XPath("//*[1.1.1]")
Traceback (most recent call last):
  ...
lxml.etree.XPathSyntaxError: Invalid predicate

During evaluation, lxml will emit an XPathEvalError on errors:

>>> find = etree.XPath("//ns:a")
>>> find(root)
Traceback (most recent call last):
  ...
lxml.etree.XPathEvalError: Undefined namespace prefix

This works for the XPath class, however, the other evaluators (including the xpath() method) are one-shot operations that do parsing and evaluation in one step. They therefore raise evaluation exceptions in all cases:

>>> root = etree.Element("test")
>>> find = root.xpath("//*[1.1.1]")
Traceback (most recent call last):
  ...
lxml.etree.XPathEvalError: Invalid predicate
>>> find = root.xpath("//ns:a")
Traceback (most recent call last):
  ...
lxml.etree.XPathEvalError: Undefined namespace prefix
>>> find = root.xpath("\\")
Traceback (most recent call last):
  ...
lxml.etree.XPathEvalError: Invalid expression

Note that lxml versions before 1.3 always raised an XPathSyntaxError for all errors, including evaluation errors. The best way to support older versions is to except on the superclass XPathError.
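Catching the common superclass could look like the sketch below. The helper name `safe_xpath` is hypothetical and not part of lxml.

```python
from lxml import etree

def safe_xpath(element, expression, **variables):
    """Hypothetical helper: return the XPath result, or None on any XPath error."""
    try:
        return element.xpath(expression, **variables)
    except etree.XPathError as exc:
        # XPathError is the superclass of both XPathSyntaxError
        # and XPathEvalError.
        print("XPath failed: %s" % exc)
        return None

root = etree.Element("test")
ok = safe_xpath(root, "count(//*)")
bad = safe_xpath(root, "//ns:a")      # undefined namespace prefix

# Compilation errors of the XPath class are covered by the same superclass.
try:
    etree.XPath("\\")
    compiled = True
except etree.XPathError:
    compiled = False
```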

XSLT

lxml.etree introduces a new class, lxml.etree.XSLT. The class can be given an ElementTree or Element object to construct an XSLT transformer:

>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)

You can then run the transformation on an ElementTree document by simply calling it, and this results in another ElementTree object:

>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result_tree = transform(doc)

By default, XSLT supports all extension functions from libxslt and libexslt as well as Python regular expressions through the EXSLT regexp functions. Also see the documentation on custom extension functions, XSLT extension elements and document resolvers. There is a separate section on controlling access to external documents and resources.

Note

Due to a bug in libxslt the usage of <xsl:strip-space elements="*"/> in an XSLT stylesheet can lead to crashes or memory failures. It is therefore advised not to use xsl:strip-space in stylesheets used with lxml.

For details see: https://gitlab.gnome.org/GNOME/libxslt/-/issues/14

XSLT result objects

The result of an XSL transformation can be accessed like a normal ElementTree document:

>>> root = etree.XML('<a><b>Text</b></a>')
>>> result = transform(root)
>>> result.getroot().text
'Text'

but, as opposed to normal ElementTree objects, can also be turned into an (XML or text) string by applying the bytes() function (str() in Python 2):

>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>Text</foo>\n'

The result is always a plain string, encoded as requested by the xsl:output element in the stylesheet. If you want a Python Unicode/Text string instead, you should set this encoding to UTF-8 (unless the ASCII default is sufficient). This allows you to call the builtin str() function on the result (unicode() in Python 2):

>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
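For non-ASCII content, requesting UTF-8 in the stylesheet might look like the following sketch. The input text is made up, and only the presence of the text in the output is checked, not the exact XML declaration.

```python
from lxml import etree

xslt_root = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output encoding="UTF-8"/>
    <xsl:template match="/">
        <foo><xsl:value-of select="/a/b/text()" /></foo>
    </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_root)

result = transform(etree.XML("<a><b>Grüße</b></a>"))

as_bytes = bytes(result)   # byte string, UTF-8 encoded as requested by xsl:output
as_text = str(result)      # decoded Python text string

print(as_text)
```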

You can use other encodings at the cost of multiple recoding. Encodings that are not supported by Python will result in an error:

>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:output encoding="UCS4"/>
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc)
>>> str(result)
Traceback (most recent call last):
  ...
LookupError: unknown encoding: UCS4

While it is possible to use the .write() method (known from ElementTree objects) to serialise the XSLT result into a file, it is better to use the .write_output() method. The latter knows about the <xsl:output> tag and writes the expected data into the output file.

>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:output method="text" encoding="utf8" />
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)
>>> result = transform(doc)
>>> result.write_output("output.txt.gz", compression=9)    # doctest: +SKIP

>>> from io import BytesIO
>>> out = BytesIO()
>>> result.write_output(out)
>>> data = out.getvalue()
>>> b'Text' in data
True

Stylesheet parameters

It is possible to pass parameters, in the form of XPath expressions, to the XSLT template:

>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:param name="a" />
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$a" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> doc_root = etree.XML('<a><b>Text</b></a>')

The parameters are passed as keyword parameters to the transform call. First, let's try passing in a simple integer expression:

>>> result = transform(doc_root, a="5")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>5</foo>\n'

You can use any valid XPath expression as parameter value:

>>> result = transform(doc_root, a="/a/b/text()")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>Text</foo>\n'

It's also possible to pass an XPath object as a parameter:

>>> result = transform(doc_root, a=etree.XPath("/a/b/text()"))
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>Text</foo>\n'

Passing a string expression looks like this:

>>> result = transform(doc_root, a="'A'")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>A</foo>\n'

To pass a string that (potentially) contains quotes, you can use the .strparam() class method. Note that it does not escape the string. Instead, it returns an opaque object that keeps the string value.

>>> plain_string_value = etree.XSLT.strparam(
...     """ It's "Monty Python" """)
>>> result = transform(doc_root, a=plain_string_value)
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo> It\'s "Monty Python" </foo>\n'
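The difference between a parameter that is evaluated as an XPath expression and one wrapped by .strparam() can be seen in the sketch below. The variable `user_input` stands for a hypothetical value coming from outside the program.

```python
from lxml import etree

transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="a" />
    <xsl:template match="/">
        <foo><xsl:value-of select="$a" /></foo>
    </xsl:template>
</xsl:stylesheet>'''))
doc_root = etree.XML('<a><b>Text</b></a>')

user_input = "/a/b"   # hypothetical externally supplied value

# Passed directly, the value is evaluated as an XPath expression.
as_xpath = transform(doc_root, a=user_input)

# Wrapped in strparam(), the value is passed through as a literal string.
as_literal = transform(doc_root, a=etree.XSLT.strparam(user_input))

print(as_xpath.getroot().text)
print(as_literal.getroot().text)
```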

If you need to pass parameters that are not legal Python identifiers, pass them inside of a dictionary:

>>> transform = etree.XSLT(etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:param name="non-python-identifier" />
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$non-python-identifier" /></foo>
...     </xsl:template>
... </xsl:stylesheet>'''))
>>> result = transform(doc_root, **{'non-python-identifier': '5'})
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>5</foo>\n'

Errors and messages

Like most of the processing oriented objects in lxml.etree, XSLT provides an error log that lists messages and error output from the last run. See the parser documentation for a description of the error log.

>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <xsl:message terminate="no">STARTING</xsl:message>
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...         <xsl:message terminate="no">DONE</xsl:message>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)
>>> doc_root = etree.XML('<a><b>Text</b></a>')
>>> result = transform(doc_root)
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>Text</foo>\n'
>>> print(transform.error_log)
<string>:0:0:ERROR:XSLT:ERR_OK: STARTING
<string>:0:0:ERROR:XSLT:ERR_OK: DONE
>>> for entry in transform.error_log:
...     print('message from line %s, col %s: %s' % (
...         entry.line, entry.column, entry.message))
...     print('domain: %s (%d)' % (entry.domain_name, entry.domain))
...     print('type: %s (%d)' % (entry.type_name, entry.type))
...     print('level: %s (%d)' % (entry.level_name, entry.level))
...     print('filename: %s' % entry.filename)
message from line 0, col 0: STARTING
domain: XSLT (22)
type: ERR_OK (0)
level: ERROR (2)
filename: <string>
message from line 0, col 0: DONE
domain: XSLT (22)
type: ERR_OK (0)
level: ERROR (2)
filename: <string>

Note that there is no way in XSLT to distinguish between user messages, warnings and error messages that occurred during the run. libxslt simply does not provide this information. You can partly work around this limitation by making your own messages uniquely identifiable, e.g. with a common text prefix.
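The prefix workaround could be sketched as follows. The marker `APP: ` is an arbitrary choice for this example, not an lxml convention.

```python
from lxml import etree

xslt_root = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <xsl:message terminate="no">APP: starting</xsl:message>
        <foo/>
    </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_root)
transform(etree.XML("<a/>"))

PREFIX = "APP: "   # hypothetical marker used by our own xsl:message calls

# Keep only the log entries that carry our own marker.
user_messages = [
    entry.message[len(PREFIX):].strip()
    for entry in transform.error_log
    if entry.message.startswith(PREFIX)
]

print(user_messages)
```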

The `xslt()` tree method

There's also a convenience method on ElementTree objects for doing XSL transformations. This is less efficient if you want to apply the same XSL transformation to multiple documents, but is shorter to write for one-shot operations, as you do not have to instantiate a stylesheet yourself:

>>> result = doc.xslt(xslt_tree, a="'A'")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>A</foo>\n'

This is a shortcut for the following code:

>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc, a="'A'")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>A</foo>\n'

Dealing with stylesheet complexity

Some applications require a larger set of rather diverse stylesheets. lxml.etree allows you to deal with this in a number of ways. Here are some ideas to try.

The simplest way to reduce the diversity is by using XSLT parameters that you pass at call time to configure the stylesheets. The partial() function in the functools module may come in handy here. It allows you to bind a set of keyword arguments (i.e. stylesheet parameters) to a reference of a callable stylesheet. The same works for instances of the XPath() evaluator.
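Binding stylesheet parameters with functools.partial() might look like this sketch. The stylesheet and the names `transform_five` and `transform_text` are made up for the example.

```python
from functools import partial
from lxml import etree

transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="a" />
    <xsl:template match="/">
        <foo><xsl:value-of select="$a" /></foo>
    </xsl:template>
</xsl:stylesheet>'''))
doc = etree.XML("<a><b>Text</b></a>")

# Two preconfigured variants of the same compiled stylesheet.
transform_five = partial(transform, a="5")
transform_text = partial(transform, a="/a/b/text()")

print(transform_five(doc).getroot().text)
print(transform_text(doc).getroot().text)
```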

You may also consider creating stylesheets programmatically. Just create an XSL tree, e.g. from a parsed template, and then add or replace parts as you see fit. Passing an XSL tree into the XSLT() constructor multiple times will create independent stylesheets, so later modifications of the tree will not be reflected in the already created stylesheets. This makes stylesheet generation very straightforward.
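The independence of stylesheets built from the same tree can be checked with a sketch like the one below, which modifies a made-up template between two XSLT() calls.

```python
from lxml import etree

xslt_root = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <foo>first</foo>
    </xsl:template>
</xsl:stylesheet>''')

first = etree.XSLT(xslt_root)

# Modify the XSL tree after the first stylesheet has been created.
xslt_root.find(".//foo").text = "second"
second = etree.XSLT(xslt_root)

doc = etree.XML("<a/>")

# The first stylesheet is unaffected by the later modification.
print(first(doc).getroot().text)
print(second(doc).getroot().text)
```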

A third thing to remember is the support for custom extension functions and XSLT extension elements. Some things are much easier to express in XSLT than in Python, while for others it is the complete opposite. Finding the right mixture of Python code and XSL code can help a great deal in keeping applications well designed and maintainable.

Profiling

If you want to know how your stylesheet performed, pass the profile_run keyword to the transform:

>>> result = transform(doc, a="/a/b/text()", profile_run=True)
>>> profile = result.xslt_profile

The value of the xslt_profile property is an ElementTree with profiling data about each template, similar to the following:

<profile>
  <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
</profile>

Note that this is a read-only document. You must not move any of its elements to other documents. Please deep-copy the document if you need to modify it. If you want to free it from memory, just do:

>>> del result.xslt_profile
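Reading the profiling data without modifying it might look like the following sketch. It assumes that the underlying libxslt build includes profiling support, and the timing values will differ from run to run.

```python
from lxml import etree

transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <foo><xsl:value-of select="/a/b/text()" /></foo>
    </xsl:template>
</xsl:stylesheet>'''))

result = transform(etree.XML("<a><b>Text</b></a>"), profile_run=True)
profile = result.xslt_profile

# Only read from the profile document, since it is read-only.
profile_root_tag = profile.getroot().tag
for template in profile.getroot():
    print(template.get("match"), template.get("calls"), template.get("time"))
```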
