APIs specific to lxml.etree

lxml.etree tries to follow established APIs wherever possible. Sometimes, however, the need to expose a feature in an easy way led to the invention of a new API. This page describes the major differences and a few additions to the main ElementTree API.

For a complete reference of the API, see the generated API documentation.

Separate pages describe the support for parsing XML, executing XPath and XSLT, validating XML and interfacing with other XML tools through the SAX-API.

lxml is extremely extensible through XPath functions in Python, custom Python element classes, custom URL resolvers and even at the C-level.

lxml.etree

lxml.etree tries to follow the ElementTree API wherever it can. There are however some incompatibilities (see compatibility). The extensions are documented here.

If you need to know which version of lxml is installed, you can access the lxml.etree.LXML_VERSION attribute to retrieve a version tuple. Note, however, that it did not exist before version 1.0, so you will get an AttributeError in older versions. The versions of libxml2 and libxslt are available through the attributes LIBXML_VERSION and LIBXSLT_VERSION.
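As a minimal sketch, the version attributes can be printed at start-up, for example when filing a bug report. The helper name `format_version` is made up for this illustration; only the three `*_VERSION` attributes come from lxml.

```python
from lxml import etree

def format_version(version_tuple):
    # Version tuples look like (major, minor, patch, ...); join them with dots.
    return ".".join(str(part) for part in version_tuple)

print("lxml:   ", format_version(etree.LXML_VERSION))
print("libxml2:", format_version(etree.LIBXML_VERSION))
print("libxslt:", format_version(etree.LIBXSLT_VERSION))
```

Because the values are plain tuples, they can also be compared directly, e.g. `etree.LXML_VERSION >= (3, 4)`.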

The following examples usually assume this to be executed first:

>>> from lxml import etree
>>> from io import BytesIO, StringIO  # used by several examples below

Other Element APIs

While lxml.etree itself uses the ElementTree API, it is possible to replace the Element implementation by custom element subclasses. This has been used to implement well-known XML APIs on top of lxml. For example, lxml ships with a data-binding implementation called objectify, which is similar to the Amara bindery tool.

lxml.etree comes with a number of different lookup schemes to customize the mapping between libxml2 nodes and the Element classes used by lxml.etree.

Trees and Documents

Compared to the original ElementTree API, lxml.etree has an extended tree model. It knows about parents and siblings of elements:

>>> root = etree.Element("root")
>>> a = etree.SubElement(root, "a")
>>> b = etree.SubElement(root, "b")
>>> c = etree.SubElement(root, "c")
>>> d = etree.SubElement(root, "d")
>>> e = etree.SubElement(d, "e")
>>> b.getparent() == root
True
>>> print(b.getnext().tag)
c
>>> print(c.getprevious().tag)
b

Elements always live within a document context in lxml. This implies that there is also a notion of an absolute document root. You can retrieve an ElementTree for the root node of a document from any of its elements.

>>> tree = d.getroottree()
>>> print(tree.getroot().tag)
root

Note that this is different from wrapping an Element in an ElementTree. You can use ElementTrees to create XML trees with an explicit root node:

>>> tree = etree.ElementTree(d)
>>> print(tree.getroot().tag)
d
>>> etree.tostring(tree)
b'<d><e/></d>'

ElementTree objects are serialised as complete documents, including preceding or trailing processing instructions and comments.

All operations that you run on such an ElementTree (like XPath, XSLT, etc.) will understand the explicitly chosen root as root node of a document. They will not see any elements outside the ElementTree. However, ElementTrees do not modify their Elements:

>>> element = tree.getroot()
>>> print(element.tag)
d
>>> print(element.getparent().tag)
root
>>> print(element.getroottree().getroot().tag)
root

The rule is that all operations that are applied to Elements use either the Element itself as reference point, or the absolute root of the document that contains this Element (e.g. for absolute XPath expressions). All operations on an ElementTree use its explicit root node as reference.
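The following sketch illustrates that rule with an absolute XPath expression. The small document and the variable names are chosen for this illustration only.

```python
from lxml import etree

root = etree.XML("<root><a/><d><e/></d></root>")
d = root.find("d")
e = d.find("e")

# Element API: an absolute path starts at the real document root.
from_element = [el.tag for el in e.xpath("/*")]

# ElementTree API: the explicitly chosen root acts as the document root.
subtree = etree.ElementTree(d)
from_tree = [el.tag for el in subtree.xpath("/*")]
outside = subtree.xpath("/root")  # elements outside the subtree are invisible

print(from_element)  # ['root']
print(from_tree)     # ['d']
print(outside)       # []
```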

Iteration

The ElementTree API makes Elements iterable to support iteration over their children. Using the tree defined above, we get:

>>> [child.tag for child in root]
['a', 'b', 'c', 'd']

To iterate in the opposite direction, use the builtin reversed() function.
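A short sketch of reverse iteration, using a freshly parsed tree with the same four children so that it runs on its own:

```python
from lxml import etree

root = etree.XML("<root><a/><b/><c/><d/></root>")

# reversed() walks the direct children from last to first.
tags = [child.tag for child in reversed(root)]
print(tags)  # ['d', 'c', 'b', 'a']
```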

Tree traversal should use the element.iter() method:

>>> [el.tag for el in root.iter()]
['root', 'a', 'b', 'c', 'd', 'e']

lxml.etree also supports this, but additionally features an extended API for iteration over the children, following/preceding siblings, ancestors and descendants of an element, as defined by the respective XPath axis:

>>> [child.tag for child in root.iterchildren()]
['a', 'b', 'c', 'd']
>>> [child.tag for child in root.iterchildren(reversed=True)]
['d', 'c', 'b', 'a']
>>> [sibling.tag for sibling in b.itersiblings()]
['c', 'd']
>>> [sibling.tag for sibling in c.itersiblings(preceding=True)]
['b', 'a']
>>> [ancestor.tag for ancestor in e.iterancestors()]
['d', 'root']
>>> [el.tag for el in root.iterdescendants()]
['a', 'b', 'c', 'd', 'e']

Note how element.iterdescendants() does not include the element itself, as opposed to element.iter(). The latter effectively implements the 'descendant-or-self' axis in XPath.

All of these iterators support one (or more, since lxml 3.0) additional arguments that filter the generated elements by tag name:

>>> [child.tag for child in root.iterchildren('a')]
['a']
>>> [child.tag for child in d.iterchildren('a')]
[]
>>> [el.tag for el in root.iterdescendants('d')]
['d']
>>> [el.tag for el in root.iter('d')]
['d']
>>> [el.tag for el in root.iter('d', 'a')]
['a', 'd']

Note that the order of the elements is determined by the iteration order, which is the document order in most cases (except for preceding siblings and ancestors, where it is the reversed document order). The order of the tag selection arguments is irrelevant, as you can see in the last example.

The most common way to traverse an XML tree is depth-first, which traverses the tree in document order. This is implemented by the .iter() method. While there is no dedicated method for breadth-first traversal, it is almost as simple if you use the collections.deque type.

>>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
>>> print(etree.tostring(root, pretty_print=True, encoding='unicode'))
<root>
  <a>
    <b/>
    <c/>
  </a>
  <d>
    <e/>
  </d>
</root>

>>> from collections import deque
>>> queue = deque([root])
>>> while queue:
...     el = queue.popleft()  # pop next element
...     queue.extend(el)      # append its children
...     print(el.tag)
root
a
d
b
c
e

See also the section on the utility functions iterparse() and iterwalk() in the parser documentation.

Error handling on exceptions

Libxml2 provides error messages for failures, be it during parsing, XPath evaluation or schema validation. The preferred way of accessing them is through the local error_log property of the respective evaluator or transformer object. See their documentation for details.

However, lxml also keeps a global error log of all errors that occurred at the application level. Whenever an exception is raised, you can retrieve the errors that occurred and "might have" led to the problem from the error log copy attached to the exception:

>>> etree.clear_error_log()
>>> broken_xml = '''
... <root>
...   <a>
... </root>
... '''
>>> try:
...     etree.parse(StringIO(broken_xml))
... except etree.XMLSyntaxError as exc:
...     e = exc  # keep a reference; Python 3 unbinds 'exc' after the block

Once you have caught this exception, you can access its error_log property to retrieve the log entries or filter them by a specific type, error domain or error level:

>>> log = e.error_log.filter_from_level(etree.ErrorLevels.FATAL)
>>> print(log[0])
<string>:4:8:FATAL:PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: a line 3 and root

This might look a little cryptic at first, but it is the information that libxml2 gives you. At least the message at the end should give you a hint about what went wrong, and you can see that the fatal errors (FATAL) happened during parsing (PARSER) at line 4, column 8 and line 5, column 1 of a string (<string>, or the filename if available). Here, PARSER is the so-called error domain, see lxml.etree.ErrorDomains for that. You can get it from a log entry like this:

>>> entry = log[0]
>>> print(entry.domain_name)
PARSER
>>> print(entry.type_name)
ERR_TAG_NAME_MISMATCH
>>> print(entry.filename)
<string>

There is also a convenience attribute error_log.last_error that returns the last error or fatal error that occurred, so that it's easy to test if there was an error at all. Note, however, that there might have been more than one error, and the first error that occurred might be more relevant in some cases.
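As a self-contained sketch, the log attached to an exception can be walked entry by entry and `last_error` used as a quick check. The broken input mirrors the example above; exact messages and positions depend on the installed libxml2, so none are shown here.

```python
from io import StringIO
from lxml import etree

etree.clear_error_log()
broken_xml = "\n<root>\n  <a>\n</root>\n"

try:
    etree.parse(StringIO(broken_xml))
except etree.XMLSyntaxError as exc:
    error_log = exc.error_log  # copy of the log attached to the exception

# Each entry exposes its position, severity and message as attributes.
for entry in error_log:
    print(entry.line, entry.column, entry.level_name, entry.message)

# last_error is None when nothing was logged.
last = error_log.last_error
if last is not None:
    print("last error:", last.type_name)
```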

Error logging

lxml.etree supports logging libxml2 messages to the Python stdlib logging module. This is done through the etree.PyErrorLog class. It disables the error reporting from exceptions and forwards log messages to a Python logger. To use it, see the descriptions of the function etree.useGlobalPythonLog and the class etree.PyErrorLog for help. Note that this does not affect the local error logs of XSLT, XMLSchema, etc.

Serialisation

C14N

lxml.etree has support for C14N 1.0 and C14N 2.0. When serialising an XML tree using ElementTree.write() or tostring(), you can pass the option method="c14n" for 1.0 or method="c14n2" for 2.0.

Additionally, there is a function etree.canonicalize() which can be used to convert serialised XML to its canonical form directly, without creating a tree in memory. By default, it returns the canonical output, but can be directed to write it to a file instead.

>>> c14n_xml = etree.canonicalize("<root><test z='1' y='2'/></root>")
>>> print(c14n_xml)
<root><test y="2" z="1"></test></root>
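Writing to a file instead can be sketched with the `out` keyword argument, which takes a text file-like object. Here an `io.StringIO` stands in for an open file; that substitution is an assumption made to keep the example self-contained.

```python
import io
from lxml import etree

buffer = io.StringIO()  # stands in for a file opened in text mode
result = etree.canonicalize("<root><test z='1' y='2'/></root>", out=buffer)

print(result)             # None, because the output went to `buffer`
print(buffer.getvalue())  # <root><test y="2" z="1"></test></root>
```</root>'
</test>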

Pretty printing

Functions like ElementTree.write() and tostring() also support pretty printing XML through a keyword argument:

>>> root = etree.XML("<root><test/></root>")
>>> etree.tostring(root)
b'<root><test/></root>'
>>> print(etree.tostring(root, pretty_print=True, encoding='unicode'))
<root>
  <test/>
</root>

Note the newline that is appended at the end when pretty printing the output. It was added in lxml 2.0.
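The same keyword works for ElementTree.write(). The sketch below writes to an in-memory `BytesIO` instead of a real file, which is an assumption made to keep it self-contained.

```python
from io import BytesIO
from lxml import etree

tree = etree.ElementTree(etree.XML("<root><test/></root>"))

out = BytesIO()
tree.write(out, pretty_print=True)

# The serialised bytes end with the newline mentioned above.
print(out.getvalue())  # b'<root>\n  <test/>\n</root>\n'
```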

XML declaration

By default, lxml (just as ElementTree) outputs the XML declaration only if it is required by the standard:

>>> unicode_root = etree.Element("t\u3120st")
>>> unicode_root.text = "t\u0A0Ast"
>>> etree.tostring(unicode_root, encoding="utf-8")
b'<t\xe3\x84\xa0st>t\xe0\xa8\x8ast</t\xe3\x84\xa0st>'

>>> print(etree.tostring(unicode_root, encoding="iso-8859-1").decode("iso-8859-1"))
<?xml version='1.0' encoding='iso-8859-1'?>
<t&#12576;st>t&#2570;st</t&#12576;st>

Also see the general remarks on Unicode support.

You can enable or disable the declaration explicitly by passing another keyword argument for the serialisation:

>>> print(etree.tostring(root, xml_declaration=True).decode())
<?xml version='1.0' encoding='ASCII'?>
<root><test/></root>

>>> unicode_root.clear()
>>> etree.tostring(unicode_root, encoding="UTF-16LE",
...                xml_declaration=False)
b'<\x00t\x00 1s\x00t\x00/\x00>\x00'

Note that a standard compliant XML parser will not consider the last line well-formed XML if the encoding is not explicitly provided somehow, e.g. in an underlying transport protocol:

>>> notxml = etree.tostring(unicode_root, encoding="UTF-16LE",
...                         xml_declaration=False)
>>> root = etree.XML(notxml)  #doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
lxml.etree.XMLSyntaxError: ...

Since version 2.3, the serialisation can override the internal subset of the document with a user provided DOCTYPE:

>>> xml = '<!DOCTYPE root>\n<root/>'
>>> tree = etree.parse(StringIO(xml))

>>> print(etree.tostring(tree).decode())
<!DOCTYPE root>
<root/>

>>> print(etree.tostring(tree,
...     doctype='<!DOCTYPE root SYSTEM "/tmp/test.dtd">').decode())
<!DOCTYPE root SYSTEM "/tmp/test.dtd">
<root/>

The content will be encoded, but otherwise copied verbatim into the output stream. It is therefore left to the user to take care of a correct doctype format, including the name of the root node.

Incremental XML generation

Since version 3.1, lxml provides an xmlfile API for incrementally generating XML using the with statement. Its main purpose is to freely and safely mix surrounding elements with pre-built in-memory trees, e.g. to write out large documents that consist mostly of repetitive subtrees (like database dumps). But it can be useful in many cases where memory consumption matters or where XML is naturally generated in sequential steps. Since lxml 3.4.1, there is an equivalent context manager for HTML serialisation called htmlfile.

The API can serialise to real files (given as file path or file object), as well as file-like objects, e.g. io.BytesIO(). Here is a simple example:

>>> f = BytesIO()
>>> with etree.xmlfile(f) as xf:
...     with xf.element('abc'):
...         xf.write('text')

>>> print(f.getvalue().decode('utf-8'))
<abc>text</abc>

xmlfile() accepts a file path as first argument, or a file(-like) object, as in the example above. In the first case, it takes care to open and close the file itself, whereas file(-like) objects are not closed by default. This is left to the code that opened them. Since lxml 3.4, however, you can pass the argument close=True to make lxml call the object's .close() method when exiting the xmlfile context manager.
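The difference can be sketched with two in-memory streams. The variable names are chosen for this illustration; the `close=True` argument is the one described above and requires lxml 3.4 or later.

```python
from io import BytesIO
from lxml import etree

# Default: the caller stays responsible for closing the stream.
kept_open = BytesIO()
with etree.xmlfile(kept_open) as xf:
    with xf.element("abc"):
        xf.write("text")
print(kept_open.closed)      # False
print(kept_open.getvalue())  # b'<abc>text</abc>'

# close=True: lxml calls .close() when the context manager exits.
auto_closed = BytesIO()
with etree.xmlfile(auto_closed, close=True) as xf:
    with xf.element("abc"):
        xf.write("text")
print(auto_closed.closed)    # True
```

Note that a closed `BytesIO` no longer allows `getvalue()`, so `close=True` is mostly useful for real files and sockets.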

To insert pre-constructed Elements and subtrees, just pass them into write():

>>> f = BytesIO()
>>> with etree.xmlfile(f) as xf:
...     with xf.element('abc'):
...         with xf.element('in'):
...
...             for value in '123':
...                 # construct a really complex XML tree
...                 el = etree.Element('xyz', attr=value)
...
...                 xf.write(el)
...
...                 # no longer needed, discard it right away!
...                 el = None

>>> print(f.getvalue().decode('utf-8'))
<abc><in><xyz attr="1"/><xyz attr="2"/><xyz attr="3"/></in></abc>

It is a common pattern to have one or more nested element() blocks, and then build in-memory XML subtrees in a loop (using the ElementTree API, the builder API, XSLT, or whatever) and write them out into the XML file one after the other. That way, they can be removed from memory right after their construction, which can largely reduce the memory footprint of an application, while keeping the overall XML generation easy, safe and correct.

Together with Python coroutines, this can be used to generate XML in an asynchronous, non-blocking fashion, e.g. for a stream protocol like the instant messaging protocol XMPP:

def writer(out_stream):
    with etree.xmlfile(out_stream) as xf:
        with xf.element('{http://etherx.jabber.org/streams}stream'):
            while True:
                el = (yield)
                xf.write(el)
                xf.flush()

w = writer(stream)
next(w)   # start writing (run up to 'yield')

Then, whenever XML elements are available for writing, call

w.send(element)

And when done:

w.close()

Note the additional xf.flush() call in the example above, which is available since lxml 3.4. Normally, the output stream is buffered to avoid excessive I/O calls. Whenever the internal buffer fills up, its content is written out. In the case above, however, we want to make sure that each message that we write (i.e. each element subtree) is written out immediately, so we flush the content explicitly at the right point.

Alternatively, if buffering is not desired at all, it can be disabled by passing the flag buffered=False into xmlfile() (also since lxml 3.4).
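The effect of buffering can be observed by checking how many bytes have reached the stream after each write. This is a rough sketch; the helper `write_entries` and the element names are made up for the illustration, and the byte counts seen in buffered mode depend on the size of the internal buffer.

```python
from io import BytesIO
from lxml import etree

def write_entries(buffered):
    out = BytesIO()
    sizes = []
    with etree.xmlfile(out, buffered=buffered) as xf:
        with xf.element("log"):
            for i in range(3):
                xf.write(etree.Element("entry", n=str(i)))
                # How many bytes have reached the stream so far?
                sizes.append(len(out.getvalue()))
    return sizes, out.getvalue()

buffered_sizes, buffered_xml = write_entries(buffered=True)
unbuffered_sizes, unbuffered_xml = write_entries(buffered=False)

print(buffered_sizes)    # small documents may not reach the stream until the end
print(unbuffered_sizes)  # grows with every write
print(buffered_xml == unbuffered_xml)  # True, the final output is identical
```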

Here is a similar example using an async coroutine in Python 3.5 or later, which is supported since lxml 4.0. The output stream is expected to have methods async def write(self, data) and async def close(self) in this case.

async def writer(out_stream, xml_messages):
    async with etree.xmlfile(out_stream) as xf:
        async with xf.element('{http://etherx.jabber.org/streams}stream'):
            async for el in xml_messages:
                await xf.write(el)
                await xf.flush()


class DummyAsyncOut(object):
    async def write(self, data):
        print(data.decode('utf8'))

    async def close(self):
        pass

stream = DummyAsyncOut()
async_writer = writer(stream, async_message_stream)

CDATA

By default, lxml's parser will strip CDATA sections from the tree and replace them by their plain text content. As real applications for CDATA are rare, this is the best way to deal with this issue.

However, in some cases, keeping CDATA sections or creating them in a document is required to adhere to existing XML language definitions. For these special cases, you can instruct the parser to leave CDATA sections in the document:

>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><![CDATA[test]]></root>', parser)
>>> root.text
'test'

>>> etree.tostring(root)
b'<root><![CDATA[test]]></root>'

Note how the .text property does not give any indication that the text content is wrapped by a CDATA section. If you want to make sure your data is wrapped by a CDATA block, you can use the CDATA() text wrapper:

>>> root.text = 'test'

>>> root.text
'test'
>>> etree.tostring(root)
b'<root>test</root>'

>>> root.text = etree.CDATA(root.text)

>>> root.text
'test'
>>> etree.tostring(root)
b'<root><![CDATA[test]]></root>'

XInclude and ElementInclude

You can let lxml process xinclude statements in a document by calling the xinclude() method on a tree:

>>> data = StringIO('''\
... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
... <foo/>
... <xi:include href="doc/test.xml" />
... </doc>''')

>>> tree = etree.parse(data)
>>> tree.xinclude()
>>> print(etree.tostring(tree.getroot()).decode())
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
<foo/>
<a xml:base="doc/test.xml"/>
</doc>

Note that the ElementTree compatible ElementInclude module is also supported as lxml.ElementInclude. It has the additional advantage of supporting custom URL resolvers at the Python level. The normal XInclude mechanism cannot deploy these. If you need ElementTree compatibility or custom resolvers, you have to stick to the external Python module.
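A rough sketch of a Python-level loader with lxml.ElementInclude follows. The `DOCUMENTS` dictionary and the `loader` function are hypothetical stand-ins for a real resolver; the loader uses the ElementTree-style signature `(href, parse, encoding=None)`.

```python
from lxml import etree, ElementInclude

# Hypothetical in-memory "file system" standing in for real documents.
DOCUMENTS = {"doc/test.xml": "<a/>"}

def loader(href, parse, encoding=None):
    # Called for each xi:include; `parse` is "xml" or "text".
    data = DOCUMENTS[href]
    if parse == "xml":
        return etree.fromstring(data)
    return data

root = etree.XML(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<foo/><xi:include href="doc/test.xml"/></doc>'
)

# Replaces the xi:include element in place with the loaded subtree.
ElementInclude.include(root, loader=loader)
tags = [child.tag for child in root]
print(tags)
```

Because the loader is plain Python, it can fetch content from a database, a cache or an archive instead of the file system.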
