lxml FAQ - Frequently Asked Questions

Frequently asked questions on lxml. See also the notes on compatibility to ElementTree.

General Questions

Is there a tutorial?

Read the lxml.etree Tutorial. While this is still work in progress (just as any good documentation), it provides an overview of the most important concepts in lxml.etree. If you want to help out, improving the tutorial is a very good place to start.

There is also a tutorial for ElementTree which works for lxml.etree. The documentation of the extended etree API also contains many examples for lxml.etree. Fredrik Lundh's element library contains a lot of nice recipes that show how to solve common tasks in ElementTree and lxml.etree. To learn using lxml.objectify, read the objectify documentation.

John Shipman has written another tutorial called Python XML processing with lxml that contains lots of examples. Liza Daly wrote a nice article about high-performance aspects when parsing large files with lxml.

Where can I find more documentation about lxml?

There is a lot of documentation on the web and also in the Python standard library documentation, as lxml implements the well-known ElementTree API and tries to follow its documentation as closely as possible. The recipes in Fredrik Lundh's element library are generally worth taking a look at. There are a couple of issues where lxml cannot keep up compatibility. They are described in the compatibility documentation.

The lxml specific extensions to the API are described by individual files in the doc directory of the source distribution and on the web page.

The generated API documentation is a comprehensive API reference for the lxml package.

What standards does lxml implement?

The compliance to XML Standards depends on the support in libxml2 and libxslt. Here is a quote from http://xmlsoft.org/:

In most cases libxml2 tries to implement the specifications in a relatively strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests from the OASIS XML Tests Suite.

lxml currently supports libxml2 2.6.20 or later, which has even better support for various XML standards. The important ones are:

- XML 1.0
- HTML 4
- XML namespaces
- XML Schema 1.0
- XPath 1.0
- XInclude 1.0
- XSLT 1.0
- EXSLT
- XML catalogs
- canonical XML
- RelaxNG
- xml:id
- xml:base

Support for XML Schema is currently not 100% complete in libxml2, but is definitely very close to compliance. Schematron is supported in two ways, the best being the original ISO Schematron reference implementation via XSLT. libxml2 also supports loading documents through HTTP and FTP.

For RelaxNG Compact Syntax support, there is a tool called rnc2rng, written by David Mertz, which you might be able to use from Python. Failing that, trang is the 'official' command line tool (written in Java) to do the conversion.

Who uses lxml?

As an XML library, lxml is often used under the hood of in-house server applications, such as web servers or applications that facilitate some kind of content management. Many people who deploy Zope, Plone or Django use it together with lxml in the background, without speaking publicly about it. Therefore, it is hard to get an idea of who uses it, and the following list of 'users and projects we know of' is very far from a complete list of lxml's users.

Also note that the compatibility to the ElementTree library does not require projects to set a hard dependency on lxml - as long as they do not take advantage of lxml's enhanced feature set.

- cssutils, a CSS parser and toolkit, can be used with lxml.cssselect
- Deliverance, a content theming tool
- Enfold Proxy 4, a web server accelerator with on-the-fly XSLT processing
- Inteproxy, a secure HTTP proxy
- lwebstring, an XML template engine
- OpenXMLlib, a library for handling OpenXML document meta data
- PsychoPy, psychology software in Python
- Pycoon, a WSGI web development framework based on XML pipelines
- PyQuery, a query framework for XML/HTML, similar to jQuery for JavaScript
- python-docx, a package for handling Microsoft's Word OpenXML format
- Rambler, a meta search engine that aggregates different data sources
- rdfadict, an RDFa parser with a simple dictionary-like interface
- xupdate-processor, an XUpdate implementation for lxml.etree
- Diazo, an XSLT-under-the-hood web site theming engine

Zope3 and some of its extensions have good support for lxml:

- gocept.lxml, Zope3 interface bindings for lxml
- z3c.rml, an implementation of ReportLab's RML format
- zif.sedna, an XQuery based interface to the Sedna OpenSource XML database

And don't miss the quotes by our generally happy users, and other sites that link to lxml. As Liza Daly puts it: "Many software products come with the pick-two caveat, meaning that you must choose only two: speed, flexibility, or readability. When used carefully, lxml can provide all three."

What is the difference between lxml.etree and lxml.objectify?

The two modules provide different ways of handling XML. However, objectify builds on top of lxml.etree and therefore inherits most of its capabilities and a large portion of its API.

- lxml.etree is a generic API for XML and HTML handling. It aims for ElementTree compatibility and supports the entire XML infoset. It is well suited for both mixed content and data centric XML. Its generality makes it the best choice for most applications.
- lxml.objectify is a specialized API for XML data handling in a Python object syntax. It provides a very natural way to deal with data fields stored in a structurally well defined XML format. Data is automatically converted to Python data types and can be manipulated with normal Python operators. Look at the examples in the objectify documentation to see what it feels like to use it.

  Objectify is not well suited for mixed contents or HTML documents. As it is built on top of lxml.etree, however, it inherits the normal support for XPath, XSLT or validation.
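
To make the difference concrete, here is a minimal sketch that reads the same data-centric document through both modules. The document and its element names (order, quantity, price, note) are invented for illustration only.

```python
from lxml import etree, objectify

# Hypothetical data-centric document, made up for this comparison.
XML = b"<order><quantity>2</quantity><price>9.5</price><note>rush</note></order>"

# lxml.etree: generic tree access, text content is always a string.
root = etree.fromstring(XML)
etree_quantity = root.findtext("quantity")       # a str, convert it yourself
etree_price = float(root.findtext("price"))

# lxml.objectify: children are attributes, leaf values behave like Python types.
order = objectify.fromstring(XML)
next_quantity = order.quantity + 1               # integer arithmetic
double_price = order.price * 2                   # float arithmetic
note_text = order.note.text                      # plain string content
```

With etree the conversion to numbers is explicit, while objectify infers the types from the text content, which is convenient for well structured data but of little use for mixed content.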

How can I make my application run faster?

lxml.etree is a very fast library for processing XML. There are, however, a few caveats involved in the mapping of the powerful libxml2 library to the simple and convenient ElementTree API. Not all operations are as fast as the simplicity of the API might suggest, while some use cases can heavily benefit from finding the right way of doing them. The benchmark page has a comparison to other ElementTree implementations and a number of tips for performance tweaking. As with any Python application, the rule of thumb is: the more of your processing runs in C, the faster your application gets. See also the section on threading.

What about that trailing text on serialised Elements?

The ElementTree tree model defines an Element as a container with a tag name, contained text, child Elements and a tail text. This means that whenever you serialise an Element, you will get all parts of that Element:

    >>> root = etree.XML("<root><tag>text<child/></tag>tail</root>")
    >>> print(etree.tostring(root[0]))
    <tag>text<child/></tag>tail

Here is an example that shows why not serialising the tail would be even more surprising from an object point of view:

    >>> root = etree.Element("test")

    >>> root.text = "TEXT"
    >>> print(etree.tostring(root))
    <test>TEXT</test>

    >>> root.tail = "TAIL"
    >>> print(etree.tostring(root))
    <test>TEXT</test>TAIL

    >>> root.tail = None
    >>> print(etree.tostring(root))
    <test>TEXT</test>

Just imagine a Python list where you append an item and it doesn't show up when you look at the list.

The .tail property is a huge simplification for the tree model as it keeps text nodes from appearing in the list of children and makes access to them quick and simple. So this is a benefit in most applications and simplifies many, many XML tree algorithms.

However, in document-like XML (and especially HTML), the above result can be unexpected to new users and can sometimes require a bit more overhead. A good way to deal with this is to use helper functions that copy the Element without its tail. The lxml.html package also deals with this in a couple of places, as most HTML algorithms benefit from a tail-free behaviour.

How can I find out if an Element is a comment or PI?

    >>> root = etree.XML("<?my PI?><root><!-- empty --></root>")

    >>> root.tag
    'root'
    >>> root.getprevious().tag is etree.PI
    True
    >>> root[0].tag is etree.Comment
    True

How can I map an XML tree into a dict of dicts?

I'm glad you asked.

    def recursive_dict(element):
        return element.tag, \
               dict(map(recursive_dict, element)) or element.text

Why does lxml sometimes return 'str' values for text in Python 2?

In Python 2, lxml's API returns byte strings for plain ASCII text values, be it for tag names or text in Element content. This is the same behaviour as known from ElementTree. The reasoning is that ASCII encoded byte strings are compatible with Unicode strings in Python 2, but consume less memory (usually by a factor of 2 or 4) and are faster to create because they do not require decoding. Plain ASCII string values are very common in XML, so this optimisation is generally worth it.

In Python 3, lxml always returns Unicode strings for text and names, as does ElementTree. Since Python 3.3, Unicode strings containing only characters that can be encoded in ASCII or Latin-1 are generally as efficient as byte strings. In older versions of Python 3, the above mentioned drawbacks apply.

Installation

Which version of libxml2 and libxslt should I use or require?

It really depends on your application, but the rule of thumb is: more recent versions contain fewer bugs and provide more features.

- Do not use libxml2 2.6.27 if you want to use XPath (including XSLT). You will get crashes when XPath errors occur during the evaluation (e.g. for unknown functions). This happens inside the evaluation call to libxml2, so there is nothing that lxml can do about it.
- Try to use versions of both libraries that were released together. At least the libxml2 version should not be older than the libxslt version.
- If you use XML Schema or Schematron which are still under development, the most recent version of libxml2 is usually a good bet.
- The same applies to XPath, where a substantial number of bugs and memory leaks were fixed over time. If you encounter crashes or memory leaks in XPath applications, try a more recent version of libxml2.
- For parsing and fixing broken HTML, lxml requires at least libxml2 2.6.21.
- For the normal tree handling, however, any libxml2 version starting with 2.6.20 should do.

Read the release notes of libxml2 and the release notes of libxslt to see when (or if) a specific bug has been fixed.

Where are the binary builds?

Binary builds are most often requested by users of Microsoft Windows. Two of the major design issues of this operating system make it non-trivial for its users to build lxml: the lack of a pre-installed standard compiler and the missing package management.

For recent lxml releases, PyPI provides community donated Windows binaries. Besides that, Christoph Gohlke generously provides unofficial lxml binary builds for Windows that are usually very up to date. Consider using them if you prefer a binary build over a signed official source release.

Why do I get errors about missing UCS4 symbols when installing lxml?

You are using a Python installation that was configured for a different internal Unicode representation than the lxml package you are trying to install. CPython versions before 3.3 allowed switching between two types at build time: the 32 bit encoding UCS4 and the 16 bit encoding UCS2. Sadly, the two are not compatible, so eggs and other binary distributions can only support the one they were compiled with.

This means that you have to compile lxml from sources for your system. Note that you do not need Cython for this, the lxml source distribution is directly compilable on both platform types. See the build instructions on how to do this.

Contributing

Why is lxml not written in Python?

It almost is.

lxml is not written in plain Python, because it interfaces with two C libraries: libxml2 and libxslt. Accessing them at the C-level is required for performance reasons.

However, to avoid writing plain C-code and caring too much about the details of built-in types and reference counting, lxml is written in Cython, a superset of the Python language that translates to C-code. Chances are that if you know Python, you can write code that Cython accepts. Again, the C-ish style used in the lxml code is just for performance optimisations. If you want to contribute, don't bother with the details, a Python implementation of your contribution is better than none. And keep in mind that lxml's flexible API often favours an implementation of features in pure Python, without bothering with C-code at all. For example, the lxml.html package is written entirely in Python.

Please contact the mailing list if you need any help.

How can I contribute?

If you find something that you would like lxml to do (or do better), then please tell us about it on the mailing list. Patches are always appreciated, especially when accompanied by unit tests and documentation (doctests would be great). See the tests subdirectories in the lxml source tree (below the src directory) and the ReST text files in the doc directory.

We also have a list of missing features that we would like to implement but didn't due to lack of time. If you find the time, patches are very welcome.

Besides enhancing the code, there are a lot of places where you can help the project and its user base. You can

- spread the word and write about lxml. Many users (especially new Python users) have not yet heard about lxml, although our user base is constantly growing. If you write your own blog and feel like saying something about lxml, go ahead and do so. If we think your contribution or criticism is valuable to other users, we may even put a link or a quote on the project page.
- provide code examples for the general usage of lxml or specific problems solved with lxml. Readable code is a very good way of showing how a library can be used and what great things you can do with it. Again, if we hear about it, we can set a link on the project page.
- work on the documentation. The web page is generated from a set of ReST text files. It is meant both as a representative project page for lxml and as a site for documenting lxml's API and usage. If you have questions or an idea how to make it more readable and accessible while you are reading it, please send a comment to the mailing list.
- enhance the web site. We put some work into making the web site usable, understandable and also easy to find, but there are always things that can be done better. You may notice that we are not top-ranked when searching the web for "Python and XML", so maybe you have an idea how to improve that.
- help with the tutorial. A tutorial is the most important starting point for new users, so it is important for us to provide an easy to understand guide into lxml. As with all documentation, the tutorial is work in progress, so we appreciate every helping hand.
- improve the docstrings. lxml uses docstrings to support Python's integrated online help() function. However, sometimes these are not sufficient to grasp the details of the function in question. If you find such a place, you can try to write up a better description and send it to the mailing list.

Bugs

My application crashes!

One of the goals of lxml is "no segfaults", so if there is no clear warning in the documentation that you were doing something potentially harmful, you have found a bug and we would like to hear about it. Please report this bug to the mailing list. See the section on bug reporting to learn how to do that.

If your application (or e.g. your web container) uses threads, please see the FAQ section on threading to check if you touch on one of the potential pitfalls.

In any case, try to reproduce the problem with the latest versions of libxml2 and libxslt. From time to time, bugs and race conditions are found in these libraries, so a more recent version might already contain a fix for your problem.

Remember: even if you see lxml appear in a crash stack trace, it is not necessarily lxml that caused the crash.

My application crashes on MacOS-X!

This was a common problem up to lxml 2.1.x. Since lxml 2.2, the only officially supported way to use it on this platform is through a static build against freshly downloaded versions of libxml2 and libxslt. See the build instructions for MacOS-X.

I think I have found a bug in lxml. What should I do?

First, you should look at the current developer changelog to see if this is a known problem that has already been fixed in the master branch since the release you are using.

Also, the 'crash' section above has some good advice on what to try to see if the problem is really in lxml - and not in your setup. Believe it or not, that happens more often than you might think, especially when old libraries or even multiple library versions are installed.

You should always try to reproduce the problem with the latest versions of libxml2 and libxslt - and make sure they are used. lxml.etree can tell you what it runs with:

    import sys
    from lxml import etree

    print("%-20s: %s" % ('Python',           sys.version_info))
    print("%-20s: %s" % ('lxml.etree',       etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used',      etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled',  etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used',     etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

If you can figure out that the problem is not in lxml but in the underlying libxml2 or libxslt, you can ask right on the respective mailing lists, which may considerably reduce the time to find a fix or work-around. See the next question for some hints on how to do that.

Otherwise, we would really like to hear about it. Please report it to the bug tracker or to the mailing list so that we can fix it. It is very helpful in this case if you can come up with a short code snippet that demonstrates your problem. If others can reproduce and see the problem, it is much easier for them to fix it - and maybe even easier for you to describe it and get people convinced that it really is a problem to fix.

It is important that you always report the version of lxml, libxml2 and libxslt that you get from the code snippet above. If we do not know the library versions you are using, we will ask back, so it will take longer for you to get a helpful answer.

Since as a user of lxml you are likely a programmer, you might find this article on bug reports an interesting read.

How do I know a bug is really in lxml and not in libxml2?

A large part of lxml's functionality is implemented by libxml2 and libxslt, so problems that you encounter may be in one or the other. Knowing the right place to ask will reduce the time it takes to fix the problem, or to find a work-around.

Both libxml2 and libxslt come with their own command line frontends, namely xmllint and xsltproc. If you encounter problems with XSLT processing for specific stylesheets or with validation for specific schemas, try to run the XSLT with xsltproc or the validation with xmllint respectively to find out if it fails there as well. If it does, please report directly to the mailing list of the respective project, libxml2 or libxslt.

On the other hand, everything that seems to be related to Python code, including custom resolvers, custom XPath functions, etc. is likely outside of the scope of libxml2/libxslt. If you encounter problems here or you are not sure where the problem may come from, please ask on the lxml mailing list first.

In any case, a good explanation of the problem including some simple test code and some input data will help us (or the libxml2 developers) see and understand the problem, which largely increases your chance of getting help. See the question above for a few hints on what is helpful here.

Threading

Can I use threads to concurrently access the lxml API?

Short answer: yes, if you use lxml 2.2 or later.

Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads.

While you can also share parsers between threads, this will serialize the access to each of them, so it is better to .copy() parsers or to just use the default parser if you do not need any special configuration. The same applies to the XPath evaluators, which use an internal lock to protect their prepared evaluation contexts. It is therefore best to use separate evaluator instances in threads.
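
As a rough sketch of the advice above, the following gives every worker thread its own copy of a configured parser. The input documents and the names BASE_PARSER, parse_in_thread and parse_all are made up for illustration; the parser option is just an example of a special configuration.

```python
import threading
from lxml import etree

# A specially configured parser; the option is only an example.
BASE_PARSER = etree.XMLParser(remove_blank_text=True)

# Made-up input documents.
DOCS = [b"<doc><n>%d</n></doc>" % i for i in range(4)]


def parse_in_thread(index, data, parser, results):
    # Each thread uses its own parser copy, so no thread waits on another.
    root = etree.fromstring(data, parser)
    results[index] = root.findtext("n")


def parse_all(docs):
    results = {}
    threads = [
        threading.Thread(
            target=parse_in_thread,
            # .copy() creates a new parser with the same configuration.
            args=(i, data, BASE_PARSER.copy(), results),
        )
        for i, data in enumerate(docs)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = parse_all(DOCS)
```

If no special configuration is needed, leaving out the parser argument and relying on the default parser achieves the same effect with less code.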

Warning: Before lxml 2.2, and especially before 2.1, there were various issues when moving subtrees between different threads, or when applying XSLT objects from one thread to trees parsed or modified in another. If you need code to run with older versions, you should generally avoid modifying trees in other threads than the one they were generated in. Although this should work in many cases, there are certain scenarios where the termination of a thread that parsed a tree can crash the application if subtrees of this tree were moved to other documents. You should be on the safe side when passing trees between threads if you either

- do not modify these trees and do not move their elements to other trees, or
- do not terminate threads while the trees they parsed are still in use (e.g. by using a fixed size thread-pool or long-running threads in processing chains)

Since lxml 2.2, even multi-thread pipelines are supported. However, note that it is more efficient to do all tree work inside one thread, than to let multiple threads work on a tree one after the other. This is because trees inherit state from the thread that created them, which must be maintained when the tree is modified inside another thread.

Does my program run faster if I use threads?

Depends. The best way to answer this is timing and profiling.

The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.

See the question above to learn which operations free the GIL to support multi-threading.

Would my single-threaded program run faster if I turned off threading?

Possibly, yes. You can see for yourself by compiling lxml entirely without threading support. Pass the --without-threading option to setup.py when building lxml from source. You can also build libxml2 without pthread support (--without-pthreads option), which may add another bit of performance. Note that this will leave internal data structures entirely without thread protection, so make sure you really do not use lxml outside of the main application thread in this case.

Why can't I reuse XSLT stylesheets in other threads?

Since later lxml 2.0 versions, you can do this. There is some overhead involved as the result document needs an additional cleanup traversal when the input document and/or the stylesheet were created in other threads. However, on a multi-processor machine, the gain of freeing the GIL easily covers this drawback.

If you need even the last bit of performance, consider keeping (a copy of) the stylesheet in thread-local storage, and try creating the input document(s) in the same thread. And do not forget to benchmark your code to see if the increased code complexity is really worth it.
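
One possible way to keep a stylesheet per thread is sketched below using threading.local from the standard library. The stylesheet and the helper names get_transform and transform_bytes are hypothetical and only illustrate the idea.

```python
import threading
from lxml import etree

# Made-up stylesheet that wraps the text of <in> into <out>.
XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="/in"/></out>
  </xsl:template>
</xsl:stylesheet>"""

_local = threading.local()


def get_transform():
    """Return this thread's XSLT object, building it on first use."""
    transform = getattr(_local, "transform", None)
    if transform is None:
        transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))
        _local.transform = transform
    return transform


def transform_bytes(data):
    # The input document is parsed in the same thread that transforms it.
    doc = etree.fromstring(data)
    result = get_transform()(doc)
    return result.getroot().text
```

Each thread pays the stylesheet compilation cost once, so this only helps when threads are long-lived and run many transformations.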

My program crashes when run with mod_python/Pyro/Zope/Plone/...

These environments can use threads in a way that may not make it obvious when threads are created and what happens in which thread. This makes it hard to ensure lxml's threading support is used in a reliable way. Sadly, if problems arise, they are as diverse as the applications, so it is difficult to provide any generally applicable solution. Also, these environments are so complex that problems become hard to debug and even harder to reproduce in a predictable way. If you encounter crashes in one of these systems, but your code runs perfectly when started by hand, the following gives you a few hints for possible approaches to solve your specific problem:

- make sure you use recent versions of libxml2, libxslt and lxml. The libxml2 developers keep fixing bugs in each release, and lxml also tries to become more robust against possible pitfalls. So newer versions might already fix your problem in a reliable way. Version 2.2 of lxml contains many improvements.
- make sure the library versions you installed are really used. Do not rely on what your operating system tells you! Print the version constants in lxml.etree from within your runtime environment to make sure it is the case. This is especially a problem under MacOS-X when newer library versions were installed in addition to the outdated system libraries. Please read the bugs section regarding MacOS-X in this FAQ.
- if you use mod_python, try setting this option:

    PythonInterpreter main_interpreter

  There was a discussion on the mailing list about this problem:

    http://comments.gmane.org/gmane.comp.python.lxml.devel/2942
- in a threaded environment, try to initially import lxml.etree from the main application thread instead of doing first-time imports separately in each spawned worker thread. If you cannot control the thread spawning of your web/application server, an import of lxml.etree in sitecustomize.py or usercustomize.py may still do the trick.
- compile lxml without threading support by running setup.py with the --without-threading option. While this might be slower in certain scenarios on multi-processor systems, it might also keep your application from crashing, which should be worth more to you than peak performance. Remember that lxml is fast anyway, so concurrency may not even be worth it.
- look out for fancy XSLT stuff like foreign document access or passing in subtrees through XSLT variables. This might or might not work, depending on your specific usage. Again, later versions of lxml and libxslt provide safer support here.
- try copying trees at suspicious places in your code and working with those instead of a tree shared between threads. Note that the copying must happen inside the target thread to be effective, not in the thread that created the tree. Serialising in one thread and parsing in another is also a simple (and fast) way of separating thread contexts.
- try keeping thread-local copies of XSLT stylesheets, i.e. one per thread, instead of sharing one. Also see the question above.
- you can try to serialise suspicious parts of your code with explicit thread locks, thus disabling the concurrency of the runtime system.
- report back on the mailing list to see if there are other ways to work around your specific problems. Do not forget to report the version numbers of lxml, libxml2 and libxslt you are using (see the question on reporting a bug).

Note that most of these options will degrade performance and/or your code quality. If you are unsure what to do, please ask on the mailing list.
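
The hint about serialising in one thread and parsing in another can be sketched as follows. Only bytes cross the thread boundary, so the consuming thread works on a tree it created itself. The function names and the sample document are invented for this example.

```python
import queue
import threading
from lxml import etree


def producer(out_queue):
    # Build (or parse) the tree here, but hand over bytes only.
    root = etree.XML("<job><item>1</item><item>2</item></job>")
    out_queue.put(etree.tostring(root))


def consumer(in_queue, results):
    data = in_queue.get()
    # Fresh tree, created inside the thread that is going to modify it.
    root = etree.fromstring(data)
    root.append(etree.Element("done"))
    results.append(etree.tostring(root))


def hand_over():
    channel = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(channel,)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


handed_over = hand_over()
```

This trades an extra serialisation and parse step for a clean separation of thread contexts, which matches the performance caveat in the note above.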

Parsing and Serialisation

Why doesn't the pretty_print option reformat my XML output?

Pretty printing (or formatting) an XML document means adding white space to the content. These modifications are harmless if they only impact elements in the document that do not carry (text) data. They corrupt your data if they impact elements that contain data. If lxml cannot distinguish between whitespace and data, it will not alter your data. Whitespace is therefore only added between nodes that do not contain data. This is always the case for trees constructed element-by-element, so no problems should be expected here. For parsed trees, a good way to ensure that no conflicting whitespace is left in the tree is the remove_blank_text option:

    >>> parser = etree.XMLParser(remove_blank_text=True)
    >>> tree = etree.parse(filename, parser)

This will allow the parser to drop blank text nodes when constructing the tree. If you now call a serialization function to pretty print this tree, lxml can add fresh whitespace to the XML tree to indent it.
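
A small self-contained sketch of the effect, using an in-memory document instead of a file. The sample XML is made up; it contains line breaks between tags that would otherwise block re-indentation.

```python
from lxml import etree

# Made-up input with leftover whitespace between tags.
XML = b"<root>\n<a><b>text</b>\n</a></root>"

# Default parser: the blank text nodes stay in the tree.
plain_root = etree.fromstring(XML)
plain = etree.tostring(plain_root, pretty_print=True)

# Parser that drops blank text nodes, so lxml is free to indent.
parser = etree.XMLParser(remove_blank_text=True)
clean_root = etree.fromstring(XML, parser)
pretty = etree.tostring(clean_root, pretty_print=True)

print(pretty.decode("ascii"))
```

Comparing plain and pretty shows that the only difference between the two runs is the parser option, not the serialisation call.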

Note that the remove_blank_text option also uses a heuristic if it has no definite knowledge about the document's ignorable whitespace. It will keep blank text nodes that appear after non-blank text nodes at the same level. This is to prevent document-style XML from losing content.

The HTMLParser has this structural knowledge built-in, which means that most whitespace that appears between tags in HTML documents will not be removed by this option, except in places where it is truly ignorable, e.g. in the page header, between table structure tags, etc. Therefore, it is also safe to use this option with the HTMLParser, as it will keep content like the following intact (i.e. it will not remove the space that separates the two words):

    <p><b>some</b> <em>text</em></p>

If you want to be sure all blank text is removed from an XML document (or just more blank text than the parser does by itself), you have to use either a DTD to tell the parser which whitespace it can safely ignore, or remove the ignorable whitespace manually after parsing, e.g. by setting all tail text to None:

    for element in root.iter():
        element.tail = None

Fredrik Lundh also has a Python-level function for indenting XML by appending whitespace to tags. It can be found on his element library recipe page.

Why can't lxml parse my XML from unicode strings?

First of all, XML is explicitly defined as a stream of bytes. It's not Unicode text. Take a look at the XML specification, it's all about byte sequences and how to map them to text and structure. That leads to rule number one: do not decode your XML data yourself. That's a part of the work of an XML parser, and it does it very well. Just pass it your data as a plain byte stream, it will always do the right thing, by specification.

This also includes not opening XML files in text mode. Make sure you always use binary mode, or, even better, pass the file path into lxml's parse() function to let it do the file opening, reading and closing itself. This is the simplest and most efficient way to do it.

That being said, lxml can read Python unicode strings and even tries to support them if libxml2 does not. This is because there is one valid use case for parsing XML from text strings: literal XML fragments in source code.

However, if the unicode string declares an XML encoding internally (<?xml encoding="..."?>), parsing is bound to fail, as this encoding is almost certainly not the real encoding used in Python unicode. The same is true for HTML unicode strings that contain charset meta tags, although the problems may be more subtle here. The libxml2 HTML parser may not be able to parse the meta tags in broken HTML and may end up ignoring them, so even if parsing succeeds, later handling may still fail with character encoding errors. Therefore, parsing HTML from unicode strings is a much saner thing to do than parsing XML from unicode strings.

Note that Python uses different encodings for unicode on different platforms, so even specifying the real internal unicode encoding is not portable between Python interpreters. Don't do it.

Python unicode strings with XML data that carry encoding information are broken. lxml will not parse them. You must provide parsable data in a valid encoding.
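
The following sketch contrasts the three cases discussed here. The sample document is made up, and the exact exception type raised for the rejected input is an assumption, which is why the code catches it broadly and does not rely on the message.

```python
from lxml import etree

# Bytes with an encoding declaration: the parser decodes them itself.
xml_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><root>caf\u00e9</root>'
).encode("iso-8859-1")
text_from_bytes = etree.fromstring(xml_bytes).text

# Literal fragment without a declaration: the one valid use of text input.
text_from_fragment = etree.fromstring("<root>caf\u00e9</root>").text

# Decoded text that still carries the declaration is expected to be rejected.
try:
    etree.fromstring(xml_bytes.decode("iso-8859-1"))
    declared_text_accepted = True
except (ValueError, etree.LxmlError):  # assumption: one of these is raised
    declared_text_accepted = False
```

In practice this means keeping network or file data as bytes all the way into the parser and reserving text strings for fragments written directly in source code.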

Can lxml parse from file objects opened in unicode/text mode?

Technically, yes. However, you likely do not want to do that, because it is extremely inefficient. The text encoding that libxml2 uses internally is UTF-8, so parsing from a Unicode file means that Python first reads a chunk of data from the file, then decodes it into a new buffer, and then copies it into a new unicode string object, just to let libxml2 make yet another copy while encoding it down into UTF-8 in order to parse it. It's clear that this involves a lot more recoding and copying than when parsing straight from the bytes that the file contains.

If you really know the encoding better than the parser (e.g. when parsing HTML that lacks a content declaration), then instead of passing an encoding parameter into the file object when opening it, create a new instance of an XMLParser or HTMLParser and pass the encoding into its constructor. Afterwards, use that parser for parsing, e.g. by passing it into the etree.parse(file, parser) function. Remember to open the file in binary mode (mode="rb"), or, if possible, prefer passing the file path directly into parse() instead of an opened Python file object.
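A short sketch of that approach, assuming HTML bytes in ISO-8859-1 without any charset declaration. The sample data is hypothetical, and io.BytesIO only stands in for a file opened with mode="rb":

```python
import io
from lxml import etree

# HTML bytes in ISO-8859-1 with no charset declaration (hypothetical sample data).
raw = "<html><body><p>caf\u00e9</p></body></html>".encode("iso-8859-1")

# Tell the parser the encoding instead of decoding the data yourself.
parser = etree.HTMLParser(encoding="iso-8859-1")

# io.BytesIO stands in for a file object opened in binary mode.
tree = etree.parse(io.BytesIO(raw), parser)
paragraph_text = tree.findtext(".//p")
```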

What is the difference between str(xslt(doc)) and xslt(doc).write() ?

The str() implementation of the XSLTResultTree class (a subclass of the ElementTree class) knows about the output method chosen in the stylesheet (xsl:output), write() doesn't. If you call write(), the result will be a normal XML tree serialization in the requested encoding. Calling this method may also fail for XSLT results that are not XML trees (e.g. string results).

If you call str(), it will return the serialized result as specified by the XSL transform. This correctly serializes string results to encoded Python strings and honours xsl:output options like indent. This almost certainly does what you want, so you should only use write() if you are sure that the XSLT result is an XML tree and you want to override the encoding and indentation options requested by the stylesheet.
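To illustrate, here is a small sketch with a made-up stylesheet that requests text output. The result is not an XML tree, so str() is the natural way to get at it:

```python
from lxml import etree

# Hypothetical stylesheet that produces plain text instead of XML.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/root/greeting"/>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)
doc = etree.XML(b"<root><greeting>Hello</greeting></root>")

result = transform(doc)

# str() serializes according to xsl:output, here as plain text.
text_output = str(result)
```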

Why can't I just delete parents or clear the root node in iterparse()?

The iterparse() implementation is based on the libxml2 parser. It requires the tree to be intact to finish parsing. If you delete or modify parents of the current node, chances are you modify the structure in a way that breaks the parser. Normally, this will result in a segfault. Please refer to the iterparse section of the lxml API documentation to find out what you can do and what you can't do.
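A commonly used cleanup pattern that leaves the parents alone is sketched below. It only empties the element that has just been completed and removes siblings that were already processed. The sample data and the "item" tag name are assumptions for illustration:

```python
import io
from lxml import etree

# Hypothetical input: a root element with five <item> children.
xml = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</root>"

values = []
for event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="item"):
    values.append(elem.text)
    # Empty the element that has just been completed.
    elem.clear()
    # Drop already-processed siblings that precede it; the parent stays in place.
    while elem.getprevious() is not None:
        del elem.getparent()[0]
```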

How do I output null characters in XML text?

Don't. What you would produce is not well-formed XML. XML parsers will refuse to parse a document that contains null characters. The right way to embed binary data in XML is using a text encoding such as uuencode or base64.
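For example, a base64 round trip could look like the following sketch. The element name "blob" and the payload are made up:

```python
import base64
from lxml import etree

# Binary data containing null bytes, which cannot appear in XML text directly.
binary_payload = b"\x00\x01\x02binary\x00data"

root = etree.Element("blob")
root.text = base64.b64encode(binary_payload).decode("ascii")
serialized = etree.tostring(root)

# Round trip: parse the document again and decode the text content.
restored = base64.b64decode(etree.fromstring(serialized).text)
```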

Is lxml vulnerable to XML bombs?

This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.

Note also that these "hard limits" may still be high enough to allow for excessive resource usage in a given use case. They are modifiable at compile time, so building your own library versions will allow you to change the limits to your own needs. Also see the next question.

How do I use lxml safely as a web-service endpoint?

XML-based web-service endpoints are generally subject to several types of attacks if they allow some kind of untrusted input. From the point of view of the underlying XML tool, the most obvious attacks try to send a relatively small amount of data that induces a comparatively large resource consumption on the receiver side.

First of all, make sure network access is not enabled for the XML parser that you use for parsing untrusted content and that it is not configured to load external DTDs. Otherwise, attackers can try to trick the parser into an attempt to load external resources that are overly slow or impossible to retrieve, thus wasting time and other valuable resources on your server such as socket connections. Note that you can register your own document loader in lxml, which allows for fine-grained control over any read access to resources.
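A minimal sketch of such a parser configuration follows. Stating the options explicitly documents the intent, whatever the defaults of the installed version are. The request document is hypothetical:

```python
from lxml import etree

# Parser for untrusted input: no network access, no external DTD loading.
restricted_parser = etree.XMLParser(
    no_network=True,
    load_dtd=False,
    dtd_validation=False,
)

root = etree.fromstring(b"<request><id>42</id></request>", restricted_parser)
request_id = root.findtext("id")
```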

Some of the most famous excessive content expansion attacks use XML entity references. Luckily, entity expansion is mostly useless for the data commonly sent through web services and can simply be disabled, which rules out several types of denial of service attacks at once. This also involves an attack that reads local files from the server, as XML entities can be defined to expand into their content. Consequently, version 1.2 of the SOAP standard explicitly disallows entity references in the XML stream.

To disable entity expansion, use an XML parser that is configured with the option resolve_entities=False. Then, after (or while) parsing the document, use root.iter(etree.Entity) to recursively search for entity references. If it contains any, reject the entire input document with a suitable error response. In lxml 3.x, you can also use the new DTD introspection API to apply your own restrictions on input documents.
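The two steps above might be combined as in this sketch. The helper name parse_rejecting_entities and the choice of ValueError are illustrative assumptions, not part of lxml:

```python
from lxml import etree

# Entity references stay in the tree as Entity nodes instead of being expanded.
parser = etree.XMLParser(resolve_entities=False)

def parse_rejecting_entities(data):
    """Parse bytes and raise ValueError if any entity reference is found."""
    root = etree.fromstring(data, parser)
    for entity in root.iter(etree.Entity):
        raise ValueError("entity reference not allowed: %s" % entity.text)
    return root

# Hypothetical inputs for illustration.
malicious = b'<!DOCTYPE r [<!ENTITY x "expanded">]><r>&x;</r>'
harmless = b"<r>plain text</r>"
```

Calling parse_rejecting_entities(harmless) returns the root element, while the document containing the entity reference is rejected as a whole.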

Another attack to consider is compression bombs. If you allow compressed input into your web service, attackers can try to send well-forged, highly repetitive and thus very well compressing input that unpacks into a very large XML document in your server's main memory, potentially a thousand times larger than the compressed input data.

As a countermeasure, either disable compressed input for your web server, at least for untrusted sources, or use incremental parsing with iterparse() instead of parsing the whole input document into memory in one shot. That allows you to enforce suitable limits on the input by applying semantic checks that detect and prevent an illegitimate use of your service. If possible, you can also use this to reduce the amount of data that you need to keep in memory while parsing the document, thus further reducing the possibility of an attacker to trick your system into excessive resource usage.
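One possible shape of such a check is sketched here. The element limit, the helper name and the exception type are arbitrary assumptions, and a real service would apply checks that match its own data model:

```python
import io
from lxml import etree

MAX_ELEMENTS = 1000  # hypothetical limit, tune to your service

def count_elements_with_limit(stream, max_elements=MAX_ELEMENTS):
    """Parse incrementally and abort as soon as the element limit is exceeded."""
    count = 0
    for event, elem in etree.iterparse(stream, events=("start",)):
        count += 1
        if count > max_elements:
            raise ValueError("input exceeds %d elements" % max_elements)
    return count

small_document = b"<root><a/><b/></root>"
element_count = count_elements_with_limit(io.BytesIO(small_document))
```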

Finally, please be aware that XPath suffers from the same vulnerability as SQL when it comes to content injection. The obvious fix is to not build any XPath expressions via string formatting or concatenation when the parameters may come from untrusted sources, and instead use XPath variables, which safely expose their values to the evaluation engine.
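The difference can be seen in a small sketch with a made-up document and a typical injection string:

```python
from lxml import etree

root = etree.XML(b'<users><user name="alice"/><user name="bob"/></users>')

# Hypothetical attacker-controlled value.
untrusted = "' or '1'='1"

# Unsafe: string formatting lets the input rewrite the expression,
# so the predicate becomes true for every user.
unsafe_matches = root.xpath("//user[@name = '%s']" % untrusted)

# Safe: the value is passed as an XPath variable and only compared as a string.
safe_matches = root.xpath("//user[@name = $name]", name=untrusted)
```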

The defusedxml package comes with an example setup and a wrapper API for lxml that applies certain countermeasures internally.

XPath and Document Traversal

What are the `findall()` and `xpath()` methods on Element(Tree)?

findall() is part of the original ElementTree API. It supports a simple subset of the XPath language, without predicates, conditions and other advanced features. It is very handy for finding specific tags in a tree. Another important difference is namespace handling, which uses the {namespace}tagname notation. This is not supported by XPath. The findall, find and findtext methods are compatible with other ElementTree implementations and allow writing portable code that runs on ElementTree, cElementTree and lxml.etree.

xpath(), on the other hand, supports the complete power of the XPath language, including predicates, XPath functions and Python extension functions. The syntax is defined by the XPath specification. If you need the expressiveness and selectivity of XPath, the xpath() method, the XPath class and the XPathEvaluator are the best choice.
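A side-by-side sketch of the two styles, using a hypothetical namespace URI and document:

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI
root = etree.XML(
    b'<a:root xmlns:a="http://example.com/ns">'
    b'<a:item id="1"/><a:item id="2"/>'
    b'</a:root>'
)

# findall(): ElementTree-style path with the {namespace}tagname notation.
found = root.findall("{%s}item" % NS)

# xpath(): full XPath with a predicate and a freely chosen prefix.
selected = root.xpath("//p:item[@id = '2']", namespaces={"p": NS})
```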

Why doesn't `findall()` support full XPath expressions?

It was decided that it is more important to keep compatibility with ElementTree to simplify code migration between the libraries. The main difference compared to XPath is the {namespace}tagname notation used in findall(), which is not valid XPath.

ElementTree and lxml.etree use the same implementation, which assures 100% compatibility. Note that findall() is so fast in lxml that a native implementation would not bring any performance benefits.

How can I find out which namespace prefixes are used in a document?

You can traverse the document (root.iter()) and collect the prefix attributes from all Elements into a set. However, it is unlikely that you really want to do that. You do not need these prefixes, honestly. You only need the namespace URIs. All namespace comparisons use these, so feel free to make up your own prefixes when you use XPath expressions or extension functions.
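If you do want to collect them, a sketch could look like this. The document is made up, and root.iter("*") is used here to restrict the traversal to elements:

```python
from lxml import etree

root = etree.XML(
    b'<a:root xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    b'<b:child/><plain/>'
    b'</a:root>'
)

# Prefixes as written in the document; None for elements without a prefix.
prefixes = {el.prefix for el in root.iter("*")}

# The namespace URIs are what comparisons actually use.
namespaces = {etree.QName(el).namespace for el in root.iter("*")}
```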

The only place where you might consider specifying prefixes is the serialization of Elements that were created through the API. Here, you can specify a prefix mapping through the nsmap argument when creating the root Element. Its children will then inherit this prefix for serialization.
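For instance, with a hypothetical namespace URI and the made-up prefix "ex":

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI

# The prefix mapping is given once, on the root element.
root = etree.Element("{%s}root" % NS, nsmap={"ex": NS})

# The child only names the namespace URI and inherits the prefix.
child = etree.SubElement(root, "{%s}child" % NS)

serialized = etree.tostring(root)
```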

How can I specify a default namespace for XPath expressions?

You can't. In XPath, there is no such thing as a default namespace. Just use an arbitrary prefix and let the namespace dictionary of the XPath evaluators map it to your namespace. See also the question above.
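A brief sketch with a made-up document that uses a default namespace. The prefix "x" is arbitrary and does not appear in the document itself:

```python
from lxml import etree

root = etree.XML(b'<root xmlns="http://example.com/ns"><item/></root>')

# An unprefixed name in XPath means "no namespace", so nothing matches.
without_prefix = root.xpath("//item")

# Any prefix works as long as it maps to the right namespace URI.
with_prefix = root.xpath("//x:item", namespaces={"x": "http://example.com/ns"})
```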
