A Roadmap to XML Parsers in Python

by Bartosz Zaczyński | Reading time estimate: 1h 4m | intermediate

If you’ve ever tried to parse an XML document in Python before, then you know how surprisingly difficult such a task can be. On the one hand, the Zen of Python promises only one obvious way to achieve your goal. At the same time, the standard library follows the batteries included motto by letting you choose from not one but several XML parsers. Luckily, the Python community solved this surplus problem by creating even more XML parsing libraries.

Jokes aside, all XML parsers have their place in a world full of smaller or bigger challenges. It’s worthwhile to familiarize yourself with the available tools.

In this tutorial, you’ll learn how to:

Choose the right XML parsing model
Use the XML parsers in the standard library
Use major XML parsing libraries
Parse XML documents declaratively using data binding
Use safe XML parsers to eliminate security vulnerabilities

You can use this tutorial as a roadmap to guide you through the confusing world of XML parsers in Python. By the end of it, you’ll be able to pick the right XML parser for a given problem. To get the most out of this tutorial, you should already be familiar with XML and its building blocks, as well as how to work with files in Python.

Choose the Right XML Parsing Model

It turns out that you can process XML documents using a few language-agnostic strategies. Each demonstrates different memory and speed trade-offs, which can partially justify the wide range of XML parsers available in Python. In the following section, you’ll find out their differences and strengths.

Document Object Model (DOM)

Historically, the first and the most widespread model for parsing XML has been the DOM, or the Document Object Model, originally defined by the World Wide Web Consortium (W3C). You might have already heard about the DOM because web browsers expose a DOM interface through JavaScript to let you manipulate the HTML code of your websites. Both XML and HTML belong to the same family of markup languages, which makes parsing XML with the DOM possible.

The DOM is arguably the most straightforward and versatile model to use. It defines a handful of standard operations for traversing and modifying document elements arranged in a hierarchy of objects. An abstract representation of the entire document tree is stored in memory, giving you random access to the individual elements.

While the DOM tree allows for fast and omnidirectional navigation, building its abstract representation in the first place can be time-consuming. Moreover, the XML gets parsed at once, as a whole, so it has to be reasonably small to fit the available memory. This renders the DOM suitable only for moderately large configuration files rather than multi-gigabyte XML databases.

Use a DOM parser when convenience is more important than processing time and when memory is not an issue. Some typical use cases are when you need to parse a relatively small document or when you only need to do the parsing infrequently.

Simple API for XML (SAX)

To address the shortcomings of the DOM, the Java community came up with a library through a collaborative effort, which then became an alternative model for parsing XML in other languages. There was no formal specification, only organic discussions on a mailing list. The end result was an event-based streaming API that operates sequentially on individual elements rather than the whole tree.

Elements are processed from top to bottom in the same order they appear in the document. The parser triggers user-defined callbacks to handle specific XML nodes as it finds them in the document. This approach is known as “push” parsing because elements are pushed to your functions by the parser.

SAX also lets you discard elements if you’re not interested in them. This means it has a much lower memory footprint than DOM and can deal with arbitrarily large files, which is great for single-pass processing such as indexing, conversion to other formats, and so on.

However, finding or modifying random tree nodes is cumbersome because it usually requires multiple passes on the document and tracking the visited nodes. SAX is also inconvenient for handling deeply nested elements. Finally, the SAX model only allows for read-only parsing.

In short, SAX is cheap in terms of space and time but more difficult to use than DOM in most cases. It works well for parsing very large documents or parsing incoming XML data in real time.

Streaming API for XML (StAX)

Although somewhat less popular in Python, this third approach to parsing XML builds on top of SAX. It extends the idea of streaming but uses a “pull” parsing model instead, which gives you more control. You can think of StAX as an iterator advancing a cursor object through an XML document, where custom handlers call the parser on demand and not the other way around.

Note: It’s possible to combine more than one XML parsing model. For example, you can use SAX or StAX to quickly find an interesting piece of data in the document and then build a DOM representation of only that particular branch in memory.

Using StAX gives you more control over the parsing process and allows for more convenient state management. The events in the stream are only consumed when requested, enabling lazy evaluation. Other than that, its performance should be on par with SAX, depending on the parser implementation.
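To make the push versus pull distinction concrete, here is a minimal sketch that processes the same tiny document both ways with standard library modules. The `TagCollector` class and the sample XML string are made up for illustration and aren’t part of the tutorial’s sample files:

```python
import xml.sax
from xml.dom import pulldom

XML = "<root><item>a</item><item>b</item></root>"


# Push model (SAX): the parser calls your handler for every event.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


collector = TagCollector()
xml.sax.parseString(XML.encode("utf-8"), collector)
pushed_tags = collector.tags

# Pull model (StAX-like): your own loop asks the parser for the next event.
pulled_tags = [
    node.tagName
    for event, node in pulldom.parseString(XML)
    if event == pulldom.START_ELEMENT
]

print(pushed_tags)
print(pulled_tags)
```

Both approaches visit the elements in document order. The difference is who drives the loop: the parser in the first case, your code in the second.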

Learn About XML Parsers in Python’s Standard Library

In this section, you’ll take a look at Python’s built-in XML parsers, which are available to you in nearly every Python distribution. You’re going to compare those parsers against a sample Scalable Vector Graphics (SVG) image, which is an XML-based format. By processing the same document with different parsers, you’ll be able to choose the one that suits you best.

The sample image, which you’re about to save in a local file for reference, depicts a smiley face. It consists of the following XML content:

XML

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
    <!ENTITY custom_entity "Hello">
]>
<svg xmlns="http://www.w3.org/2000/svg"
  xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
  viewBox="-105 -100 210 270" width="210" height="270">
  <inkscape:custom x="42" inkscape:z="555">Some value</inkscape:custom>
  <defs>
    <linearGradient id="skin" x1="0" x2="0" y1="0" y2="1">
      <stop offset="0%" stop-color="yellow" stop-opacity="1.0"/>
      <stop offset="75%" stop-color="gold" stop-opacity="1.0"/>
      <stop offset="100%" stop-color="orange" stop-opacity="1"/>
    </linearGradient>
  </defs>
  <g id="smiley" inkscape:groupmode="layer" inkscape:label="Smiley">
    <!-- Head -->
    <circle cx="0" cy="0" r="50"
      fill="url(#skin)" stroke="orange" stroke-width="2"/>
    <!-- Eyes -->
    <ellipse cx="-20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
    <ellipse cx="20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
    <!-- Mouth -->
    <path d="M-20 20 A25 25 0 0 0 20 20"
      fill="white" stroke="black" stroke-width="3"/>
  </g>
  <text x="-40" y="75">&custom_entity; &lt;svg&gt;!</text>
  <script>
    <![CDATA[
      console.log("CDATA disables XML parsing: <svg>")
      const smiley = document.getElementById("smiley")
      const eyes = document.querySelectorAll("ellipse")
      const setRadius = r => e => eyes.forEach(x => x.setAttribute("ry", r))
      smiley.addEventListener("mouseenter", setRadius(2))
      smiley.addEventListener("mouseleave", setRadius(8))
    ]]>
  </script>
</svg>

It starts with an XML declaration, followed by a Document Type Definition (DTD) and the <svg> root element. The DTD is optional, but it can help validate your document structure if you decide to use an XML validator. The root element specifies the default namespace xmlns as well as a prefixed namespace xmlns:inkscape for editor-specific elements and attributes. The document also contains:

Nested elements
Attributes
Comments
Character data (CDATA)
Predefined and custom entities

Go ahead, save the XML in a file named smiley.svg, and open it using a modern web browser, which will run the JavaScript snippet present at the end.

The code adds an interactive component to the image. When you hover the mouse over the smiley face, it blinks its eyes. If you want to edit the smiley face using a convenient graphical user interface (GUI), then you can open the file using a vector graphics editor such as Adobe Illustrator or Inkscape.

Note: Unlike JSON or YAML, some features of XML can be exploited by hackers. The standard XML parsers available in the xml package in Python are insecure and vulnerable to an array of attacks. To safely parse XML documents from an untrusted source, prefer secure alternatives. You can jump to the last section in this tutorial for more details.

It’s worth noting that Python’s standard library defines abstract interfaces for parsing XML documents while letting you supply a concrete parser implementation. In practice, you rarely do that because Python bundles a binding for the Expat library, which is a widely used open-source XML parser written in C. All of the following Python modules in the standard library use Expat under the hood by default.

Unfortunately, while the Expat parser can tell you if your document is well-formed, it can’t validate the structure of your documents against an XML Schema Definition (XSD) or a Document Type Definition (DTD). For that, you’ll have to use one of the third-party libraries discussed later.
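As a rough illustration of that difference, the sketch below checks well-formedness by catching the `ExpatError` that `minidom` raises on malformed input. The `is_well_formed()` helper and the sample strings are hypothetical and only serve this example:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if Expat can parse the text, False otherwise."""
    try:
        parseString(xml_text)
    except ExpatError:
        return False
    return True


# The DTD below declares <svg> as EMPTY, yet the element has a child.
# That violates the DTD, but the document is still well-formed.
INVALID_BUT_WELL_FORMED = "<!DOCTYPE svg [<!ELEMENT svg EMPTY>]><svg><g/></svg>"

print(is_well_formed("<svg><g></g></svg>"))     # Properly nested tags
print(is_well_formed("<svg><g></svg></g>"))     # Mismatched closing tags
print(is_well_formed(INVALID_BUT_WELL_FORMED))  # Parsed without validation
```

The last call succeeds because the parser reads the DTD but never checks the document against it.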

`xml.dom.minidom`: Minimal DOM Implementation

Considering that parsing XML documents using the DOM is arguably the most straightforward, you won’t be that surprised to find a DOM parser in the Python standard library. What is surprising, though, is that there are actually two DOM parsers.

The xml.dom package houses two modules to work with DOM in Python:

xml.dom.minidom
xml.dom.pulldom

The first is a stripped-down implementation of the DOM interface conforming to a relatively old version of the W3C specification. It provides common objects defined by the DOM API such as Document, Element, and Attr. This module is poorly documented and has quite limited usefulness, as you’re about to find out.

The second module has a slightly misleading name because it defines a streaming pull parser, which can optionally produce a DOM representation of the current node in the document tree. You’ll find more information about the pulldom parser later.

There are two functions in minidom that let you parse XML data from various data sources. One accepts either a filename or a file object, while another one expects a Python string:

Python

>>> from xml.dom.minidom import parse, parseString
>>> # Parse XML from a filename
>>> document = parse("smiley.svg")
>>> # Parse XML from a file object
>>> with open("smiley.svg") as file:
...     document = parse(file)
...
>>> # Parse XML from a Python string
>>> document = parseString("""\
... <svg viewBox="-105 -100 210 270">
...   <!-- More content goes here... -->
... </svg>
... """)

The triple-quoted string helps embed a multiline string literal without using the continuation character (\) at the end of each line. In any case, you’ll end up with a Document instance, which exhibits the familiar DOM interface, letting you traverse the tree.

Apart from that, you’ll be able to access the XML declaration, DTD, and the root element:

Python

>>> document = parse("smiley.svg")
>>> # XML Declaration
>>> document.version, document.encoding, document.standalone
('1.0', 'UTF-8', False)
>>> # Document Type Definition (DTD)
>>> dtd = document.doctype
>>> dtd.entities["custom_entity"].childNodes
[<DOM Text node "'Hello'">]
>>> # Document Root
>>> document.documentElement
<DOM Element: svg at 0x7fc78c62d790>

As you can see, even though the default XML parser in Python can’t validate documents, it still lets you inspect .doctype, the DTD, if it’s present. Note that the XML declaration and DTD are optional. If the XML declaration or a given XML attribute is missing, then the corresponding Python attributes will be None.
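The following sketch shows what that looks like for documents without a declaration or DTD. The inline XML strings are made up for the example and stand in for a real file:

```python
from xml.dom.minidom import parseString

# No XML declaration and no DTD at all
bare = parseString("<svg><circle r='50'/></svg>")
print(bare.version, bare.encoding, bare.standalone)
print(bare.doctype)

# A declaration that only specifies the version
partial = parseString('<?xml version="1.0"?><svg/>')
print(partial.version, partial.encoding, partial.standalone)
```

Because any of these attributes can be `None`, it’s safer to check them before relying on their values.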

To find an element by ID, you must use the Document instance rather than a specific parent Element. The sample SVG image has two nodes with an id attribute, but you can’t find either of them:

Python

>>> document.getElementById("skin") is None
True
>>> document.getElementById("smiley") is None
True

That may be surprising for someone who has only worked with HTML and JavaScript but hasn’t worked with XML before. While HTML defines the semantics for certain elements and attributes such as <body> or id, XML doesn’t attach any meaning to its building blocks. You need to mark an attribute as an ID explicitly using DTD or by calling .setIdAttribute() in Python, for example:

Definition Style	Implementation
DTD	`<!ATTLIST linearGradient id ID #IMPLIED>`
Python	`linearGradient.setIdAttribute("id")`

However, using a DTD isn’t enough to fix the problem if your document has a default namespace, which is the case for the sample SVG image. To address this, you can visit all elements recursively in Python, check whether they have the id attribute, and indicate it as their ID in one go:

Python

>>> from xml.dom.minidom import parse, Node
>>> def set_id_attribute(parent, attribute_name="id"):
...     if parent.nodeType == Node.ELEMENT_NODE:
...         if parent.hasAttribute(attribute_name):
...             parent.setIdAttribute(attribute_name)
...     for child in parent.childNodes:
...         set_id_attribute(child, attribute_name)
...
>>> document = parse("smiley.svg")
>>> set_id_attribute(document)

Your custom set_id_attribute() function takes a parent element and an optional name for the identity attribute, which defaults to "id". When you call that function on your SVG document, all child elements that have an id attribute will become accessible through the DOM API:

Python

>>> document.getElementById("skin")
<DOM Element: linearGradient at 0x7f82247703a0>
>>> document.getElementById("smiley")
<DOM Element: g at 0x7f8224770940>

Now, you’re getting the expected XML element corresponding to the id attribute’s value.

Using an ID allows for finding at most one unique element, but you can also find a collection of similar elements by their tag name. Unlike the .getElementById() method, you can call .getElementsByTagName() on the document or a particular parent element to reduce the search scope:

Python

>>> document.getElementsByTagName("ellipse")
[
    <DOM Element: ellipse at 0x7fa2c944f430>,
    <DOM Element: ellipse at 0x7fa2c944f4c0>
]
>>> root = document.documentElement
>>> root.getElementsByTagName("ellipse")
[
    <DOM Element: ellipse at 0x7fa2c944f430>,
    <DOM Element: ellipse at 0x7fa2c944f4c0>
]

Notice that .getElementsByTagName() always returns a list of elements instead of a single element or None. Forgetting about it when you switch between both methods is a common source of errors.

Unfortunately, elements like <inkscape:custom> that are prefixed with a namespace identifier won’t be included. They must be searched using .getElementsByTagNameNS(), which expects different arguments:

Python

>>> document.getElementsByTagNameNS(
...     "http://www.inkscape.org/namespaces/inkscape",
...     "custom"
... )
[<DOM Element: inkscape:custom at 0x7f97e3f2a3a0>]
>>> document.getElementsByTagNameNS("*", "custom")
[<DOM Element: inkscape:custom at 0x7f97e3f2a3a0>]

The first argument must be the XML namespace, which typically has the form of a domain name, while the second argument is the tag name. Notice that the namespace prefix is irrelevant! To search all namespaces, you can provide a wildcard character (*).

Note: To find the namespaces declared in your XML document, you can check out the root element’s attributes. In theory, they could be declared on any element, but the top-level one is where you’d usually find them.
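As a minimal sketch of that idea, you could filter the root element’s attributes for names that start with `xmlns`. The inline document below is a trimmed-down stand-in for the sample SVG, and the `namespaces` variable name is arbitrary:

```python
from xml.dom.minidom import parseString

document = parseString("""\
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
     width="210" height="270"/>
""")

root = document.documentElement

# Keep only the namespace declarations, skipping regular attributes
namespaces = {
    name: value
    for name, value in root.attributes.items()
    if name == "xmlns" or name.startswith("xmlns:")
}
print(namespaces)
```

This only inspects the root element, so namespaces declared deeper in the tree would be missed.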

Once you locate the element you’re interested in, you may use it to walk over the tree. However, another jarring quirk with minidom is how it handles whitespace characters between elements:

Python

>>> element = document.getElementById("smiley")
>>> element.parentNode
<DOM Element: svg at 0x7fc78c62d790>
>>> element.firstChild
<DOM Text node "'\n    '">
>>> element.lastChild
<DOM Text node "'\n  '">
>>> element.nextSibling
<DOM Text node "'\n  '">
>>> element.previousSibling
<DOM Text node "'\n  '">

The newline characters and leading indentation are captured as separate tree elements, which is what the specification requires. Some parsers let you ignore these, but not the Python one. What you can do, however, is collapse whitespace in such nodes manually:

Python

>>> def remove_whitespace(node):
...     if node.nodeType == Node.TEXT_NODE:
...         if node.nodeValue.strip() == "":
...             node.nodeValue = ""
...     for child in node.childNodes:
...         remove_whitespace(child)
...
>>> document = parse("smiley.svg")
>>> set_id_attribute(document)
>>> remove_whitespace(document)
>>> document.normalize()

Note that you also have to call .normalize() on the document to combine adjacent text nodes. Otherwise, you could end up with a bunch of redundant XML nodes with just whitespace. Again, recursion is the only way to visit tree elements since you can’t iterate over the document and its elements with a loop. Finally, this should give you the expected result:

Python

>>> element = document.getElementById("smiley")
>>> element.parentNode
<DOM Element: svg at 0x7fc78c62d790>
>>> element.firstChild
<DOM Comment node "' Head '">
>>> element.lastChild
<DOM Element: path at 0x7f8beea0f670>
>>> element.nextSibling
<DOM Element: text at 0x7f8beea0f700>
>>> element.previousSibling
<DOM Element: defs at 0x7f8beea0f160>
>>> element.childNodes
[
    <DOM Comment node "' Head '">,
    <DOM Element: circle at 0x7f8beea0f4c0>,
    <DOM Comment node "' Eyes '">,
    <DOM Element: ellipse at 0x7fa2c944f430>,
    <DOM Element: ellipse at 0x7fa2c944f4c0>,
    <DOM Comment node "' Mouth '">,
    <DOM Element: path at 0x7f8beea0f670>
]

Elements expose a few helpful methods and properties to let you query their details:

Python

>>> element = document.getElementsByTagNameNS("*", "custom")[0]
>>> element.prefix
'inkscape'
>>> element.tagName
'inkscape:custom'
>>> element.attributes
<xml.dom.minidom.NamedNodeMap object at 0x7f6c9d83ba80>
>>> dict(element.attributes.items())
{'x': '42', 'inkscape:z': '555'}
>>> element.hasChildNodes()
True
>>> element.hasAttributes()
True
>>> element.hasAttribute("x")
True
>>> element.getAttribute("x")
'42'
>>> element.getAttributeNode("x")
<xml.dom.minidom.Attr object at 0x7f82244a05f0>
>>> element.getAttribute("missing-attribute")
''

For instance, you can check an element’s namespace, tag name, or attributes. If you ask for a missing attribute, then you’ll get an empty string ('').

Dealing with namespaced attributes isn’t much different. You just have to remember to prefix the attribute name accordingly or provide the domain name:

Python

>>> element.hasAttribute("z")
False
>>> element.hasAttribute("inkscape:z")
True
>>> element.hasAttributeNS(
...     "http://www.inkscape.org/namespaces/inkscape",
...     "z"
... )
True
>>> element.hasAttributeNS("*", "z")
False

Strangely enough, the wildcard character (*) doesn’t work here as it did with the .getElementsByTagNameNS() method before.
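If you need wildcard-like behavior for attributes, one possible workaround is to compare each attribute’s local name yourself. The `has_attribute_any_ns()` helper below is hypothetical, not part of `minidom`, and the inline document is a reduced version of the sample:

```python
from xml.dom.minidom import parseString


def has_attribute_any_ns(element, local_name):
    """Hypothetical helper: match an attribute by local name in any namespace."""
    attributes = element.attributes
    return any(
        attributes.item(index).localName == local_name
        for index in range(attributes.length)
    )


document = parseString("""\
<root xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">
  <inkscape:custom x="42" inkscape:z="555">Some value</inkscape:custom>
</root>
""")
element = document.getElementsByTagNameNS("*", "custom")[0]

print(has_attribute_any_ns(element, "z"))        # Namespaced attribute
print(has_attribute_any_ns(element, "x"))        # Plain attribute
print(has_attribute_any_ns(element, "missing"))  # No such attribute
```

Keep in mind that this ignores namespaces entirely, so two attributes with the same local name in different namespaces would be indistinguishable.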

Since this tutorial is only about XML parsing, you’ll need to check the minidom documentation for methods that modify the DOM tree. They mostly follow the W3C specification.

As you can see, the minidom module isn’t terribly convenient. Its main advantage comes from being part of the standard library, which means you don’t have to install any external dependencies in your project to work with the DOM.

`xml.sax`: The SAX Interface for Python

To start working with SAX in Python, you can use the same parse() and parseString() convenience functions as before, but from the xml.sax package instead. You also have to provide at least one more required argument, which must be a content handler instance. In the spirit of Java, you provide one by subclassing a specific base class:

Python

from xml.sax import parse
from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):
    pass

parse("smiley.svg", SVGHandler())

The content handler receives a stream of events corresponding to elements in your document as it’s being parsed. Running this code won’t do anything useful yet because your handler class is empty. To make it work, you’ll need to override one or more callback methods from the superclass.

Fire up your favorite editor, type the following code, and save it in a file named svg_handler.py:

Python

# svg_handler.py

from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):

    def startElement(self, name, attrs):
        print(f"BEGIN: <{name}>, {attrs.keys()}")

    def endElement(self, name):
        print(f"END: </{name}>")

    def characters(self, content):
        if content.strip() != "":
            print("CONTENT:", repr(content))

This modified content handler prints out a few events onto the standard output. The SAX parser will call these three methods for you in response to finding the start tag, end tag, and some text between them. Open an interactive session of the Python interpreter, import your content handler, and give it a test drive. It should produce the following output:

Python

>>> from xml.sax import parse
>>> from svg_handler import SVGHandler
>>> parse("smiley.svg", SVGHandler())
BEGIN: <svg>, ['xmlns', 'xmlns:inkscape', 'viewBox', 'width', 'height']
BEGIN: <inkscape:custom>, ['x', 'inkscape:z']
CONTENT: 'Some value'
END: </inkscape:custom>
BEGIN: <defs>, []
BEGIN: <linearGradient>, ['id', 'x1', 'x2', 'y1', 'y2']
BEGIN: <stop>, ['offset', 'stop-color', 'stop-opacity']
END: </stop>
⋮

That’s essentially the observer design pattern, which lets you translate XML into another hierarchical format incrementally. Say you wanted to convert that SVG file into a simplified JSON representation. First, you’ll want to store your content handler object in a separate variable to extract information from it later:

Python

>>> from xml.sax import parse
>>> from svg_handler import SVGHandler
>>> handler = SVGHandler()
>>> parse("smiley.svg", handler)

Since the SAX parser emits events without providing any context about the element it’s found, you need to keep track of where you are in the tree. Therefore, it makes sense to push and pop the current element onto a stack, which you can simulate through a regular Python list. You may also define a helper property .current_element that will return the last element placed on the top of the stack:

Python

# svg_handler.py

# ...

class SVGHandler(ContentHandler):

    def __init__(self):
        super().__init__()
        self.element_stack = []

    @property
    def current_element(self):
        return self.element_stack[-1]

    # ...

When the SAX parser finds a new element, you can immediately capture its tag name and attributes while making placeholders for child elements and the value, both of which are optional. For now, you can store every element as a dict object. Replace your existing .startElement() method with a new implementation:

Python

# svg_handler.py

# ...

class SVGHandler(ContentHandler):

    # ...

    def startElement(self, name, attrs):
        self.element_stack.append({
            "name": name,
            "attributes": dict(attrs),
            "children": [],
            "value": ""
        })

The SAX parser gives you attributes as a mapping that you can convert to a plain Python dictionary with a call to the dict() function. The element value is often spread over multiple pieces that you can concatenate using the plus operator (+) or a corresponding augmented assignment statement:

Python

# svg_handler.py

# ...

class SVGHandler(ContentHandler):

    # ...

    def characters(self, content):
        self.current_element["value"] += content

Aggregating text in such a way will ensure that multiline content ends up in the current element. For example, the <script> tag in the sample SVG file contains six lines of JavaScript code, which trigger separate calls to the characters() callback.

Finally, once the parser stumbles on a closing tag, you can pop the current element from the stack and append it to its parent’s children. If there’s only one element left, then it will be your document’s root that you should keep for later. Other than that, you might want to clean the current element by removing keys with empty values:

Python

# svg_handler.py

# ...

class SVGHandler(ContentHandler):

    # ...

    def endElement(self, name):
        clean(self.current_element)
        if len(self.element_stack) > 1:
            child = self.element_stack.pop()
            self.current_element["children"].append(child)

def clean(element):
    element["value"] = element["value"].strip()
    for key in ("attributes", "children", "value"):
        if not element[key]:
            del element[key]

Note that clean() is a function defined outside of the class body. Cleaning must be done at the end since there’s no way of knowing up front how many text pieces there might be to concatenate. The complete content handler’s code follows:

Python

# svg_handler.py

from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):

    def __init__(self):
        super().__init__()
        self.element_stack = []

    @property
    def current_element(self):
        return self.element_stack[-1]

    def startElement(self, name, attrs):
        self.element_stack.append({
            "name": name,
            "attributes": dict(attrs),
            "children": [],
            "value": ""
        })

    def endElement(self, name):
        clean(self.current_element)
        if len(self.element_stack) > 1:
            child = self.element_stack.pop()
            self.current_element["children"].append(child)

    def characters(self, content):
        self.current_element["value"] += content

def clean(element):
    element["value"] = element["value"].strip()
    for key in ("attributes", "children", "value"):
        if not element[key]:
            del element[key]

Now, it’s time to put everything to the test by parsing the XML, extracting the root element from your content handler, and dumping it to a JSON string:

Python

>>> from xml.sax import parse
>>> from svg_handler import SVGHandler
>>> handler = SVGHandler()
>>> parse("smiley.svg", handler)
>>> root = handler.current_element
>>> import json
>>> print(json.dumps(root, indent=4))
{
    "name": "svg",
    "attributes": {
        "xmlns": "http://www.w3.org/2000/svg",
        "xmlns:inkscape": "http://www.inkscape.org/namespaces/inkscape",
        "viewBox": "-105 -100 210 270",
        "width": "210",
        "height": "270"
    },
    "children": [
        {
            "name": "inkscape:custom",
            "attributes": {
                "x": "42",
                "inkscape:z": "555"
            },
            "value": "Some value"
        },
⋮

It’s worth noting that this implementation has no memory gain over DOM because it builds an abstract representation of the whole document just as before. The difference is that you’ve made a custom dictionary representation instead of the standard DOM tree. However, you could imagine writing directly to a file or a database instead of memory while receiving SAX events. That would effectively lift your computer memory limit.
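To give a rough idea of what such streaming could look like, here is a sketch of a handler that writes one JSON line per start tag to a file-like object instead of building a tree. The `StreamingHandler` class, its `sink` parameter, and the record layout are assumptions made for this example, with `io.StringIO` standing in for a real file or database writer:

```python
import io
import json
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class StreamingHandler(ContentHandler):
    """Hypothetical handler that writes one JSON line per start tag."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink  # Any object with a .write() method
        self.depth = 0

    def startElement(self, name, attrs):
        record = {"depth": self.depth, "name": name, "attributes": dict(attrs)}
        self.sink.write(json.dumps(record) + "\n")
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1


sink = io.StringIO()  # Stand-in for a real file or database writer
parseString(
    b"<svg><g id='smiley'><circle r='50'/></g></svg>",
    StreamingHandler(sink),
)
print(sink.getvalue(), end="")
```

The handler only keeps a depth counter in memory, so its footprint stays constant regardless of the document size. The trade-off is that the output is a flat list of records rather than a nested structure.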

If you want to parse XML namespaces, then you’ll need to create and configure the SAX parser yourself with a bit of boilerplate code and also implement slightly different callbacks:

Python

# svg_handler.py

from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):

    def startPrefixMapping(self, prefix, uri):
        print(f"startPrefixMapping: {prefix=}, {uri=}")

    def endPrefixMapping(self, prefix):
        print(f"endPrefixMapping: {prefix=}")

    def startElementNS(self, name, qname, attrs):
        print(f"startElementNS: {name=}")

    def endElementNS(self, name, qname):
        print(f"endElementNS: {name=}")

These callbacks receive additional parameters about the element’s namespace. To make the SAX parser actually trigger those callbacks instead of some of the earlier ones, you must explicitly enable XML namespace support:

Python

>>> from xml.sax import make_parser
>>> from xml.sax.handler import feature_namespaces
>>> from svg_handler import SVGHandler
>>> parser = make_parser()
>>> parser.setFeature(feature_namespaces, True)
>>> parser.setContentHandler(SVGHandler())
>>> parser.parse("smiley.svg")
startPrefixMapping: prefix=None, uri='http://www.w3.org/2000/svg'
startPrefixMapping: prefix='inkscape', uri='http://www.inkscape.org/namespaces/inkscape'
startElementNS: name=('http://www.w3.org/2000/svg', 'svg')
⋮
endElementNS: name=('http://www.w3.org/2000/svg', 'svg')
endPrefixMapping: prefix='inkscape'
endPrefixMapping: prefix=None

Setting this feature turns the element name into a tuple composed of the namespace’s domain name and the tag name.

The xml.sax package offers a decent event-based XML parser interface modeled after the original Java API. It’s somewhat limited compared to the DOM but should be enough to implement a basic XML streaming push parser without resorting to third-party libraries. That said, there’s a less verbose pull parser available in Python, which you’ll explore next.

`xml.dom.pulldom`: Streaming Pull Parser

The parsers in the Python standard library often work together. For example, the xml.dom.pulldom module wraps the parser from xml.sax to take advantage of buffering and read the document in chunks. At the same time, it uses the default DOM implementation from xml.dom.minidom for representing document elements. However, those elements are processed one at a time without bearing any relationship until you ask for it explicitly.

Note: The XML namespace support is enabled by default in xml.dom.pulldom.

While the SAX model follows the observer pattern, you can think of StAX as the iterator design pattern, which lets you loop over a flat stream of events. Once again, you can call the familiar parse() or parseString() functions imported from the module to parse the SVG image:

Python

>>> from xml.dom.pulldom import parse
>>> event_stream = parse("smiley.svg")
>>> for event, node in event_stream:
...     print(event, node)
...
START_DOCUMENT <xml.dom.minidom.Document object at 0x7f74f9283e80>
START_ELEMENT <DOM Element: svg at 0x7f74fde18040>
CHARACTERS <DOM Text node "'\n'">
⋮
END_ELEMENT <DOM Element: script at 0x7f74f92b3c10>
CHARACTERS <DOM Text node "'\n'">
END_ELEMENT <DOM Element: svg at 0x7f74fde18040>

It takes only a few lines of code to parse the document. The most striking difference between xml.sax and xml.dom.pulldom is the lack of callbacks since you drive the whole process. You have a lot more freedom in structuring your code, and you don’t need to use classes if you don’t want to.

Notice that the XML nodes pulled from the stream have types defined in xml.dom.minidom. But if you were to check their parents, siblings, and children, then you’d find out they know nothing about each other:

Python

>>> from xml.dom.pulldom import parse, START_ELEMENT
>>> event_stream = parse("smiley.svg")
>>> for event, node in event_stream:
...     if event == START_ELEMENT:
...         print(node.parentNode, node.previousSibling, node.childNodes)
...
<xml.dom.minidom.Document object at 0x7f90864f6e80> None []
None None []
None None []
None None []
⋮

The relevant attributes are empty. Anyway, the pull parser can help in a hybrid approach to quickly look up some parent element and build a DOM tree only for the branch rooted in it:

Python

from xml.dom.pulldom import parse, START_ELEMENT

def process_group(parent):
    left_eye, right_eye = parent.getElementsByTagName("ellipse")
    # ...

event_stream = parse("smiley.svg")
for event, node in event_stream:
    if event == START_ELEMENT:
        if node.tagName == "g":
            event_stream.expandNode(node)
            process_group(node)

By calling .expandNode() on the event stream, you essentially move the iterator forward and parse XML nodes recursively until finding the matching closing tag of the parent element. The resulting node will have children with properly initialized attributes. Moreover, you’ll be able to use the DOM methods on them.

The pull parser offers an interesting alternative to DOM and SAX by combining the best of both worlds. It’s efficient, flexible, and straightforward to use, leading to more compact and readable code. You could also use it to process multiple XML files at the same time more easily. That said, none of the XML parsers mentioned so far can match the elegance, simplicity, and completeness of the last one to arrive in Python’s standard library.

`xml.etree.ElementTree`: A Lightweight, Pythonic Alternative

The XML parsers you’ve come to know so far get the job done. However, they don’t fit Python’s philosophy very well, and that’s no accident. While DOM follows the W3C specification and SAX was modeled after a Java API, neither feels particularly Pythonic.

Even worse, both DOM and SAX parsers feel antiquated as some of their code in the CPython interpreter hasn’t changed for more than two decades! At the time of writing this, their implementation is still incomplete and has missing typeshed stubs, which breaks code completion in code editors.

Meanwhile, Python 2.5 brought a fresh perspective on parsing and writing XML documents: the ElementTree API. It’s a lightweight, efficient, elegant, and feature-rich interface that even some third-party libraries build on. To get started with it, you must import the `xml.etree.ElementTree` module, which is a bit of a mouthful. Therefore, it’s customary to define an alias like this:

Python

import xml.etree.ElementTree as ET

In slightly older code, you may have seen the `cElementTree` module imported instead. It was an implementation of the same interface written in C, which made it several times faster than the pure-Python one. Today, the regular module uses the fast implementation whenever possible, so you don’t need to bother anymore. In fact, `xml.etree.cElementTree` was deprecated in Python 3.3 and removed in Python 3.9.

You can use the ElementTree API by employing different parsing strategies:

	Non-incremental	Incremental (Blocking)	Incremental (Non-blocking)
`ET.parse()`	✔️
`ET.fromstring()`	✔️
`ET.iterparse()`		✔️
`ET.XMLPullParser`			✔️

The non-incremental strategy loads up the entire document into memory in a DOM-like fashion. There are two appropriately named functions in the module that allow for parsing a file or a Python string with XML content:

Python

>>> import xml.etree.ElementTree as ET

>>> # Parse XML from a filename
>>> ET.parse("smiley.svg")
<xml.etree.ElementTree.ElementTree object at 0x7fa4c980a6a0>

>>> # Parse XML from a file object
>>> with open("smiley.svg") as file:
...     ET.parse(file)
...
<xml.etree.ElementTree.ElementTree object at 0x7fa4c96df340>

>>> # Parse XML from a Python string
>>> ET.fromstring("""\
... <svg viewBox="-105 -100 210 270">
...   <!-- More content goes here... -->
... </svg>
... """)
<Element 'svg' at 0x7fa4c987a1d0>

Parsing a file object or a filename with `parse()` returns an instance of the `ET.ElementTree` class, which represents the whole element hierarchy. On the other hand, parsing a string with `fromstring()` will return the specific root `ET.Element`.
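The difference between the two return types is easy to check in a short script. The following sketch uses an in-memory file object and a made-up `XML` string so that it doesn’t depend on any file on disk:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document used instead of smiley.svg.
XML = '<svg viewBox="-105 -100 210 270"><g/></svg>'

# parse() takes a filename or a file object and returns an ElementTree.
tree = ET.parse(io.StringIO(XML))
root_from_tree = tree.getroot()

# fromstring() skips the wrapper and hands you the root Element.
root_from_string = ET.fromstring(XML)

# You can wrap an Element yourself when you need tree-level methods.
wrapped = ET.ElementTree(root_from_string)

print(type(tree).__name__, type(root_from_string).__name__)
print(root_from_tree.tag, wrapped.getroot().tag)
```

Whichever function you start with, `.getroot()` and the `ET.ElementTree()` constructor let you move between the two representations.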

Alternatively, you can read the XML document incrementally with a streaming pull parser, which yields a sequence of events and elements:

Python

>>> for event, element in ET.iterparse("smiley.svg"):
...     print(event, element.tag)
...
end {http://www.inkscape.org/namespaces/inkscape}custom
end {http://www.w3.org/2000/svg}stop
end {http://www.w3.org/2000/svg}stop
end {http://www.w3.org/2000/svg}stop
end {http://www.w3.org/2000/svg}linearGradient
⋮

By default, `iterparse()` emits only the `end` events associated with the closing XML tag. However, you can subscribe to other events as well by passing their names as strings, such as `"comment"`:

Python

>>> import xml.etree.ElementTree as ET
>>> for event, element in ET.iterparse("smiley.svg", ["comment"]):
...     print(element.text.strip())
...
Head
Eyes
Mouth

Here’s a list of all the available event types:

start: Start of an element
end: End of an element
comment: Comment element
pi: Processing instruction, as in XSL
start-ns: Start of a namespace
end-ns: End of a namespace
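As a rough illustration of how a few of these event types interleave, the sketch below subscribes to `start-ns`, `start`, and `end` on a small in-memory document. The `XML` bytes and the `events_seen` list are invented for this example:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace.
XML = b"""\
<svg xmlns="http://www.w3.org/2000/svg">
  <g><ellipse/><ellipse/></g>
</svg>"""

events_seen = []
wanted = ("start-ns", "start", "end")
for event, payload in ET.iterparse(io.BytesIO(XML), events=wanted):
    if event == "start-ns":
        # For namespace events, the payload is a (prefix, uri) tuple.
        prefix, uri = payload
        events_seen.append((event, prefix, uri))
    else:
        # For element events, the payload is the element itself.
        local_name = payload.tag.split("}")[-1]
        events_seen.append((event, local_name))
        if event == "end":
            payload.clear()  # Drop children you no longer need

for entry in events_seen:
    print(entry)
```

Calling `.clear()` on elements you’ve finished with is a common way to keep memory usage flat when the document is large, since `iterparse()` otherwise still builds the whole tree in the background.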

The downside of `iterparse()` is that it uses blocking calls to read the next chunk of data, which might be unsuitable for asynchronous code running on a single thread of execution. To alleviate that, you can look into `XMLPullParser`, which is a little bit more verbose:

Python

import xml.etree.ElementTree as ET

async def receive_data(url):
    """Download chunks of bytes from the URL asynchronously."""
    yield b"<svg "
    yield b"viewBox=\"-105 -100 210 270\""
    yield b"></svg>"

async def parse(url, events=None):
    parser = ET.XMLPullParser(events)
    async for chunk in receive_data(url):
        parser.feed(chunk)
        for event, element in parser.read_events():
            yield event, element

This hypothetical example feeds the parser with chunks of XML that can arrive a few seconds apart. Once there’s enough content, you can iterate over a sequence of events and elements buffered by the parser. This non-blocking incremental parsing strategy allows for a truly concurrent parsing of multiple XML documents on the fly while you download them.
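You can try the same feed-and-read cycle synchronously. The sketch below pushes a few hand-made chunks into the parser and drains whatever events it has buffered after each one. The `chunks` list is an assumption standing in for data arriving over the network, and exactly when events surface may depend on the underlying Expat version:

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))

# Hypothetical chunks that split the document at awkward places.
chunks = [b"<svg ", b'viewBox="-105 -100 210 270"', b"><g/>", b"</svg>"]

log = []
for chunk in chunks:
    parser.feed(chunk)
    # read_events() never blocks. It only yields what's ready so far.
    log.extend((event, element.tag) for event, element in parser.read_events())

# Closing the parser flushes anything that was still pending.
parser.close()
log.extend((event, element.tag) for event, element in parser.read_events())

print(log)
```

Because `.feed()` and `.read_events()` return immediately, you’re free to do other work, or serve other connections, between chunks.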

Elements in the tree are mutable, iterable, and indexable sequences. They have a length corresponding to the number of their immediate children:

Python

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse("smiley.svg")
>>> root = tree.getroot()

>>> # The length of an element equals the number of its children.
>>> len(root)
5

>>> # The square brackets let you access a child by an index.
>>> root[1]
<Element '{http://www.w3.org/2000/svg}defs' at 0x7fe05d2e8860>
>>> root[2]
<Element '{http://www.w3.org/2000/svg}g' at 0x7fa4c9848400>

>>> # Elements are mutable. For example, you can swap their children.
>>> root[2], root[1] = root[1], root[2]

>>> # You can iterate over an element's children.
>>> for child in root:
...     print(child.tag)
...
{http://www.inkscape.org/namespaces/inkscape}custom
{http://www.w3.org/2000/svg}g
{http://www.w3.org/2000/svg}defs
{http://www.w3.org/2000/svg}text
{http://www.w3.org/2000/svg}script

Tag names might be prefixed with an optional namespace enclosed in a pair of curly braces (`{}`). The default XML namespace appears there, too, when defined. Notice how the swap assignment made the `<g>` element come before `<defs>`. This shows the mutable nature of the sequence.

Here are a few more element attributes and methods that are worth mentioning:

Python

>>> element = root[0]

>>> element.tag
'{http://www.inkscape.org/namespaces/inkscape}custom'

>>> element.text
'Some value'

>>> element.attrib
{'x': '42', '{http://www.inkscape.org/namespaces/inkscape}z': '555'}

>>> element.get("x")
'42'

One of the benefits of this API is how it uses Python’s native data types. Above, it uses a Python dictionary for the element’s attributes. In the previous modules, those were wrapped in less convenient adapters. Unlike the DOM, the ElementTree API doesn’t expose methods or properties for walking over the tree in any direction, but there are a couple of better alternatives.
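Because the attributes live in an ordinary dictionary, you can read and update them with the tools you already know. Here’s a small sketch on a hand-written element, where the attribute names are arbitrary:

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<custom x="42">Some value</custom>')

# .attrib is a regular dict, so item assignment works as usual.
element.attrib["y"] = "7"

# .set() and .get() are shortcuts that operate on the same dict.
element.set("x", "43")
missing = element.get("z", "n/a")

names = sorted(element.keys())
serialized = ET.tostring(element, encoding="unicode")
print(names, missing, serialized)
```

Note that attribute values are always strings, so converting them to numbers or Booleans is up to you.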

As you’ve seen before, instances of the `Element` class implement the sequence protocol, letting you iterate over their direct children with a loop:

Python

>>> for child in root:
...     print(child.tag)
...
{http://www.inkscape.org/namespaces/inkscape}custom
{http://www.w3.org/2000/svg}defs
{http://www.w3.org/2000/svg}g
{http://www.w3.org/2000/svg}text
{http://www.w3.org/2000/svg}script

You get the sequence of the root’s immediate children. To go deeper into nested descendants, however, you’ll have to call the `.iter()` method on the ancestor element:

Python

>>> for descendant in root.iter():
...     print(descendant.tag)
...
{http://www.w3.org/2000/svg}svg
{http://www.inkscape.org/namespaces/inkscape}custom
{http://www.w3.org/2000/svg}defs
{http://www.w3.org/2000/svg}linearGradient
{http://www.w3.org/2000/svg}stop
{http://www.w3.org/2000/svg}stop
{http://www.w3.org/2000/svg}stop
{http://www.w3.org/2000/svg}g
{http://www.w3.org/2000/svg}circle
{http://www.w3.org/2000/svg}ellipse
{http://www.w3.org/2000/svg}ellipse
{http://www.w3.org/2000/svg}path
{http://www.w3.org/2000/svg}text
{http://www.w3.org/2000/svg}script

The root element has only five children but thirteen descendants in total. It’s also possible to narrow down the descendants by filtering only specific tag names using an optional `tag` argument:

Python

>>> tag_name = "{http://www.w3.org/2000/svg}ellipse"
>>> for descendant in root.iter(tag_name):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa03b0>
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa0450>

This time, you only got two `<ellipse>` elements. Remember to include the XML namespace, such as `{http://www.w3.org/2000/svg}`, in your tag name, as long as it’s been defined. Otherwise, if you only provide the tag name without the right namespace, you could end up with fewer or more descendant elements than initially anticipated.
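The pitfall is easy to reproduce. In this sketch, a made-up document declares a default namespace, and the `SVG_NS` constant is just a convenience name chosen for the example:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical document that declares a default namespace.
root = ET.fromstring(
    f'<svg xmlns="{SVG_NS}"><g><ellipse/><ellipse/></g></svg>'
)

# A bare tag name doesn't match namespaced elements.
bare = list(root.iter("ellipse"))

# A fully qualified name in Clark notation does.
qualified = list(root.iter(f"{{{SVG_NS}}}ellipse"))

print(len(bare), len(qualified))
```

The tripled curly braces in the f-string produce a literal `{`, the namespace, and a literal `}`, which matches the way ElementTree stores tag names.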

Dealing with namespaces is more convenient when using `.iterfind()`, which accepts an optional mapping of prefixes to namespace URIs. To indicate the default namespace, you can leave the key blank or assign an arbitrary prefix, which must be used in the tag name later:

Python

>>> namespaces = {
...     "": "http://www.w3.org/2000/svg",
...     "custom": "http://www.w3.org/2000/svg"
... }

>>> for descendant in root.iterfind("g", namespaces):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}g' at 0x7f430baa0270>

>>> for descendant in root.iterfind("custom:g", namespaces):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}g' at 0x7f430baa0270>

The namespace mapping lets you refer to the same element with different prefixes. Surprisingly, if you try to find those nested `<ellipse>` elements like before, then `.iterfind()` won’t return anything because it expects an XPath expression rather than a simple tag name:

Python

>>> for descendant in root.iterfind("ellipse", namespaces):
...     print(descendant)
...

>>> for descendant in root.iterfind("g/ellipse", namespaces):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa03b0>
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa0450>

By coincidence, the string `"g"` happens to be a valid path relative to the current `root` element, which is why the function returned a non-empty result before. However, to find the ellipses nested one level deeper in the XML hierarchy, you need a more verbose path expression.
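When you don’t know how deep an element sits, the `.//` prefix searches all descendants, and a predicate in square brackets filters by attribute. The following sketch relies on a hand-written document and an explicit `svg` prefix, both chosen for illustration:

```python
import xml.etree.ElementTree as ET

namespaces = {"svg": "http://www.w3.org/2000/svg"}

# Hypothetical document resembling a fragment of smiley.svg.
root = ET.fromstring("""\
<svg xmlns="http://www.w3.org/2000/svg">
  <g id="face">
    <ellipse cx="-20"/>
    <ellipse cx="20"/>
  </g>
</svg>""")

# Match <ellipse> at any depth below the root.
anywhere = root.findall(".//svg:ellipse", namespaces)

# Narrow the match down with an attribute predicate.
right_eye = root.find(".//svg:ellipse[@cx='20']", namespaces)
group = root.find("svg:g[@id='face']", namespaces)

print(len(anywhere), right_eye.get("cx"), group.get("id"))
```

These are among the constructs covered by ElementTree’s XPath subset. More advanced features, such as XPath functions or arbitrary axes, require a third-party library.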

ElementTree has limited syntax support for the XPath mini-language, which you can use to query elements in XML, similar to CSS selectors in HTML. There are other methods that accept such an expression:

Python

>>> namespaces = {"": "http://www.w3.org/2000/svg"}

>>> root.iterfind("defs", namespaces)
<generator object prepare_child.<locals>.select at 0x7f430ba6d190>

>>> root.findall("defs", namespaces)
[<Element '{http://www.w3.org/2000/svg}defs' at 0x7f430ba09e00>]

>>> root.find("defs", namespaces)
<Element '{http://www.w3.org/2000/svg}defs' at 0x7f430ba09e00>

While `.iterfind()` yields matching elements lazily, `.findall()` returns a list, and `.find()` returns only the first matching element. Similarly, you can extract text enclosed between the opening and closing tags of elements using `.findtext()` or get the inner text of the entire document with `.itertext()`:

Python

>>> namespaces = {"i": "http://www.inkscape.org/namespaces/inkscape"}

>>> root.findtext("i:custom", namespaces=namespaces)
'Some value'

>>> for text in root.itertext():
...     if text.strip() != "":
...         print(text.strip())
...
Some value
Hello <svg>!
console.log("CDATA disables XML parsing: <svg>")
⋮

You look for text embedded in a specific XML element first, then everywhere in the whole document. Searching by text is a powerful feature of the ElementTree API. It’s possible to replicate it using other built-in parsers, but at the cost of increased code complexity and less convenience.
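One detail worth seeing in isolation is how `.itertext()` stitches together text that’s interrupted by child elements. In this sketch, the document and the `<tspan>` child are made up for the purpose:

```python
import xml.etree.ElementTree as ET

# Hypothetical document with mixed content inside <text>.
root = ET.fromstring("""\
<svg>
  <title>Smiley</title>
  <text>Hello <tspan>svg</tspan>!</text>
</svg>""")

# .findtext() returns the text of the first match or a default value.
title = root.findtext("title")
missing = root.findtext("desc", default="(no description)")

# .itertext() walks text and tail fragments in document order.
fragments = [t.strip() for t in root.itertext() if t.strip()]
full_text = "".join(root.find("text").itertext())

print(title, missing, fragments, full_text)
```

Joining the fragments of a single element, as in `full_text`, is a handy way to recover its complete inner text regardless of nested markup.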

The ElementTree API is probably the most intuitive one of them all. It’s Pythonic, efficient, robust, and universal. Unless you have a specific reason to use DOM or SAX, this should be your default choice.

Explore Third-Party XML Parser Libraries

Occasionally, reaching for the XML parsers in the standard library might feel like picking up a sledgehammer to crack a nut. At other times, it’s the opposite, and you wish for a parser that could do much more. For example, you might want to validate XML against a schema or use advanced XPath expressions. In those situations, it’s best to check out the external libraries available on PyPI.

Below, you’ll find a selection of external libraries with varying degrees of complexity and sophistication.

`untangle`: Convert XML to a Python Object

If you’re looking for a one-liner that could turn your XML document into a Python object, then look no further. While it hasn’t been updated in a few years, the `untangle` library might soon become your favorite way of parsing XML in Python. There’s only one function to remember, and it accepts a URL, a filename, a file object, or an XML string:

Python

>>> import untangle

>>> # Parse XML from a URL
>>> untangle.parse("http://localhost:8000/smiley.svg")
Element(name = None, attributes = None, cdata = )

>>> # Parse XML from a filename
>>> untangle.parse("smiley.svg")
Element(name = None, attributes = None, cdata = )

>>> # Parse XML from a file object
>>> with open("smiley.svg") as file:
...     untangle.parse(file)
...
Element(name = None, attributes = None, cdata = )

>>> # Parse XML from a Python string
>>> untangle.parse("""\
... <svg viewBox="-105 -100 210 270">
...   <!-- More content goes here... -->
... </svg>
... """)
Element(name = None, attributes = None, cdata = )

In each case, it returns an instance of the `Element` class. You can use the dot operator to access its children and the square bracket syntax to get XML attributes or one of the child nodes by index. To get the document’s root element, for example, you can access it as if it was the object’s property. To get one of the element’s XML attributes, you may pass its name as a dictionary key:

Python

>>> import untangle
>>> document = untangle.parse("smiley.svg")

>>> document.svg
Element(name = svg, attributes = {'xmlns': ...}, ...)

>>> document.svg["viewBox"]
'-105 -100 210 270'

There are no function or method names to remember. Instead, each parsed object is unique, so you really need to know the underlying XML document’s structure to traverse it with `untangle`.

To find out what the root element’s name is, call `dir()` on the document:

Python

>>> dir(document)
['svg']

This reveals the names of the element’s immediate children. Note that `untangle` redefines the meaning of `dir()` for its parsed documents. Usually, you call this built-in function to inspect a class or a Python module. The default implementation would return a list of attribute names rather than the child elements of an XML document.

If there’s more than one child with the given tag name, then you can iterate over them with a loop or refer to one by index:

Python

>>> dir(document.svg)
['defs', 'g', 'inkscape_custom', 'script', 'text']

>>> dir(document.svg.defs.linearGradient)
['stop', 'stop', 'stop']

>>> for stop in document.svg.defs.linearGradient.stop:
...     print(stop)
...
Element <stop> with attributes {'offset': ...}, ...
Element <stop> with attributes {'offset': ...}, ...
Element <stop> with attributes {'offset': ...}, ...

>>> document.svg.defs.linearGradient.stop[1]
Element(name = stop, attributes = {'offset': ...}, ...)

You might have noticed that the `<inkscape:custom>` element was renamed to `inkscape_custom`. Unfortunately, the library can’t handle XML namespaces well, so if that’s something you need to rely on, then you must look elsewhere.

Because of the dot notation, element names in XML documents must be valid Python identifiers. If they’re not, then `untangle` will automatically rewrite their names by replacing forbidden characters with an underscore:

Python

>>> dir(untangle.parse("<com:company.web-app></com:company.web-app>"))
['com_company_web_app']
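To get a feel for this kind of renaming, here’s a rough standard-library sketch. It isn’t `untangle`’s actual implementation, and the `to_identifier()` helper, including its handling of reserved keywords, is an assumption made for illustration:

```python
import keyword
import re


def to_identifier(tag_name):
    """Turn an XML tag name into something usable after a dot."""
    # Replace every character that can't appear in an identifier.
    name = re.sub(r"\W", "_", tag_name)
    # This sketch's own choice: dodge reserved words like "class".
    if keyword.iskeyword(name):
        name += "_"
    return name


renamed = [
    to_identifier(tag)
    for tag in ("com:company.web-app", "inkscape:custom", "class", "svg")
]
print(renamed)
```

Keep in mind that such a mapping is lossy, since two different tag names, like `web-app` and `web.app`, end up as the same identifier.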

Children’s tag names aren’t the only object properties you can access. Elements have a few predefined object attributes that might be shown by calling `vars()`:

Python

>>> element = document.svg.text

>>> list(vars(element).keys())
['_name', '_attributes', 'children', 'is_root', 'cdata']

>>> element._name
'text'

>>> element._attributes
{'x': '-40', 'y': '75'}

>>> element.children
[]

>>> element.is_root
False

>>> element.cdata
'Hello <svg>!'

Behind the scenes, `untangle` uses the built-in SAX parser, but because the library is implemented in pure Python and creates lots of heavyweight objects, it performs rather poorly. While it’s intended for reading tiny documents, you can still combine it with another approach to read multi-gigabyte XML files.

Here’s how. If you head over to Wikipedia archives, you can download one of their compressed XML files. The one at the top should contain a snapshot of the articles’ abstracts:

XML

<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
    <url>https://en.wikipedia.org/wiki/Anarchism</url>
    <abstract>Anarchism is a political philosophy...</abstract>
    <links>
      <sublink linktype="nav">
        <anchor>Etymology, terminology and definition</anchor>
        <link>https://en.wikipedia.org/wiki/Anarchism#Etymology...</link>
      </sublink>
      <sublink linktype="nav">
        <anchor>History</anchor>
        <link>https://en.wikipedia.org/wiki/Anarchism#History</link>
      </sublink>
      ⋮
    </links>
  </doc>
  ⋮
</feed>

It’s over 6 GB in size after download, which is perfect for this exercise. The idea is to scan the file to find the consecutive opening and closing `<doc>` tags and then parse the XML fragment between them using `untangle` for convenience.

The built-in `mmap` module lets you create a virtual view of the file contents, even when it doesn’t fit the available memory. This gives an impression of working with a huge string of bytes that supports searching and the regular slicing syntax. You can encapsulate this logic in a Python class and take advantage of a generator for lazy evaluation.

Here’s the complete code of the `XMLTagStream` class:

Python

import mmap
import untangle

class XMLTagStream:
    def __init__(self, path, tag_name, encoding="utf-8"):
        self.file = open(path)
        self.stream = mmap.mmap(
            self.file.fileno(), 0, access=mmap.ACCESS_READ
        )
        self.tag_name = tag_name
        self.encoding = encoding
        self.start_tag = f"<{tag_name}>".encode(encoding)
        self.end_tag = f"</{tag_name}>".encode(encoding)

    def __enter__(self):
        return self

    def __exit__(self, *args, **kwargs):
        self.stream.close()
        self.file.close()

    def __iter__(self):
        end = 0
        while (begin := self.stream.find(self.start_tag, end)) != -1:
            end = self.stream.find(self.end_tag, begin)
            yield self.parse(self.stream[begin: end + len(self.end_tag)])

    def parse(self, chunk):
        document = untangle.parse(chunk.decode(self.encoding))
        return getattr(document, self.tag_name)

It’s a custom context manager, which uses the iterator protocol defined as an inline generator function. The resulting generator object loops over the XML document as if it was a long stream of characters.

Note that the `while` loop takes advantage of fairly new Python syntax, the walrus operator (`:=`) introduced in Python 3.8, to simplify the code. You can use this operator in assignment expressions, where an expression can be evaluated and assigned to a variable.

Without getting into the nitty-gritty details, here’s how you can use this custom class to go through a big XML file quickly while inspecting specific elements more thoroughly with `untangle`:

Python

>>> with XMLTagStream("abstract.xml", "doc") as stream:
...     for doc in stream:
...         print(doc.title.cdata.center(50, "="))
...         for sublink in doc.links.sublink:
...             print("-", sublink.anchor.cdata)
...         if "q" == input("Press [q] to exit or any key to continue..."):
...             break
...
===============Wikipedia: Anarchism===============
- Etymology, terminology and definition
- History
- Pre-modern era
⋮
Press [q] to exit or any key to continue...
================Wikipedia: Autism=================
- Characteristics
- Social development
- Communication
⋮
Press [q] to exit or any key to continue...

First, you open a file for reading and indicate the tag name that you want to find. Then, you iterate over those elements and receive a parsed fragment of the XML document. It’s almost like looking through a tiny window moving over an infinitely long sheet of paper. That’s a relatively surface-level example that ignores a few details, but it should give you a general idea of how to use such a hybrid parsing strategy.
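If you’d like to experiment with the scanning idea without installing anything, the sketch below applies the same search-and-slice loop to an in-memory bytes object and parses each fragment with ElementTree instead of `untangle`. The `iter_fragments()` function and the `FEED` sample are hypothetical names introduced for this example:

```python
import xml.etree.ElementTree as ET


def iter_fragments(data, tag_name, encoding="utf-8"):
    """Lazily yield parsed <tag_name> fragments found in data."""
    start_tag = f"<{tag_name}>".encode(encoding)
    end_tag = f"</{tag_name}>".encode(encoding)
    end = 0
    while (begin := data.find(start_tag, end)) != -1:
        end = data.find(end_tag, begin)
        if end == -1:
            break  # Unterminated fragment, so stop scanning
        end += len(end_tag)
        yield ET.fromstring(data[begin:end])


# Hypothetical sample resembling the structure of the abstracts feed.
FEED = b"<feed><doc><title>A</title></doc><doc><title>B</title></doc></feed>"

titles = [doc.findtext("title") for doc in iter_fragments(FEED, "doc")]
print(titles)
```

Since an `mmap` object supports `.find()` and slicing just like `bytes`, the same function should work on a memory-mapped file. Like the class above, it only recognizes opening tags without attributes and doesn’t account for nested elements of the same name.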

`xmltodict`: Convert XML to a Python Dictionary

If you like JSON but you’re not a fan of XML, then check out `xmltodict`, which tries to bridge the gap between both data formats. As the name implies, the library can parse an XML document and represent it as a Python dictionary, which also happens to be the target data type for JSON documents in Python. This makes conversion between XML and JSON possible.

Note: Dictionaries are made up of key-value pairs, while XML documents are inherently hierarchical, which may lead to some information loss during the conversion. On top of that, XML has attributes, comments, processing instructions, and other ways of defining metadata that aren’t available in dictionaries.

Unlike the rest of the XML parsers so far, this one expects either a Python string or a file-like object open for reading in binary mode:

Python

>>> import xmltodict

>>> xmltodict.parse("""\
... <svg viewBox="-105 -100 210 270">
...   <!-- More content goes here... -->
... </svg>
... """)
OrderedDict([('svg', OrderedDict([('@viewBox', '-105 -100 210 270')]))])

>>> with open("smiley.svg", "rb") as file:
...     xmltodict.parse(file)
...
OrderedDict([('svg', ...)])

By default, the library returns an instance of the `OrderedDict` collection to retain element order. However, plain dictionaries also keep the insertion order in CPython 3.6, and the language guarantees this behavior starting from Python 3.7. If you’d like to work with regular dictionaries instead, then pass `dict` as the `dict_constructor` argument to the `parse()` function:

Python

>>> import xmltodict
>>> with open("smiley.svg", "rb") as file:
...     xmltodict.parse(file, dict_constructor=dict)
...
{'svg': ...}

Now, `parse()` returns a plain old dictionary with a familiar textual representation.

To avoid name conflicts between XML elements and their attributes, the library automatically prefixes the latter with an `@` character. You may also ignore attributes completely by setting the `xml_attribs` flag appropriately:

Python

>>> import xmltodict

>>> # Rename attributes by default
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(file)
...     print([x for x in document["svg"] if x.startswith("@")])
...
['@xmlns', '@xmlns:inkscape', '@viewBox', '@width', '@height']

>>> # Ignore attributes when requested
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(file, xml_attribs=False)
...     print([x for x in document["svg"] if x.startswith("@")])
...
[]
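To make the idea of attribute prefixing more tangible, here’s a simplified sketch built on ElementTree. It’s not how `xmltodict` works internally. The `element_to_dict()` function is hypothetical, and the decisions to collect repeated children in a list and to store mixed text under a `#text` key are conventions picked for this example:

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element into nested dicts, lists, and strings."""
    # Attributes get an "@" prefix so they can't clash with child tags.
    node = {f"@{name}": value for name, value in element.attrib.items()}
    for child in element:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are gathered into a list.
            existing = node[child.tag]
            if not isinstance(existing, list):
                node[child.tag] = existing = [existing]
            existing.append(value)
        else:
            node[child.tag] = value
    text = (element.text or "").strip()
    if text:
        if not node:
            return text  # Text-only element collapses to a string
        node["#text"] = text
    return node or None


root = ET.fromstring(
    '<g id="face"><stop offset="0"/><stop offset="1"/><title>Hi</title></g>'
)
document = {root.tag: element_to_dict(root)}
print(document)
```

Even this small converter has to make lossy choices, for instance about comments, tail text, and element order across different tags, which is why round-tripping XML through a dictionary isn’t always exact.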

Yet another piece of information that gets ignored by default is the XML namespace declaration. These are treated like regular attributes, while the corresponding prefixes become part of the tag name. However, you can expand, rename, or skip some of the namespaces if you want to:

Python

>>> import xmltodict

>>> # Ignore namespaces by default
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(file)
...     print(document.keys())
...
odict_keys(['svg'])

>>> # Process namespaces when requested
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(file, process_namespaces=True)
...     print(document.keys())
...
odict_keys(['http://www.w3.org/2000/svg:svg'])

>>> # Rename and skip some namespaces
>>> namespaces = {
...     "http://www.w3.org/2000/svg": "svg",
...     "http://www.inkscape.org/namespaces/inkscape": None,
... }
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(
...         file, process_namespaces=True, namespaces=namespaces
...     )
...     print(document.keys())
...     print("custom" in document["svg:svg"])
...     print("inkscape:custom" in document["svg:svg"])
...
odict_keys(['svg:svg'])
True
False

In the first example above, tag names don’t include the XML namespace prefix. In the second example, they do because you requested to process them. Finally, in the third example, you collapsed the default namespace to `svg` while suppressing Inkscape’s namespace with `None`.

The default string representation of a Python dictionary might not be legible enough. To improve its presentation, you can pretty-print it or convert it to another format such as JSON or YAML:

Python

>>> import xmltodict
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(file, dict_constructor=dict)
...

>>> from pprint import pprint as pp
>>> pp(document)
{'svg': {'@height': '270',
         '@viewBox': '-105 -100 210 270',
         '@width': '210',
         '@xmlns': 'http://www.w3.org/2000/svg',
         '@xmlns:inkscape': 'http://www.inkscape.org/namespaces/inkscape',
         'defs': {'linearGradient': {'@id': 'skin',
         ⋮

>>> import json
>>> print(json.dumps(document, indent=4, sort_keys=True))
{
    "svg": {
        "@height": "270",
        "@viewBox": "-105 -100 210 270",
        "@width": "210",
        "@xmlns": "http://www.w3.org/2000/svg",
        "@xmlns:inkscape": "http://www.inkscape.org/namespaces/inkscape",
        "defs": {
            "linearGradient": {
             ⋮

>>> import yaml  # Install first with 'pip install PyYAML'
>>> print(yaml.dump(document))
svg:
  '@height': '270'
  '@viewBox': -105 -100 210 270
  '@width': '210'
  '@xmlns': http://www.w3.org/2000/svg
  '@xmlns:inkscape': http://www.inkscape.org/namespaces/inkscape
  defs:
    linearGradient:
    ⋮

The `xmltodict` library allows for converting the document the other way around, that is, from a Python dictionary back to an XML string:

Python

>>> import xmltodict
>>> with open("smiley.svg", "rb") as file:
...     document = xmltodict.parse(file, dict_constructor=dict)
...
>>> xmltodict.unparse(document)
'<?xml version="1.0" encoding="utf-8"?>\n<svg...'

The dictionary may come in handy as an intermediate format when converting data from JSON or YAML to XML, should there be such a need.

There are a bunch more features in the `xmltodict` library, such as streaming, so feel free to explore them on your own. However, this library is a bit dated too. Besides, it’s the next library that should be on your radar if you’re really seeking advanced XML parsing features.

`lxml`: Use ElementTree on Steroids

If you want the best performance, the broadest spectrum of functionality, and the most familiar interface all wrapped in one package, then install `lxml` and forget about the rest of the libraries. It’s a Python binding for the C libraries libxml2 and libxslt, which support several standards, including XPath, XML Schema, and XSLT.

The library is compatible with Python’s ElementTree API, which you learned about earlier in this tutorial. That means you can reuse your existing code by replacing only a single import statement:

Python

import lxml.etree as ET

This will give you a great performance boost. On top of that, the `lxml` library comes with an extensive set of features and provides different ways of using them. For example, it lets you validate your XML documents against several schema languages, one of which is the XML Schema Definition:

Python

>>> import lxml.etree as ET

>>> xml_schema = ET.XMLSchema(
...     ET.fromstring("""\
...         <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
...             <xsd:element name="parent"/>
...             <xsd:complexType name="SomeType">
...                 <xsd:sequence>
...                     <xsd:element name="child" type="xsd:string"/>
...                 </xsd:sequence>
...             </xsd:complexType>
...         </xsd:schema>"""))

>>> valid = ET.fromstring("<parent><child></child></parent>")
>>> invalid = ET.fromstring("<child><parent></parent></child>")

>>> xml_schema.validate(valid)
True

>>> xml_schema.validate(invalid)
False

None of the XML parsers in Python’s standard library have the capability to validate documents. Meanwhile, `lxml` lets you define an `XMLSchema` object and run documents through it while remaining largely compatible with the ElementTree API.

Besides the ElementTree API, `lxml` supports an alternative `lxml.objectify` interface, which you’ll cover later in the data binding section.

`BeautifulSoup`: Deal With Malformed XML

You won’t typically use the last library in this comparison for parsing XML since you mostly encounter it when web scraping HTML documents. That said, it’s capable of parsing XML just as well. BeautifulSoup comes with a pluggable architecture that lets you choose the underlying parser. The `lxml` one described earlier is actually recommended by the official documentation and is currently the only XML parser supported by the library.

Depending on the kind of documents you’ll want to parse, the desired efficiency, and feature availability, you can select one of these parsers:

Document Type	Parser Name	Python Library	Speed
HTML	`"html.parser"`	-	Moderate
HTML	`"html5lib"`	`html5lib`	Slow
HTML	`"lxml"`	`lxml`	Fast
XML	`"lxml-xml"` or `"xml"`	`lxml`	Fast

Other than speed, there are noticeable differences between the individual parsers. For example, some of them are more forgiving than others when it comes to malformed elements, while others emulate web browsers better.

Fun Fact: The library’s name refers to the tag soup, which describes syntactically or structurally incorrect HTML code.

Assuming you’ve already installed the `lxml` and `beautifulsoup4` libraries into your active virtual environment, you can start parsing XML documents right away. You only need to import `BeautifulSoup`:

Python

from bs4 import BeautifulSoup

# Parse XML from a file object
with open("smiley.svg") as file:
    soup = BeautifulSoup(file, features="lxml-xml")

# Parse XML from a Python string
soup = BeautifulSoup("""\
<svg viewBox="-105 -100 210 270">
  <!-- More content goes here... -->
</svg>
""", features="lxml-xml")

If you accidentally specified a different parser, say `lxml`, then the library would add missing HTML tags such as `<body>` to the parsed document for you. That probably isn’t what you intended in this case, so be careful when specifying the parser name.

BeautifulSoup is a powerful tool for parsing XML documents because it can handle invalid content and it has a rich API for extracting information. Have a look at how it copes with incorrectly nested tags, forbidden characters, and badly placed text:

Python

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup("""\
... <parent>
...     <child>Forbidden < character </parent>
...     </child>
... ignored
... """, features="lxml-xml")

>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<parent>
 <child>
  Forbidden
 </child>
</parent>

A different parser would raise an exception and surrender as soon as it detected something wrong with the document. Here, not only did it ignore the problems, but it also figured out sensible ways to fix some of them. The elements are properly nested now and have no invalid content.

There are way too many methods of locating elements with BeautifulSoup to cover them all here. Usually, you’ll call a variant of `.find()` or `.find_all()` on the soup element:

Python

>>> from bs4 import BeautifulSoup

>>> with open("smiley.svg") as file:
...     soup = BeautifulSoup(file, features="lxml-xml")
...

>>> soup.find_all("ellipse", limit=1)
[<ellipse cx="-20" cy="-10" fill="black" rx="6" ry="8" stroke="none"/>]

>>> soup.find(x=42)
<inkscape:custom inkscape:z="555" x="42">Some value</inkscape:custom>

>>> soup.find("stop", {"stop-color": "gold"})
<stop offset="75%" stop-color="gold" stop-opacity="1.0"/>

>>> soup.find(text=lambda x: "value" in x).parent
<inkscape:custom inkscape:z="555" x="42">Some value</inkscape:custom>

The `limit` parameter is similar to the `LIMIT` clause in MySQL, which lets you decide how many results you want to receive at most. It will return the specified number of results or fewer. That’s no coincidence. You can think of these search methods as being a simple query language with powerful filters.

The search interface is very flexible but is outside the scope of this tutorial. You can check the library’s documentation for more details or read yet another tutorial about web scraping in Python that touches on BeautifulSoup.

Bind XML Data to Python Objects

Say you want to consume a real-time data feed over a low-latency WebSocket connection with messages exchanged in XML format. For the purposes of this presentation, you’re going to use a web browser to broadcast your mouse and keyboard events to the Python server. You’ll build a custom protocol and use data binding to translate XML into native Python objects.

The idea behind data binding is to define a data model declaratively while letting the program figure out how to extract a valuable piece of information from the XML at runtime. If you’ve ever worked with Django models, then this concept should sound familiar.

First, begin by designing your data model. It’s going to consist of two types of events:

KeyboardEvent
MouseEvent

Each can represent a few specialized subtypes, like a keypress or key release for the keyboard and a click or right-click for the mouse. Here’s a sample XML message produced in response to holding down the Shift+2 key combination:

XML

<KeyboardEvent>
    <Type>keydown</Type>
    <Timestamp>253459.17999999982</Timestamp>
    <Key>
        <Code>Digit2</Code>
        <Unicode>@</Unicode>
    </Key>
    <Modifiers>
        <Alt>false</Alt>
        <Ctrl>false</Ctrl>
        <Shift>true</Shift>
        <Meta>false</Meta>
    </Modifiers>
</KeyboardEvent>

This message contains a specific keyboard event type, a timestamp, the key code and its Unicode, as well as the modifier keys such as Alt, Ctrl, or Shift. The meta key is usually the Win or Cmd key, depending on your keyboard layout.
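To make the structure of this message concrete, here’s a minimal sketch that pulls the same fields out by hand with the standard library’s xml.etree.ElementTree instead of data binding. The KeyPress dataclass and the parse_keyboard_event() helper are hypothetical names made up for this illustration:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class KeyPress:
    """Hypothetical container for the fields of a <KeyboardEvent> message."""
    type_: str
    timestamp: float
    code: str
    unicode: str
    modifiers: frozenset


def parse_keyboard_event(message: str) -> KeyPress:
    root = ET.fromstring(message)
    if root.tag != "KeyboardEvent":
        raise ValueError(f"Unexpected root element: {root.tag}")
    # Keep only the modifier elements whose text is "true"
    held = frozenset(
        element.tag
        for element in root.find("Modifiers")
        if element.text == "true"
    )
    return KeyPress(
        type_=root.findtext("Type"),
        timestamp=float(root.findtext("Timestamp")),
        code=root.findtext("Key/Code"),
        unicode=root.findtext("Key/Unicode"),
        modifiers=held,
    )


message = """\
<KeyboardEvent>
    <Type>keydown</Type>
    <Timestamp>253459.17999999982</Timestamp>
    <Key>
        <Code>Digit2</Code>
        <Unicode>@</Unicode>
    </Key>
    <Modifiers>
        <Alt>false</Alt>
        <Ctrl>false</Ctrl>
        <Shift>true</Shift>
        <Meta>false</Meta>
    </Modifiers>
</KeyboardEvent>"""

event = parse_keyboard_event(message)
print(event.type_, event.code, event.unicode, sorted(event.modifiers))
```

Every field has to be extracted and converted explicitly here, which is exactly the kind of repetitive code that data binding aims to spare you.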

Similarly, a mouse event could look like this:

XML

<MouseEvent>
    <Type>mousemove</Type>
    <Timestamp>52489.07000000145</Timestamp>
    <Cursor>
        <Delta x="-4" y="8"/>
        <Window x="171" y="480"/>
        <Screen x="586" y="690"/>
    </Cursor>
    <Buttons bitField="0"/>
    <Modifiers>
        <Alt>false</Alt>
        <Ctrl>true</Ctrl>
        <Shift>false</Shift>
        <Meta>false</Meta>
    </Modifiers>
</MouseEvent>

Instead of the key, however, there’s the mouse cursor position and a bit field encoding the mouse buttons pressed during the event. A bit field of zero indicates that no button was pressed.
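If you’d like to turn that bit field into something readable, then a small lookup table is enough. The sketch below assumes the bit values of the browser’s MouseEvent.buttons property, which is what the client in this tutorial sends. The BUTTONS mapping and the pressed_buttons() helper are hypothetical names:

```python
# Bit masks of the browser's MouseEvent.buttons property
BUTTONS = {
    1: "primary",     # Usually the left button
    2: "secondary",   # Usually the right button
    4: "auxiliary",   # Usually the wheel or middle button
    8: "back",
    16: "forward",
}


def pressed_buttons(bit_field):
    """Return the names of the buttons whose bits are set."""
    bits = int(bit_field)  # Attribute values arrive as strings
    return [name for mask, name in BUTTONS.items() if bits & mask]


print(pressed_buttons("0"))
print(pressed_buttons("3"))
```

Because each button owns a single bit, any combination of simultaneously pressed buttons fits into one small integer.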

As soon as a client makes the connection, it will start flooding the server with messages. The protocol won’t consist of any handshakes, heartbeats, graceful shutdowns, topic subscriptions, or control messages. You can code this in JavaScript by registering event handlers and creating a WebSocket object in less than fifty lines of code.

However, implementing the client isn’t the point of this exercise. Since you don’t need to understand it, just copy the HTML code with embedded JavaScript below and save it in a file named whatever you like.

HTML

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Real-Time Data Feed</title>
</head>
<body>
    <script>
        const ws = new WebSocket("ws://localhost:8000")
        ws.onopen = event => {
            ["keydown", "keyup"].forEach(name =>
                window.addEventListener(name, event =>
                    ws.send(`\
<KeyboardEvent>
    <Type>${event.type}</Type>
    <Timestamp>${event.timeStamp}</Timestamp>
    <Key>
        <Code>${event.code}</Code>
        <Unicode>${event.key}</Unicode>
    </Key>
    <Modifiers>
        <Alt>${event.altKey}</Alt>
        <Ctrl>${event.ctrlKey}</Ctrl>
        <Shift>${event.shiftKey}</Shift>
        <Meta>${event.metaKey}</Meta>
    </Modifiers>
</KeyboardEvent>`))
            );
            ["mousedown", "mouseup", "mousemove"].forEach(name =>
                window.addEventListener(name, event =>
                    ws.send(`\
<MouseEvent>
    <Type>${event.type}</Type>
    <Timestamp>${event.timeStamp}</Timestamp>
    <Cursor>
        <Delta x="${event.movementX}" y="${event.movementY}"/>
        <Window x="${event.clientX}" y="${event.clientY}"/>
        <Screen x="${event.screenX}" y="${event.screenY}"/>
    </Cursor>
    <Buttons bitField="${event.buttons}"/>
    <Modifiers>
        <Alt>${event.altKey}</Alt>
        <Ctrl>${event.ctrlKey}</Ctrl>
        <Shift>${event.shiftKey}</Shift>
        <Meta>${event.metaKey}</Meta>
    </Modifiers>
</MouseEvent>`))
            )
        }
    </script>
</body>
</html>

The client connects to a local server listening on port 8000. Once you save the HTML code in a file, you’ll be able to open it with your favorite web browser. But before that, you’ll need to implement the server.

Python doesn’t come with WebSocket support, but you can install the websockets library into your active virtual environment. You’re also going to need lxml later, so it’s a good moment to install both dependencies in one go:

Shell

$ python -m pip install websockets lxml

Finally, you can scaffold a minimal asynchronous web server:

Python

# server.py

import asyncio
import websockets

async def handle_connection(websocket, path):
    async for message in websocket:
        print(message)

if __name__ == "__main__":
    future = websockets.serve(handle_connection, "localhost", 8000)
    asyncio.get_event_loop().run_until_complete(future)
    asyncio.get_event_loop().run_forever()

When you start the server and open the saved HTML file in a web browser, you should see XML messages appear in the standard output in response to your mouse moves and key presses. You can open the client in multiple tabs or even multiple browsers simultaneously!


Define Models With XPath Expressions

Right now, your messages arrive in plain string format. It’s not very convenient to work with the messages in this format. Fortunately, you can turn them into compound Python objects with a single line of code using the lxml.objectify module:

Python

# server.py

import asyncio
import websockets
import lxml.objectify

async def handle_connection(websocket, path):
    async for message in websocket:
        try:
            xml = lxml.objectify.fromstring(message)
        except SyntaxError:
            print("Malformed XML message:", repr(message))
        else:
            if xml.tag == "KeyboardEvent":
                if xml.Type == "keyup":
                    print("Key:", xml.Key.Unicode)
            elif xml.tag == "MouseEvent":
                screen = xml.Cursor.Screen
                print("Mouse:", screen.get("x"), screen.get("y"))
            else:
                print("Unrecognized event type")

# ...

As long as the XML parsing is successful, you can inspect the root element’s usual properties, such as the tag name, attributes, inner text, and so on. You’ll be able to use the dot operator to navigate deep into the element tree. In most cases, the library will recognize a suitable Python data type and convert the value for you.
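To get an intuition for what such automatic conversion involves, consider the following rough sketch. It is not how lxml.objectify is implemented, and the library’s real rules are more elaborate. The infer_value() function is a hypothetical helper that only tries a few conversions in order:

```python
def infer_value(text):
    """Guess a Python type for an element's text content."""
    # XML spells Boolean values in lowercase
    if text in ("true", "false"):
        return text == "true"
    # Try the narrowest numeric type first, then fall back to a wider one
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    # Anything else stays a plain string
    return text


for sample in ("820", "253459.17999999982", "true", "Digit2"):
    value = infer_value(sample)
    print(repr(value), type(value).__name__)
```

Guessing types from text alone is inherently ambiguous, which is one reason why declaring the expected types explicitly in a model can be preferable.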

After saving those changes and restarting the server, you’ll need to reload the page in your web browser to make a new WebSocket connection. Here’s a sample output of the modified program:

Shell

$ python server.py
Mouse: 820 121
Mouse: 820 122
Mouse: 820 123
Mouse: 820 124
Mouse: 820 125
Key: a
Mouse: 820 125
Mouse: 820 125
Key: a
Key: A
Key: Shift
Mouse: 821 125
Mouse: 821 125
Mouse: 820 123
⋮

Sometimes, XML may contain tag names that aren’t valid Python identifiers, or you might want to adapt the message structure to fit your data model. In such a case, an interesting option would be defining custom model classes with descriptors that declare how to look up information using XPath expressions. That’s the part that starts to resemble Django models or Pydantic schema definitions.

You’re going to use a custom XPath descriptor and an accompanying Model class, which provide reusable properties for your data models. The descriptor expects an XPath expression for element lookup in the received message. The underlying implementation is a bit advanced, so feel free to copy the code below.

Python

import lxml.objectify

class XPath:
    def __init__(self, expression, /, default=None, multiple=False):
        self.expression = expression
        self.default = default
        self.multiple = multiple

    def __set_name__(self, owner, name):
        self.attribute_name = name
        self.annotation = owner.__annotations__.get(name)

    def __get__(self, instance, owner):
        value = self.extract(instance.xml)
        instance.__dict__[self.attribute_name] = value
        return value

    def extract(self, xml):
        elements = xml.xpath(self.expression)
        if elements:
            if self.multiple:
                if self.annotation:
                    return [self.annotation(x) for x in elements]
                else:
                    return elements
            else:
                first = elements[0]
                if self.annotation:
                    return self.annotation(first)
                else:
                    return first
        else:
            return self.default

class Model:
    """Abstract base class for your models."""

    def __init__(self, data):
        if isinstance(data, str):
            self.xml = lxml.objectify.fromstring(data)
        elif isinstance(data, lxml.objectify.ObjectifiedElement):
            self.xml = data
        else:
            raise TypeError("Unsupported data type:", type(data))

Assuming you already have the desired XPath descriptor and the Model abstract base class in your module, you might use them to define KeyboardEvent and MouseEvent message types along with reusable building blocks to avoid repetition. There are countless ways to do so, but here’s one example:

Python

# ...

class Event(Model):
    """Base class for event messages with common elements."""
    type_: str = XPath("./Type")
    timestamp: float = XPath("./Timestamp")

class Modifiers(Model):
    alt: bool = XPath("./Alt")
    ctrl: bool = XPath("./Ctrl")
    shift: bool = XPath("./Shift")
    meta: bool = XPath("./Meta")

class KeyboardEvent(Event):
    key: str = XPath("./Key/Code")
    modifiers: Modifiers = XPath("./Modifiers")

class MouseEvent(Event):
    x: int = XPath("./Cursor/Screen/@x")
    y: int = XPath("./Cursor/Screen/@y")
    modifiers: Modifiers = XPath("./Modifiers")

The XPath descriptor allows for lazy evaluation so that elements of the XML messages are looked up only when requested. More specifically, they’re only looked up when you access a property on the event object. Moreover, the results are cached to avoid running the same XPath query more than once. The descriptor also respects type annotations and converts deserialized data to the right Python type automatically.
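The lazy lookup and caching rely on a general Python mechanism that you can try without lxml. The following stripped-down sketch swaps in the standard library’s xml.etree.ElementTree, which only understands a limited subset of XPath, and takes an explicit convert argument instead of reading type annotations. The Path class and its lookups counter are hypothetical and exist only to make the caching visible:

```python
import xml.etree.ElementTree as ET


class Path:
    """Non-data descriptor that resolves an element path on first access."""

    def __init__(self, expression, convert=str):
        self.expression = expression
        self.convert = convert
        self.lookups = 0  # Counts queries to show the effect of caching

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        self.lookups += 1
        value = self.convert(instance.xml.findtext(self.expression))
        # Store the result on the instance, which shadows this descriptor,
        # so later attribute accesses never reach __get__ again
        instance.__dict__[self.name] = value
        return value


class KeyboardEvent:
    type_ = Path("./Type")
    timestamp = Path("./Timestamp", convert=float)
    key = Path("./Key/Code")

    def __init__(self, data):
        self.xml = ET.fromstring(data)


event = KeyboardEvent(
    "<KeyboardEvent><Type>keyup</Type><Timestamp>12.5</Timestamp>"
    "<Key><Code>KeyA</Code></Key></KeyboardEvent>"
)

print(event.key, event.key)  # Two accesses...
print(KeyboardEvent.__dict__["key"].lookups)  # ...but only one query
```

The trick works because the descriptor defines only .__get__(). Python gives the instance’s own .__dict__ priority over such non-data descriptors, so the cached value wins on every subsequent access.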

Using those event objects isn’t much different from the ones auto-generated by lxml.objectify before:

Python

if xml.tag == "KeyboardEvent":
    event = KeyboardEvent(xml)
    if event.type_ == "keyup":
        print("Key:", event.key)
elif xml.tag == "MouseEvent":
    event = MouseEvent(xml)
    print("Mouse:", event.x, event.y)
else:
    print("Unrecognized event type")

There’s an additional step of creating new objects of the specific event type. But other than that, it gives you more flexibility in terms of structuring your model independently of the XML protocol. Additionally, it’s possible to derive new model attributes based on the ones in the received messages and add more methods on top of that.

Generate Models From an XML Schema

Implementing model classes is a tedious and error-prone task. However, as long as your model mirrors the XML messages, you can take advantage of an automated tool to generate the necessary code for you based on XML Schema. The downside of such code is that it’s usually less readable than if written by hand.

One of the oldest third-party modules to allow that was PyXB, which mimics Java’s popular JAXB library. Unfortunately, it was last released several years ago and was targeting legacy Python versions. You can look into a similar yet actively maintained generateDS alternative, which generates data structures from XML Schema.

Let’s say you have this models.xsd schema file describing your KeyboardEvent message:

XML

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="KeyboardEvent" type="KeyboardEventType"/>
    <xsd:complexType name="KeyboardEventType">
        <xsd:sequence>
            <xsd:element type="xsd:string" name="Type"/>
            <xsd:element type="xsd:float" name="Timestamp"/>
            <xsd:element type="KeyType" name="Key"/>
            <xsd:element type="ModifiersType" name="Modifiers"/>
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="KeyType">
        <xsd:sequence>
            <xsd:element type="xsd:string" name="Code"/>
            <xsd:element type="xsd:string" name="Unicode"/>
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="ModifiersType">
        <xsd:sequence>
            <xsd:element type="xsd:string" name="Alt"/>
            <xsd:element type="xsd:string" name="Ctrl"/>
            <xsd:element type="xsd:string" name="Shift"/>
            <xsd:element type="xsd:string" name="Meta"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

A schema tells the XML parser what elements to expect, their order, and their level in the tree. It also restricts the allowed values for the XML attributes. Any discrepancies between these declarations and an actual XML document should render it invalid and make the parser reject the document.

Additionally, some tools can leverage this information to produce a piece of code that hides the details of XML parsing from you. After installing the library, you should be able to run the generateDS command in your active virtual environment:

Shell

$ generateDS -o models.py models.xsd

It will create a new file named models.py in the same directory with the generated Python source code. You can then import that module and use it to parse the incoming messages:

Python

>>> from models import parseString

>>> event = parseString("""\
... <KeyboardEvent>
...     <Type>keydown</Type>
...     <Timestamp>253459.17999999982</Timestamp>
...     <Key>
...         <Code>Digit2</Code>
...         <Unicode>@</Unicode>
...     </Key>
...     <Modifiers>
...         <Alt>false</Alt>
...         <Ctrl>false</Ctrl>
...         <Shift>true</Shift>
...         <Meta>false</Meta>
...     </Modifiers>
... </KeyboardEvent>""", silence=True)

>>> event.Type, event.Key.Code
('keydown', 'Digit2')

It looks similar to the lxml.objectify example shown earlier. The difference is that using data binding enforces compliance with the schema, whereas lxml.objectify produces objects dynamically no matter if they’re semantically correct.


Defuse the XML Bomb With Secure Parsers

The XML parsers in Python’s standard library are vulnerable to a host of security threats that can lead to denial-of-service (DoS) or data loss, at best. That isn’t their fault, to be fair. They just follow the specification of the XML standard, which is more complicated and powerful than most people know.

Note: Please be advised that you should use the information you’re about to see wisely. You don’t want to wind up being the attacker, exposing yourself to legal consequences, or facing lifetime banishment from using a particular service.

One of the most common attacks is the XML Bomb, also known as the billion laughs attack. The attack exploits entity expansion in DTD to blow up the memory and occupy the CPU for as long as possible. All you need to stop an unprotected web server from receiving new traffic are these few lines of XML code:

Python

import xml.etree.ElementTree as ET
ET.fromstring("""\
<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ELEMENT lolz (#PCDATA)>
 <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
 <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
 <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
 <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
 <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
 <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
 <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
 <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>""")

A naïve parser will try to resolve the custom entity &lol9; placed in the document root by inspecting the DTD. However, that entity itself refers to another entity several times, which refers to yet another entity, and so forth. When you run the script above, you’ll notice something disturbing about your memory and the processing unit.

The main memory and the swap partition are exhausted in just a matter of seconds while one of the CPUs works at 100% of its capacity. The system only recovers once the memory becomes full and the Python process gets killed.
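To see why so little XML can do so much damage, you can work out the size of the fully expanded document. Each level in the DTD above repeats the previous entity ten times, so the text grows by a factor of ten per level. The following back-of-the-envelope sketch does the arithmetic without parsing anything. The expanded_size() helper is a hypothetical name:

```python
def expanded_size(levels=9, fanout=10, base="lol"):
    """Return the character count of the top-level entity once expanded."""
    size = len(base)
    for _ in range(levels):
        size *= fanout  # Each level repeats the previous entity `fanout` times
    return size


total = expanded_size()
print(f"{total:,} characters")
```

With nine levels on top of the three-letter base entity, that’s three billion characters, or a billion copies of "lol", which is where the attack gets its name.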

Another popular type of attack known as XXE takes advantage of general external entities to read local files and make network requests. Nevertheless, starting from Python 3.7.1, this feature has been disabled by default to increase security. If you trust your data, then you can tell the SAX parser to process external entities anyway:

Python

>>> from xml.sax import make_parser
>>> from xml.sax.handler import feature_external_ges

>>> parser = make_parser()
>>> parser.setFeature(feature_external_ges, True)

This parser will be able to read local files on your computer. It may pull usernames on a Unix-like operating system, for example:

Python

>>> from xml.dom.minidom import parseString

>>> xml = """\
... <?xml version="1.0" encoding="UTF-8"?>
... <!DOCTYPE root [
...     <!ENTITY usernames SYSTEM "/etc/passwd">
... ]>
... <root>&usernames;</root>"""

>>> document = parseString(xml, parser)
>>> print(document.documentElement.toxml())
<root>root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
⋮
realpython:x:1001:1001:Real Python,,,:/home/realpython:/bin/bash
</root>

It’s perfectly feasible to send that data over the network to a remote server!

Now, how can you protect yourself from such attacks? The official Python documentation prominently warns you about the risks of using the built-in XML parsers and recommends switching to an external package in mission-critical applications. While not distributed with Python, defusedxml is a drop-in replacement for all the parsers in the standard library.

The library imposes strict limits and disables a lot of the dangerous XML features. It should stop most of the well-known attacks, including the two just described. To use it, grab the library from PyPI and replace your import statements accordingly:

Python

>>> import defusedxml.ElementTree as ET
>>> ET.parse("bomb.xml")
Traceback (most recent call last):
  ...
    raise EntitiesForbidden(name, value, base, sysid, pubid, notation_name)
defusedxml.common.EntitiesForbidden: EntitiesForbidden(name='lol', system_id=None, public_id=None)

That’s it! Forbidden features won’t make it through anymore.
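If you’re curious how a parser can refuse entities before they ever get expanded, then here’s a minimal sketch of the general idea using the standard library’s xml.parsers.expat. It’s an illustration under simplifying assumptions rather than defusedxml’s actual code, and it’s no substitute for the library. The EntityForbidden exception and the parse_without_entities() function are hypothetical names:

```python
import xml.parsers.expat


class EntityForbidden(ValueError):
    """Hypothetical error raised when a document declares an entity."""


def parse_without_entities(xml_text):
    parser = xml.parsers.expat.ParserCreate()

    def reject_entity(name, is_parameter, value, base,
                      system_id, public_id, notation):
        # Expat calls this for each entity declaration in the DTD,
        # so raising here aborts parsing before anything is expanded
        raise EntityForbidden(f"Entity declaration not allowed: {name}")

    parser.EntityDeclHandler = reject_entity
    parser.Parse(xml_text, True)


# A deliberately tiny, harmless cousin of the XML bomb
bomb = """\
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;">
]>
<lolz>&lol1;</lolz>"""

try:
    parse_without_entities(bomb)
except EntityForbidden as error:
    outcome = str(error)
else:
    outcome = "parsed"

print(outcome)
```

The parser stops at the very first declaration, which is consistent with the traceback above naming the lol entity. Documents that don’t declare any entities pass through untouched.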

Conclusion

The XML data format is a mature and surprisingly powerful standard that is still in use today, especially in the enterprise setting. Choosing the right XML parser is crucial in finding the sweet spot between performance, security, compliance, and convenience.

This tutorial puts a detailed roadmap in your hand to navigate the confusing maze of XML parsers in Python. You know where to take the shortcuts and how to avoid dead ends, saving you lots of time.

In this tutorial, you learned how to:

Choose the right XML parsing model
Use the XML parsers in the standard library
Use major XML parsing libraries
Parse XML documents declaratively using data binding
Use safe XML parsers to eliminate security vulnerabilities

Now, you understand the different strategies for parsing XML documents as well as their strengths and weaknesses. With this knowledge, you’re able to pick the most suitable XML parser for your specific use case and even combine more than one to read multi-gigabyte XML files faster.

About Bartosz Zaczyński

Bartosz is an experienced software engineer and Python educator with an M.Sc. in Applied Computer Science.

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Aldren

David

Geir Arne

Sadie
