How to read the source of lxml

Author:	Stefan Behnel

This document describes how to read the source code of lxml and how to start working on it. You might also be interested in the companion document that describes how to build lxml from sources.

What is Cython?

Cython is the language that lxml is written in. It is a very Python-like language that was specifically designed for writing Python extension modules.

The reason why Cython (or actually its predecessor Pyrex at the time) was chosen as an implementation language for lxml is that it makes it very easy to interface with both the Python world and external C code. Cython generates all the necessary glue code for the Python API, including Python types, calling conventions and reference counting. On the other side of the table, calling into C code takes no more than declaring the signature of the function and maybe some variables as being C types, pointers or structs, and then calling it. The rest of the code is just plain Python code.

The Cython language is so close to Python that the Cython compiler can actually compile many, many Python programs to C without major modifications. But the real speed gains of a C compilation come from type annotations that were added to the language and that allow Cython to generate very efficient C code.

Even if you are not familiar with Cython, you should keep in mind that a slow implementation of a feature is better than none. So, if you want to contribute and have an idea what code you want to write, feel free to start with a pure Python implementation. Chances are, if you get the change officially accepted and integrated, others will take the time to optimise it so that it runs fast in Cython.

Where to start?

First of all, read how to build lxml from sources to learn how to retrieve the source code from the GitHub repository and how to build it. The source code lives in the subdirectory src of the checkout.

The main extension modules in lxml are lxml.etree and lxml.objectify. All main modules have the file extension .pyx, which shows the descent from Pyrex. As usual in Python, the main files start with a short description and a couple of imports. Cython distinguishes between the run-time import statement (as known from Python) and the compile-time cimport statement, which imports C declarations, either from external libraries or from other Cython modules.

Concepts

lxml's tree API is based on proxy objects. That means, every Element object (or rather _Element object) is a proxy for a libxml2 node structure. The class declaration is (mainly):

    cdef class _Element:
        cdef _Document _doc
        cdef xmlNode* _c_node

It is a naming convention that C variables and C level class members that are passed into libxml2 start with a prefixed c_ (commonly libxml2 struct pointers), and that C level class members are prefixed with an underscore. So you will often see names like c_doc for an xmlDoc* variable (or c_node for an xmlNode*), or the above _c_node for a class member that points to an xmlNode struct (or _c_doc for an xmlDoc*).

It is important to know that every proxy in lxml has a factory function that properly sets up C level members. Proxy objects must never be instantiated outside of that factory. For example, to instantiate an _Element object or its subclasses, you must always call its factory function:

    cdef xmlNode* c_node
    cdef _Document doc
    cdef _Element element
    ...
    element = _elementFactory(doc, c_node)

Good places to see how this factory is used are the Element methods getparent(), getnext() and getprevious().
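The proxy-and-factory idea can be sketched in plain Python. This is only an analogy under stated assumptions: the names Node, ElementProxy and element_factory are hypothetical, and the weak-value registry is an illustrative choice, not a description of how lxml tracks its proxies internally.

```python
import weakref


class Node:
    """Stand-in for a C-level tree node (hypothetical, not a libxml2 struct)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


class ElementProxy:
    """Python-level proxy for a Node. Create it only via element_factory()."""

    __slots__ = ("_doc", "_node", "__weakref__")

    @property
    def tag(self):
        return self._node.tag

    def getparent(self):
        # Navigation never builds a proxy directly; it asks the factory.
        parent = self._node.parent
        if parent is None:
            return None
        return element_factory(self._doc, parent)


# Registry of live proxies, keyed by node identity. Entries disappear
# automatically once a proxy is no longer referenced anywhere else.
_proxies = weakref.WeakValueDictionary()


def element_factory(doc, node):
    """Return the existing proxy for node, or create and register a new one."""
    proxy = _proxies.get(id(node))
    if proxy is None:
        proxy = ElementProxy.__new__(ElementProxy)
        proxy._doc = doc      # keeps the owning document alive
        proxy._node = node    # the wrapped low-level node
        _proxies[id(node)] = proxy
    return proxy


doc = object()  # placeholder for a document object
root_node = Node("root")
child_node = Node("child", parent=root_node)

root = element_factory(doc, root_node)
child = element_factory(doc, child_node)
same_parent = child.getparent() is root
```

Because every proxy goes through one factory, the setup of its members happens in a single place, and asking for the parent of child yields the proxy that already exists for the root node rather than a second, half-initialised one.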

The documentation

An important part of lxml is the documentation that lives in the doc directory. It describes a large part of the API and comprises a lot of example code in the form of doctests.

The documentation is written in the reStructuredText format, a very powerful text markup language that looks almost like plain text. It is part of the docutils package.

The project web site of lxml is completely generated from these text documents. Even the side menu is just collected from the table of contents that the ReST processor writes into each HTML page. Obviously, we use lxml for this.

The easiest way to generate the HTML pages is by calling:

    make html

This will call the script doc/mkhtml.py to run the ReST processor on the files. After generating an HTML page, the script parses it back in to build the side menu, and injects the complete menu into each page at the very end.

Running the make command will also generate the API documentation if you have epydoc installed. The epydoc package will import and introspect the extension modules and also introspect and parse the Python modules of lxml. The aggregated information will then be written out into an HTML documentation site.

lxml.etree

The main module, lxml.etree, is in the file lxml.etree.pyx. It implements the main functions and types of the ElementTree API, as well as all the factory functions for proxies. It is the best place to start if you want to find out how a specific feature is implemented.

At the very end of the file, it contains a series of include statements that merge the rest of the implementation into the generated C code. Yes, you read right: no importing, no source file namespacing, just plain good old include and a huge C code result of more than 100,000 lines that we throw right into the C compiler.

The main include files are:

apihelpers.pxi

Private C helper functions. Except for the factory functions, most of the little functions that are used all over the place are defined here. This includes things like reading out the text content of a libxml2 tree node, checking input from the API level, creating a new Element node or handling attribute values. If you want to work on the lxml code, you should keep these functions in the back of your head, as they will definitely make your life easier.

classlookup.pxi

Element class lookup mechanisms. The main API and engines for those who want to define custom Element classes and inject them into lxml.

docloader.pxi

Support for custom document loaders. Base class and registry for custom document resolvers.

extensions.pxi

Infrastructure for extension functions in XPath/XSLT, including XPath value conversion and function registration.

iterparse.pxi

Incremental XML parsing. An iterator class that builds iterparse events while parsing.

nsclasses.pxi

Namespace implementation and registry. The registry and engine for Element classes that use the ElementNamespaceClassLookup scheme.

parser.pxi

Parsers for XML and HTML. This is the main parser engine. It's the reason why you can parse a document from various sources in two lines of Python code. It's definitely not the right place to start reading lxml's source code.

parsertarget.pxi

An ElementTree compatible parser target implementation based on the SAX2 interface of libxml2.

proxy.pxi

Very low-level functions for memory allocation/deallocation and Element proxy handling. Ignoring this for the beginning will save your head from exploding.

public-api.pxi

The set of C functions that are exported to other extension modules at the C level. For example, lxml.objectify makes use of these. See the C-level API documentation.

readonlytree.pxi

A separate read-only implementation of the Element API. This is used in places where non-intrusive access to a tree is required, such as the PythonElementClassLookup or XSLT extension elements.

saxparser.pxi

SAX-like parser interfaces as known from ElementTree's TreeBuilder.

serializer.pxi

XML output functions. Basically everything that creates byte sequences from XML trees.

xinclude.pxi

XInclude support.

xmlerror.pxi

Error log handling. All error messages that libxml2 generates internally walk through the code in this file to end up in lxml's Python level error logs.

At the end of the file, you will find a long list of named error codes. It is generated from the libxml2 HTML documentation (using lxml, of course). See the script update-error-constants.py for this.

xmlid.pxi

XMLID and IDDict, a dictionary-like way to find Elements by their XML-ID attribute.

xpath.pxi

XPath evaluators.

xslt.pxi

XSL transformations, including the XSLT class, document lookup handling and access control.

The different schema languages (DTD, RelaxNG, XML Schema and Schematron) are implemented in the following include files:

dtd.pxi
relaxng.pxi
schematron.pxi
xmlschema.pxi
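To give a feel for what one of these include files provides at the Python level, here is a minimal sketch of the ElementTree compatible parser target interface mentioned under parsertarget.pxi. The TagCollector class is a hypothetical example, and the fallback import of the standard library's ElementTree is only there so that the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch runs without lxml; the target protocol is the same.
    import xml.etree.ElementTree as etree


class TagCollector:
    """Parser target that records the tag of every element that is opened."""

    def __init__(self):
        self.tags = []

    def start(self, tag, attrib):
        # Called for each opening tag.
        self.tags.append(tag)

    def end(self, tag):
        # Called for each closing tag; nothing to do here.
        pass

    def data(self, text):
        # Called for text content, possibly in several chunks.
        pass

    def close(self):
        # The return value becomes the result of parser.close().
        return self.tags


parser = etree.XMLParser(target=TagCollector())
parser.feed("<root><a/><b><c/></b></root>")
tags = parser.close()
print(tags)
```

No tree is built in this case: the parser only calls the target's methods, and whatever close() returns is handed back to the caller.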

Python modules

The lxml package also contains a number of pure Python modules:

builder.py: The E-factory and the ElementBuilder class. These provide a simple interface to XML tree generation.
cssselect.py: A CSS selector implementation based on XPath. The main class is called CSSSelector.
doctestcompare.py: A relaxed comparison scheme for XML/HTML markup in doctest.
ElementInclude.py: XInclude-like document inclusion, compatible with ElementTree.
_elementpath.py: XPath-like path language, compatible with ElementTree.
sax.py: SAX2 compatible interfaces to copy lxml trees from/to SAX compatible tools.
usedoctest.py: Wrapper module for doctestcompare.py that simplifies its usage from inside a doctest.
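As a small illustration of the XPath-like path language listed above for _elementpath.py, the following sketch uses the find(), findall() and findtext() methods of the ElementTree API. The sample document is made up for the example, and the fallback import only exists so that the snippet also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch runs without lxml; the path syntax is shared.
    import xml.etree.ElementTree as etree

# A made-up sample document.
root = etree.fromstring(
    "<catalog>"
    "<book id='1'><title>First</title></book>"
    "<book id='2'><title>Second</title></book>"
    "</catalog>"
)

# Text of the first match of a child path.
first_title = root.findtext("book/title")

# All descendants with a given tag, in document order.
all_titles = [title.text for title in root.findall(".//title")]

# A child selected by an attribute predicate.
second_book = root.find("book[@id='2']")
```

These path expressions cover only a subset of XPath. Full XPath evaluation in lxml goes through the separate evaluators listed under xpath.pxi.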

lxml.objectify

A Cython implemented extension module that uses the public C-API of lxml.etree. It provides a Python object-like interface to XML trees. The implementation resides in the file lxml.objectify.pyx.

lxml.html

A specialised toolkit for HTML handling, based on lxml.etree. This is implemented in pure Python.
