Author: Stefan Behnel
This document describes how to read the source code of lxml and how to start working on it. You might also be interested in the companion document that describes how to build lxml from sources.
Cython is the language that lxml is written in. It is a very Python-like language that was specifically designed for writing Python extension modules.
The reason why Cython (or actually its predecessor Pyrex at the time) was chosen as an implementation language for lxml is that it makes it very easy to interface with both the Python world and external C code. Cython generates all the necessary glue code for the Python API, including Python types, calling conventions and reference counting. On the other side of the table, calling into C code takes no more than declaring the signature of the function and maybe some variables as being C types, pointers or structs, and then calling it. The rest of the code is just plain Python code.
The Cython language is so close to Python that the Cython compiler can actually compile many, many Python programs to C without major modifications. But the real speed gains of a C compilation come from type annotations that were added to the language and that allow Cython to generate very efficient C code.
Even if you are not familiar with Cython, you should keep in mind that a slow implementation of a feature is better than none. So, if you want to contribute and have an idea what code you want to write, feel free to start with a pure Python implementation. Chances are, if you get the change officially accepted and integrated, others will take the time to optimise it so that it runs fast in Cython.
First of all, read how to build lxml from sources to learn how to retrieve the source code from the GitHub repository and how to build it. The source code lives in the subdirectory src of the checkout.
The main extension modules in lxml are lxml.etree and lxml.objectify. All main modules have the file extension .pyx, which shows the descent from Pyrex. As usual in Python, the main files start with a short description and a couple of imports. Cython distinguishes between the run-time import statement (as known from Python) and the compile-time cimport statement, which imports C declarations, either from external libraries or from other Cython modules.
lxml's tree API is based on proxy objects. That means every Element object (or rather _Element object) is a proxy for a libxml2 node structure. The class declaration is (mainly):
    cdef class _Element:
        cdef _Document _doc
        cdef xmlNode* _c_node
It is a naming convention that C variables and C level class members that are passed into libxml2 start with the prefix c_ (commonly libxml2 struct pointers), and that C level class members are prefixed with an underscore. So you will often see names like c_doc for an xmlDoc* variable (or c_node for an xmlNode*), or the above _c_node for a class member that points to an xmlNode struct (or _c_doc for an xmlDoc*).
It is important to know that every proxy in lxml has a factory function that properly sets up C level members. Proxy objects must never be instantiated outside of that factory. For example, to instantiate an _Element object or its subclasses, you must always call its factory function:
    cdef xmlNode* c_node
    cdef _Document doc
    cdef _Element element
    ...
    element = _elementFactory(doc, c_node)
Good places to see how this factory is used are the Element methods getparent(), getnext() and getprevious().
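The proxy-and-factory arrangement can be illustrated with a pure Python analogy. This is only a sketch: Node, ElementProxy and element_factory are hypothetical stand-ins rather than lxml's real xmlNode, _Element and _elementFactory, and the weak-reference cache that hands out one proxy per node is an assumption added for illustration.

```python
import weakref


class Node:
    """Hypothetical stand-in for a C level tree node (not libxml2's xmlNode)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


class ElementProxy:
    """Thin wrapper around a Node, playing the role of _Element."""

    def __init__(self):
        # Both members are filled in by the factory, never by the caller.
        self._doc = None
        self._node = None

    @property
    def tag(self):
        return self._node.tag

    def getparent(self):
        # Navigation never builds a proxy by hand; it asks the factory.
        parent = self._node.parent
        return None if parent is None else element_factory(self._doc, parent)


# Assumed cache: id(node) -> live proxy. An entry disappears when its proxy
# is garbage collected; the proxy keeps its node alive while it exists.
_proxies = weakref.WeakValueDictionary()


def element_factory(doc, node):
    """Return the proxy for node, creating and initialising it if needed."""
    proxy = _proxies.get(id(node))
    if proxy is None:
        proxy = ElementProxy()
        proxy._doc = doc
        proxy._node = node
        _proxies[id(node)] = proxy
    return proxy


doc = object()  # placeholder for a document object
root = Node("root")
child = Node("child", parent=root)

child_proxy = element_factory(doc, child)
parent_proxy = child_proxy.getparent()
print(parent_proxy.tag)
```

Because every proxy is created in one place, the members are always set up consistently, which is the point of the rule that proxies must not be instantiated outside their factory.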
An important part of lxml is the documentation that lives in the doc directory. It describes a large part of the API and comprises a lot of example code in the form of doctests.
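As a rough illustration of how example code in a text document doubles as a test, the following sketch runs the doctests embedded in a small piece of documentation text. The text in DOC_TEXT is made up for this example, and the standard library's xml.etree.ElementTree is used under the name etree as a stand-in so that the sketch runs without lxml installed; it is not how lxml's own test suite is wired up.

```python
import doctest
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Hypothetical documentation snippet with embedded doctest examples.
DOC_TEXT = '''
Parsing a small document
------------------------

The root element and its children are available right after parsing:

    >>> root = etree.fromstring("<root><a/><b/></root>")
    >>> root.tag
    'root'
    >>> len(root)
    2
'''

# Extract the examples from the text and run them with "etree" predefined.
parser = doctest.DocTestParser()
test = parser.get_doctest(DOC_TEXT, {"etree": ET}, "doc_text", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.failed, "failed of", results.attempted, "examples")
```

Every line starting with `>>>` is executed, and the text that follows it is compared with the actual output, so the documentation fails loudly when it drifts from the code.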
The documentation is written in the reStructuredText format, a very powerful text markup language that looks almost like plain text. It is part of the docutils package.
The project web site of lxml is completely generated from these text documents. Even the side menu is just collected from the table of contents that the ReST processor writes into each HTML page. Obviously, we use lxml for this.
The easiest way to generate the HTML pages is by calling:
make html
This will call the script doc/mkhtml.py to run the ReST processor on the files. After generating an HTML page the script parses it back in to build the side menu, and injects the complete menu into each page at the very end.
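The parse-back-and-inject idea can be sketched in a few lines. This is not the logic of doc/mkhtml.py: the page contents and the functions build_menu and inject_menu are hypothetical, the menu is built from each page's first h1 heading instead of the generated table of contents, appending it to the end of the body is an assumed placement, and the standard library's ElementTree stands in for lxml.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical generated pages: file name -> well-formed HTML text.
pages = {
    "index.html": "<html><body><h1>lxml</h1><p>Intro</p></body></html>",
    "build.html": "<html><body><h1>How to build lxml</h1><p>Steps</p></body></html>",
}


def build_menu(trees):
    """Collect the first <h1> of every page into a single <ul> of links."""
    menu = ET.Element("ul", {"class": "menu"})
    for name, tree in trees.items():
        item = ET.SubElement(menu, "li")
        link = ET.SubElement(item, "a", {"href": name})
        link.text = tree.findtext(".//h1")
    return menu


def inject_menu(tree, menu):
    """Append a private copy of the menu to the page body."""
    tree.find("body").append(copy.deepcopy(menu))


# Step 1: parse every generated page back in.
trees = {name: ET.fromstring(text) for name, text in pages.items()}

# Step 2: build the complete menu once all pages are known.
menu = build_menu(trees)

# Step 3: inject the complete menu into each page and serialise again.
for tree in trees.values():
    inject_menu(tree, menu)
result = {name: ET.tostring(tree, encoding="unicode") for name, tree in trees.items()}

print(result["index.html"])
```

The menu can only be injected after all pages have been processed, because each page's menu has to link to every other page.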
Running the make command will also generate the API documentation if you have epydoc installed. The epydoc package will import and introspect the extension modules and also introspect and parse the Python modules of lxml. The aggregated information will then be written out into an HTML documentation site.
The main module, lxml.etree, is in the file lxml.etree.pyx. It implements the main functions and types of the ElementTree API, as well as all the factory functions for proxies. It is the best place to start if you want to find out how a specific feature is implemented.
At the very end of the file, it contains a series of include statements that merge the rest of the implementation into the generated C code. Yes, you read that right: no importing, no source file namespacing, just plain good old include and a huge C code result of more than 100,000 lines that we throw right into the C compiler.
The main include files are:
Error log handling. All error messages that libxml2 generates internally walk through the code in this file to end up in lxml's Python level error logs.
At the end of the file, you will find a long list of named error codes. It is generated from the libxml2 HTML documentation (using lxml, of course). See the script update-error-constants.py for this.
The different schema languages (DTD, RelaxNG, XML Schema and Schematron) are implemented in the following include files:
The lxml package also contains a number of pure Python modules:
A Cython-implemented extension module that uses the public C-API of lxml.etree. It provides a Python object-like interface to XML trees. The implementation resides in the file lxml.objectify.pyx.
A specialised toolkit for HTML handling, based on lxml.etree. This is implemented in pure Python.