lxml has very sophisticated support for custom Element classes. You can provide your own classes for Elements and have lxml use them by default for all elements generated by a specific parser, only for a specific tag name in a specific namespace, or even for an exact element at a specific position in the tree.
Custom Elements must inherit from the lxml.etree.ElementBase class, which provides the Element interface for subclasses:
>>> from lxml import etree
>>> class honk(etree.ElementBase):
...     @property
...     def honking(self):
...         return self.get('honking') == 'true'
This defines a new Element class honk with a property honking.
The following document describes how you can make lxml.etree use these custom Element classes.
Being based on libxml2, lxml.etree holds the entire XML tree in a C structure. To communicate with Python code, it creates Python proxy objects for the XML elements on demand.
The mapping between C elements and Python Element classes is completely configurable. When you ask lxml.etree for an Element by using its API, it will instantiate your classes for you. All you have to do is tell lxml which class to use for which kind of Element. This is done through a class lookup scheme, as described in the sections below.
There is one thing to know up front. Element classes must not have an __init__ or __new__ method. There should not be any internal state either, except for the data stored in the underlying XML tree. Element instances are created and garbage collected as needed, so there is normally no way to predict when and how often a proxy is created for them. Even worse, when the __init__ method is called, the object is not even initialized yet to represent the XML tag, so there is not much use in providing an __init__ method in subclasses.
Most use cases will not require any class initialisation or proxy state, so you can skip to the next section for now. However, if you really need to set up your element class on instantiation, or need a way to persistently store state in the proxy instances instead of the XML tree, here is a way to do so.
There is one important guarantee regarding Element proxies. Once a proxy has been instantiated, it is kept alive as long as there is a Python reference to it, and any access to the XML element in the tree will return this very instance. Therefore, if you need to store local state in a custom Element class (which is generally discouraged), you can do so by keeping the Elements in a tree alive. If the tree doesn't change, you can simply do this:
proxy_cache = list(root.iter())
or
proxy_cache = set(root.iter())
or use any other suitable container. Note that you have to keep this cache manually up to date if the tree changes, which can get tricky in some cases.
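As a minimal sketch of the caching approach described above, the following keeps all proxies of a small, unchanging tree alive, so that state stored on a proxy is still there on the next access. The class name Stateful, the attribute visits and the sample document are made up for illustration.

```python
from lxml import etree


class Stateful(etree.ElementBase):
    """Hypothetical element class that keeps extra state on the proxy."""


parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=Stateful))

root = etree.fromstring("<root><a/><b/></root>", parser)

# Hold a reference to every proxy; they stay alive as long as this list does.
proxy_cache = list(root.iter())   # [root, a, b] in document order

# This state lives on the Python proxy only, not in the XML tree.
root[0].visits = 1

# While the proxy is referenced, every access returns the same instance.
same_proxy = root[0] is proxy_cache[1]
visits = root[0].visits
```

Without the cache, the proxy for the first child may be discarded once the last reference to it is gone, and a later access would then create a fresh proxy without the visits attribute.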
For proxy initialisation, ElementBase classes have an _init() method that can be overridden, as opposed to the normal __init__() method. It can be used to modify the XML tree, e.g. to construct special children or verify and update attributes.
The semantics of _init() are as follows:
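A small sketch of an _init() override may help here. It assumes a made-up class Tracked that adds a checked attribute when it is missing; the attribute name and the sample document are illustrative only.

```python
from lxml import etree


class Tracked(etree.ElementBase):
    def _init(self):
        # Called by lxml when it creates the proxy. The element is already
        # connected to the XML tree here, so attributes can be read and set.
        if self.get("checked") is None:
            self.set("checked", "yes")


parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=Tracked))

root = etree.fromstring("<root><child checked='no'/></root>", parser)

root_checked = root.get("checked")       # added by _init()
child_checked = root[0].get("checked")   # existing value is left alone
```

Note that _init() only runs when a proxy is actually instantiated. Elements that are never accessed from Python get no proxy, so a tree serialised right after parsing may lack the attributes that _init() would have added.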
The first thing to do when deploying custom element classes is to register a class lookup scheme on a parser. lxml.etree provides quite a number of different schemes that also support class lookup based on namespaces or attribute values. Most lookups support fallback chaining, which allows the next lookup mechanism to take over when the previous one fails to find a class.
For example, setting the honk Element as a default element class for a parser works as follows:
>>> parser_lookup = etree.ElementDefaultClassLookup(element=honk)
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(parser_lookup)
There is one drawback of the parser based scheme: the Element() factory does not know about your specialised parser and creates a new document that deploys the default parser:
>>> el = etree.Element("root")
>>> print(isinstance(el, honk))
False
You should therefore avoid using this factory function in code that uses custom classes. The makeelement() method of parsers provides a simple replacement:
>>> el = parser.makeelement("root")
>>> print(isinstance(el, honk))
True
If you use a parser at the module level, you can easily redirect a module level Element() factory to the parser method by adding code like this:
>>> module_level_parser = etree.XMLParser()
>>> Element = module_level_parser.makeelement
While the XML() and HTML() factories also depend on the default parser, you can pass them a different parser as second argument:
>>> element = etree.XML("<test/>")
>>> print(isinstance(element, honk))
False
>>> element = etree.XML("<test/>", parser)
>>> print(isinstance(element, honk))
True
Whenever you create a document with a parser, it will inherit the lookup scheme and all subsequent element instantiations for this document will use it:
>>> element = etree.fromstring("<test/>", parser)
>>> print(isinstance(element, honk))
True
>>> el = etree.SubElement(element, "subel")
>>> print(isinstance(el, honk))
True
For testing code in the Python interpreter and for small projects, you may also consider setting a lookup scheme on the default parser. To avoid interfering with other modules, however, it is usually a better idea to use a dedicated parser for each module (or a parser pool when using threads) and then register the required lookup scheme only for this parser.
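One way to follow this advice in threaded code is to keep one dedicated parser per thread. The sketch below uses threading.local for that, which is only one possible reading of "parser pool"; the helper name get_parser() is hypothetical and not part of lxml.

```python
import threading

from lxml import etree


class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'


_local = threading.local()


def get_parser():
    """Return this thread's dedicated parser, creating it on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = etree.XMLParser()
        # The lookup is registered on this parser only, so the default
        # parser and the parsers of other modules are not affected.
        parser.set_element_class_lookup(
            etree.ElementDefaultClassLookup(element=honk))
        _local.parser = parser
    return parser


root = etree.fromstring('<root honking="true"/>', get_parser())
plain = etree.fromstring('<root honking="true"/>')   # default parser
```

Here root is a honk instance, while plain, parsed with the untouched default parser, is a regular Element (assuming no other code has changed the default lookup).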
ElementDefaultClassLookup is the simplest lookup mechanism. It always returns the default element class. Consequently, no further fallbacks are supported, but this scheme is a nice fallback for other custom lookup mechanisms. Specifically, it also handles comments and processing instructions, which are easy to forget about when mapping proxies to classes.
Usage:
>>> lookup = etree.ElementDefaultClassLookup()
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)
Note that the default for new parsers is to use the global fallback, which is also the default lookup (if not configured otherwise).
To change the default element implementation, you can pass your new class to the constructor. While it accepts classes for element, comment and pi nodes, most use cases will only override the element class:
>>> el = parser.makeelement("myelement")
>>> print(isinstance(el, honk))
False
>>> lookup = etree.ElementDefaultClassLookup(element=honk)
>>> parser.set_element_class_lookup(lookup)
>>> el = parser.makeelement("myelement")
>>> print(isinstance(el, honk))
True
>>> el.honking
False
>>> el = parser.makeelement("myelement", honking='true')
>>> etree.tostring(el)
b'<myelement honking="true"/>'
>>> el.honking
True
>>> root = etree.fromstring(
...     '<root honking="true"><!--comment--></root>', parser)
>>> root.honking
True
>>> print(root[0].text)
comment
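The comment and pi arguments work the same way as element. As a sketch, the following overrides only the comment class; the class ShoutingComment and its shouted property are made up for illustration, and the example assumes that comment classes derive from lxml.etree.CommentBase.

```python
from lxml import etree


class ShoutingComment(etree.CommentBase):
    """Hypothetical comment class with one extra property."""

    @property
    def shouted(self):
        return self.text.upper()


# Only the comment class is replaced; elements keep the default class.
lookup = etree.ElementDefaultClassLookup(comment=ShoutingComment)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring('<root><!--comment--></root>', parser)
comment = root[0]
shouted = comment.shouted
```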
ElementNamespaceClassLookup is an advanced lookup mechanism that supports namespace/tag-name specific element classes. You can select it by calling:
>>> lookup = etree.ElementNamespaceClassLookup()
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)
See the separate section on implementing namespaces below to learn how to make use of it.
This scheme supports a fallback mechanism that is used in the case where the namespace is not found or no class was registered for the element name. Normally, the default class lookup is used here. To change it, pass the desired fallback lookup scheme to the constructor:
>>> fallback = etree.ElementDefaultClassLookup(element=honk)
>>> lookup = etree.ElementNamespaceClassLookup(fallback)
>>> parser.set_element_class_lookup(lookup)
>>> root = etree.fromstring(
...     '<root honking="true"><!--comment--></root>', parser)
>>> root.honking
True
>>> print(root[0].text)
comment
The AttributeBasedElementClassLookup scheme uses a mapping from attribute values to classes. An attribute name is set at initialisation time and is then used to find the corresponding value in a dictionary. It is set up as follows:
>>> id_class_mapping = {'1234': honk}  # maps attribute values to classes
>>> lookup = etree.AttributeBasedElementClassLookup(
...     'id', id_class_mapping)
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)
And here is how to use it:
>>> xml = '<a id="123"><b id="1234"/><b id="1234" honking="true"/></a>'
>>> a = etree.fromstring(xml, parser)
>>> a.honking       # id does not match !
Traceback (most recent call last):
AttributeError: 'lxml.etree._Element' object has no attribute 'honking'
>>> a[0].honking
False
>>> a[1].honking
True
This lookup scheme uses its fallback if the attribute is not found or its value is not in the mapping. Normally, the default class lookup is used here. If you want to use the namespace lookup, for example, you can use this code:
>>> fallback = etree.ElementNamespaceClassLookup()
>>> lookup = etree.AttributeBasedElementClassLookup(
...     'id', id_class_mapping, fallback)
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)
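To illustrate how such a chain resolves, here is a sketch with three child elements that end up in three different classes. The class NsElement and the sample document are assumptions made for this example; the namespace URI is the one used elsewhere in this document.

```python
from lxml import etree


class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'


class NsElement(etree.ElementBase):
    """Hypothetical default class for one namespace."""


# Second link in the chain: classes by namespace.
fallback = etree.ElementNamespaceClassLookup()
fallback.get_namespace('http://hui.de/honk')[None] = NsElement

# First link in the chain: classes by value of the 'id' attribute.
lookup = etree.AttributeBasedElementClassLookup(
    'id', {'1234': honk}, fallback)

parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring(
    '<root xmlns:h="http://hui.de/honk">'
    '<a id="1234"/><h:b/><c/>'
    '</root>', parser)

by_attribute = root[0]   # id matches the mapping
by_namespace = root[1]   # no id, but a registered namespace
by_default = root[2]     # neither, so the default class is used
```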
CustomElementClassLookup is the most customisable way of finding element classes on a per-element basis. It allows you to implement a custom lookup scheme in a subclass:
>>> class MyLookup(etree.CustomElementClassLookup):
...     def lookup(self, node_type, document, namespace, name):
...         if node_type == 'element':
...             return honk  # be a bit more selective here ...
...         else:
...             return None  # pass on to (default) fallback
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(MyLookup())
>>> root = etree.fromstring(
...     '<root honking="true"><!--comment--></root>', parser)
>>> root.honking
True
>>> print(root[0].text)
comment
The .lookup() method must return either None (which triggers the fallback mechanism) or a subclass of lxml.etree.ElementBase. It can take any decision it wants based on the node type (one of "element", "comment", "PI", "entity"), the XML document of the element, or its namespace or tag name.
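A slightly more selective variant could decide by tag name, for instance. The sketch below is only an illustration: the class name SelectiveLookup and the sample document are made up.

```python
from lxml import etree


class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'


class SelectiveLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        # Only elements named 'honk' get the custom class.
        if node_type == 'element' and name == 'honk':
            return honk
        # Everything else (other elements, comments, ...) uses the fallback.
        return None


parser = etree.XMLParser()
parser.set_element_class_lookup(SelectiveLookup())

root = etree.fromstring(
    '<root><honk honking="true"/><other/><!--comment--></root>', parser)

first_is_honk = isinstance(root[0], honk)
second_is_honk = isinstance(root[1], honk)
comment_text = root[2].text
```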
Taking more elaborate decisions than allowed by the custom scheme is difficult to achieve in pure Python, as it results in a chicken-and-egg problem. It would require access to the tree before the elements in the tree have been instantiated as Python Element proxies.
Luckily, there is a way to do this. The PythonElementClassLookup works similarly to the custom lookup scheme:
>>> class MyLookup(etree.PythonElementClassLookup):
...     def lookup(self, document, element):
...         return MyElementClass  # defined elsewhere
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(MyLookup())
As in the custom lookup scheme, the lookup() method receives the opaque document instance that contains the Element, here as its first argument. The second argument is a lightweight Element proxy implementation that is only valid during the lookup. Do not try to keep a reference to it. Once the lookup is finished, the proxy will become invalid. You will get an AssertionError if you access any of its properties or methods outside the scope of the lookup call where it was instantiated.
During the lookup, the element object behaves mostly like a normal Element instance. It provides the properties tag, text, tail etc. and supports indexing, slicing and the getchildren(), getparent() etc. methods. It does not support iteration, nor does it support any kind of modification. All of its properties are read-only and it cannot be removed or inserted into other trees. You can use it as a starting point to freely traverse the tree and collect any kind of information that its elements provide. Once you have decided which class to use for this element, you can simply return it and have lxml take care of cleaning up the instantiated proxy classes.
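As a sketch of a decision that depends on the tree context, the following lookup picks the honk class only for elements whose parent is a flock element. The class name ContextLookup, the tag names and the sample document are made up for illustration.

```python
from lxml import etree


class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'


class ContextLookup(etree.PythonElementClassLookup):
    def lookup(self, document, element):
        # 'element' is the read-only proxy; it is only valid inside this call.
        if not isinstance(element.tag, str):
            return None          # comments etc. go to the fallback
        parent = element.getparent()
        if parent is not None and parent.tag == 'flock':
            return honk
        return None              # use the fallback for everything else


parser = etree.XMLParser()
parser.set_element_class_lookup(ContextLookup())

root = etree.fromstring(
    '<farm><flock><goose honking="true"/></flock><goose/></farm>', parser)

inside = root[0][0]    # goose inside flock
outside = root[1]      # goose directly below farm
```

Both goose elements have the same tag name, so neither the namespace lookup nor the custom lookup could tell them apart; only the parent inspected during the lookup does.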
Sidenote: this lookup scheme originally lived in a separate module called lxml.pyclasslookup.
Up to lxml 2.1, you could not instantiate proxy classes yourself. Only lxml.etree could do that when creating an object representation of an existing XML element. Since lxml 2.2, however, instantiating a custom Element class will simply create a new Element:
>>> el = honk(honking='true')
>>> el.tag
'honk'
>>> el.honking
True
Note, however, that the proxy you create here will be garbage collected just like any other proxy. You can therefore not count on lxml.etree using the same class that you instantiated when you access this Element a second time after letting its reference go. You should therefore always use a corresponding class lookup scheme that returns your Element proxy classes for the elements that they create. The ElementNamespaceClassLookup is generally a good match.
You can use custom Element classes to quickly create XML fragments:
>>> class hale(etree.ElementBase): pass
>>> class bopp(etree.ElementBase): pass
>>> el = hale("some ", honk(honking='true'), bopp, " text")
>>> print(etree.tostring(el, encoding='unicode'))
<hale>some <honk honking="true"/><bopp/> text</hale>
lxml allows you to implement namespaces, in a rather literal sense. After setting up the namespace class lookup mechanism as described above, you can build a new element namespace (or retrieve an existing one) by calling the get_namespace(uri) method of the lookup:
>>> lookup = etree.ElementNamespaceClassLookup()
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)
>>> namespace = lookup.get_namespace('http://hui.de/honk')
and then register the new element type with that namespace, say, under the tag name honk:
>>> namespace['honk'] = honk
If you have many Element classes declared in one module, and they are all named like the elements they create, you can simply use namespace.update(globals()) at the end of your module to declare them automatically. The implementation is smart enough to ignore everything that is not an Element class.
After this, you create and use your XML elements through the normal API of lxml:
>>> xml = '<honk xmlns="http://hui.de/honk" honking="true"/>'
>>> honk_element = etree.XML(xml, parser)
>>> print(honk_element.honking)
True
The same works when creating elements by hand:
>>> honk_element = parser.makeelement('{http://hui.de/honk}honk',
...                                   honking='true')
>>> print(honk_element.honking)
True
Essentially, this allows you to give Elements a custom API based on their namespace and tag name.
A somewhat related topic is extension functions, which use a similar mechanism for registering Python functions for use in XPath and XSLT.
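For comparison, registering an XPath extension function looks roughly like the sketch below. The function shout, the prefix ex and the URI http://example.com/functions are made up for illustration; note that this registration is global rather than tied to a parser.

```python
from lxml import etree


def shout(context, text):
    """Hypothetical XPath extension function: upper-case a string."""
    return text.upper()


# Functions are registered in a namespace registry, much like Element classes.
functions = etree.FunctionNamespace('http://example.com/functions')
functions.prefix = 'ex'
functions['shout'] = shout

root = etree.XML('<a>honk</a>')
result = root.xpath('ex:shout(string(.))')
```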
In the setup example above, we associated the honk Element class only with the 'honk' element. If an XML tree contains different elements in the same namespace, they do not pick up the same implementation:
>>> xml = ('<honk xmlns="http://hui.de/honk" honking="true">'
...        '<bla/><!--comment-->'
...        '</honk>')
>>> honk_element = etree.XML(xml, parser)
>>> print(honk_element.honking)
True
>>> print(honk_element[0].honking)
Traceback (most recent call last):
...
AttributeError: 'lxml.etree._Element' object has no attribute 'honking'
>>> print(honk_element[1].text)
comment
You can therefore provide one implementation per element name in each namespace and have lxml select the right one on the fly. If you want one element implementation per namespace (ignoring the element name) or prefer having a common class for most elements except a few, you can specify a default implementation for an entire namespace by registering that class with the empty element name (None).
You may consider following an object oriented approach here. If you build a class hierarchy of element classes, you can also implement a base class for a namespace that is used if no specific element class is provided. Again, you can just pass None as an element name:
>>> class HonkNSElement(etree.ElementBase):
...     def honk(self):
...         return "HONK"
>>> namespace[None] = HonkNSElement  # default Element for namespace
>>> class HonkElement(HonkNSElement):
...     @property
...     def honking(self):
...         return self.get('honking') == 'true'
>>> namespace['honk'] = HonkElement  # Element for specific tag
Now you can rely on lxml to always return objects of type HonkNSElement or its subclasses for elements of this namespace:
>>> xml = ('<honk xmlns="http://hui.de/honk" honking="true">'
...        '<bla/><!--comment-->'
...        '</honk>')
>>> honk_element = etree.fromstring(xml, parser)
>>> print(type(honk_element))
<class 'HonkElement'>
>>> print(type(honk_element[0]))
<class 'HonkNSElement'>
>>> print(honk_element.honking)
True
>>> print(honk_element.honk())
HONK
>>> print(honk_element[0].honk())
HONK
>>> print(honk_element[0].honking)
Traceback (most recent call last):
...
AttributeError: 'HonkNSElement' object has no attribute 'honking'
>>> print(honk_element[1].text)  # uses fallback for non-elements
comment
Since lxml 4.1, the registration is more conveniently done with class decorators. The namespace registry object is callable with a name (or None) as argument and can then be used as a decorator.
>>> honk_elements = lookup.get_namespace('http://hui.de/honk')
>>> @honk_elements(None)
... class HonkNSElement(etree.ElementBase):
...     def honk(self):
...         return "HONK"
If the class has the same name as the tag, you can also leave out the call and use the blank decorator instead:
>>> @honk_elements
... class honkel(HonkNSElement):
...     @property
...     def honking(self):
...         return self.get('honking') == 'true'
>>> xml = '<honkel xmlns="http://hui.de/honk" honking="true"><bla/><!--comment--></honkel>'
>>> honk_element = etree.fromstring(xml, parser)
>>> print(type(honk_element))
<class 'honkel'>
>>> print(type(honk_element[0]))
<class 'HonkNSElement'>