Document loading and URL resolving

The normal way to load external entities (such as DTDs) is by using XML catalogs. lxml also has support for user provided document loaders in both the parsers and XSL transformations. These so-called resolvers are subclasses of the etree.Resolver class.

XML Catalogs

When loading an external entity for a document, e.g. a DTD, the parser is normally configured to prevent network access (see the no_network parser option). Instead, it will try to load the entity from its local file system path or, in the most common case that the entity uses a network URL as reference, from a local XML catalog.

XML catalogs are the preferred and agreed-on mechanism to load external entities from XML processors. Most tools will use them, so it is worth configuring them properly on a system. Many Linux installations use them by default, but on other systems they may need to get enabled manually. The libxml2 site has some documentation on how to set up XML catalogs.

URI Resolvers

Here is an example of a custom resolver:

>>> from lxml import etree

>>> class DTDResolver(etree.Resolver):
...     def resolve(self, url, id, context):
...         print("Resolving URL '%s'" % url)
...         return self.resolve_string(
...             '<!ENTITY myentity "[resolved text: %s]">' % url, context)

This defines a resolver that always returns a dynamically generated DTD fragment defining an entity. The url argument passes the system URL of the requested document, and the id argument passes the public ID. Note that either of these may be None. The context object is not normally used by client code.
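
To make the two arguments concrete, here is a minimal sketch of a resolver that only records what the parser asks for. The class name RecordingResolver, the public ID and the file name doc.dtd are made up for illustration. The resolver is registered on a parser with parser.resolvers.add(), and it answers every request with a one-line DTD so that parsing can continue.

```python
from io import StringIO
from lxml import etree


class RecordingResolver(etree.Resolver):
    """Hypothetical resolver that records each (url, id) request."""

    def __init__(self):
        self.requests = []

    def resolve(self, url, id, context):
        # url is the system URL, id is the public ID; either may be None.
        self.requests.append((url, id))
        # Answer with a minimal DTD so that the parser can carry on.
        return self.resolve_string('<!ELEMENT doc ANY>', context)


resolver = RecordingResolver()
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

# A DOCTYPE that carries both a public ID and a system URL.
xml = '<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc//EN" "doc.dtd"><doc/>'
etree.parse(StringIO(xml), parser)

for url, public_id in resolver.requests:
    print("system URL: %r, public ID: %r" % (url, public_id))
```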

Resolving is based on the methods of the Resolver object that build internal representations of the result document. The following methods exist:

resolve_string takes a parsable string as result document
resolve_filename takes a filename
resolve_file takes an open file-like object that has at least a read() method
resolve_empty resolves into an empty document
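
As a rough illustration of three of these methods, the sketch below picks one of them depending on the requested URL. The class name, the *.dtd names and the temporary file are assumptions made for this example only. resolve_empty() is left out because an empty DTD would not define the entity that the test document uses.

```python
import os
import tempfile
from io import BytesIO, StringIO
from lxml import etree


class MethodDemoResolver(etree.Resolver):
    """Hypothetical resolver that picks a resolve_*() method by URL."""

    def __init__(self, dtd_path):
        self.dtd_path = dtd_path

    def resolve(self, url, id, context):
        if url is None:
            return None
        if url.endswith("from-string.dtd"):
            # The result document is passed directly as a string.
            return self.resolve_string('<!ENTITY source "string">', context)
        if url.endswith("from-filename.dtd"):
            # The result document is read from a path on disk.
            return self.resolve_filename(self.dtd_path, context)
        if url.endswith("from-file.dtd"):
            # Any object with a read() method will do.
            data = BytesIO(b'<!ENTITY source "file object">')
            return self.resolve_file(data, context)
        return None


results = {}
with tempfile.TemporaryDirectory() as tmp:
    dtd_path = os.path.join(tmp, "demo.dtd")
    with open(dtd_path, "wb") as f:
        f.write(b'<!ENTITY source "filename">')

    parser = etree.XMLParser(load_dtd=True)
    parser.resolvers.add(MethodDemoResolver(dtd_path))

    for name in ("from-string.dtd", "from-filename.dtd", "from-file.dtd"):
        xml = '<!DOCTYPE doc SYSTEM "%s"><doc>&source;</doc>' % name
        results[name] = etree.parse(StringIO(xml), parser).getroot().text

for name, text in results.items():
    print(name, "->", text)
```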

The resolve() method may choose to return None, in which case the next registered resolver (or the default resolver) is consulted. Resolving always terminates if resolve() returns the result of any of the above resolve_*() methods.
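
The fall-through behaviour can be sketched with two resolvers on the same parser, each of which only answers for one file name and returns None otherwise. The class name NameResolver and the DTD names are hypothetical. The example does not depend on the order in which the resolvers are asked.

```python
from io import StringIO
from lxml import etree


class NameResolver(etree.Resolver):
    """Hypothetical resolver that only answers for a single file name."""

    def __init__(self, filename, text):
        self.filename = filename
        self.text = text

    def resolve(self, url, id, context):
        if url is None or not url.endswith(self.filename):
            # Not our request: None hands it on to the next resolver.
            return None
        return self.resolve_string(
            '<!ENTITY who "%s">' % self.text, context)


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(NameResolver("first.dtd", "first resolver"))
parser.resolvers.add(NameResolver("second.dtd", "second resolver"))


def parse_with(dtd_name):
    """Parse a small document whose DTD must be resolved."""
    xml = '<!DOCTYPE doc SYSTEM "%s"><doc>&who;</doc>' % dtd_name
    return etree.parse(StringIO(xml), parser).getroot().text


first_text = parse_with("first.dtd")
second_text = parse_with("second.dtd")
print(first_text)
print(second_text)
```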

Resolvers are registered local to a parser:

>>> parser = etree.XMLParser(load_dtd=True)
>>> parser.resolvers.add(DTDResolver())

Note that we instantiate a parser that loads the DTD. This is not done by the default parser, which does no validation. When we use this parser to parse a document that requires resolving a URL, it will call our custom resolver:

>>> from io import StringIO
>>> xml = '<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
>>> tree = etree.parse(StringIO(xml), parser)
Resolving URL 'MissingDTD.dtd'
>>> root = tree.getroot()
>>> print(root.text)
[resolved text: MissingDTD.dtd]

The entity in the document was correctly resolved by the generated DTD fragment.

Document loading in context

XML documents memorise their initial parser (and its resolvers) during their life-time. This means that a lookup process related to a document will use the resolvers of the document's parser. We can demonstrate this with a resolver that only responds to a specific prefix:

>>> class PrefixResolver(etree.Resolver):
...     def __init__(self, prefix):
...         self.prefix = prefix
...         self.result_xml = '''\
...              <xsl:stylesheet
...                     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...                <test xmlns="testNS">%s-TEST</test>
...              </xsl:stylesheet>
...              ''' % prefix
...     def resolve(self, url, pubid, context):
...         if url.startswith(self.prefix):
...             print("Resolved url %s as prefix %s" % (url, self.prefix))
...             return self.resolve_string(self.result_xml, context)

We demonstrate this in XSLT and use the following stylesheet as an example:

>>> xml_text = """\
... <xsl:stylesheet version="1.0"
...    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...   <xsl:include href="honk:test"/>
...   <xsl:template match="/">
...     <test>
...       <xsl:value-of select="document('hoi:test')/*/*/text()"/>
...     </test>
...   </xsl:template>
... </xsl:stylesheet>
... """

Note that it needs to resolve two URIs: honk:test when compiling the XSLT document (i.e. when resolving xsl:import and xsl:include elements) and hoi:test at transformation time, when calls to the document function are resolved. If we now register different resolvers with two different parsers, we can parse our document twice in different resolver contexts:

>>> hoi_parser = etree.XMLParser()
>>> normal_doc = etree.parse(StringIO(xml_text), hoi_parser)

>>> hoi_parser.resolvers.add(PrefixResolver("hoi"))
>>> hoi_doc = etree.parse(StringIO(xml_text), hoi_parser)

>>> honk_parser = etree.XMLParser()
>>> honk_parser.resolvers.add(PrefixResolver("honk"))
>>> honk_doc = etree.parse(StringIO(xml_text), honk_parser)

These contexts are important for the further behaviour of the documents. They memorise their original parser so that the correct set of resolvers is used in subsequent lookups. To compile the stylesheet, XSLT must resolve the honk:test URI in the xsl:include element. The hoi resolver cannot do that:

>>> transform = etree.XSLT(normal_doc)
Traceback (most recent call last):
  ...
lxml.etree.XSLTParseError: Cannot resolve URI honk:test

>>> transform = etree.XSLT(hoi_doc)
Traceback (most recent call last):
  ...
lxml.etree.XSLTParseError: Cannot resolve URI honk:test

However, if we use the honk resolver associated with the respective document, everything works fine:

>>> transform = etree.XSLT(honk_doc)
Resolved url honk:test as prefix honk

Running the transform accesses the same parser context again, but since it now needs to resolve the hoi URI in the call to the document function, its honk resolver will fail to do so:

>>> result = transform(normal_doc)
Traceback (most recent call last):
  ...
lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

>>> result = transform(hoi_doc)
Traceback (most recent call last):
  ...
lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

>>> result = transform(honk_doc)
Traceback (most recent call last):
  ...
lxml.etree.XSLTApplyError: Cannot resolve URI hoi:test

This can only be solved by adding a hoi resolver to the original parser:

>>> honk_parser.resolvers.add(PrefixResolver("hoi"))
>>> result = transform(honk_doc)
Resolved url hoi:test as prefix hoi
>>> print(str(result)[:-1])
<?xml version="1.0"?>
<test>hoi-TEST</test>

We can see that the hoi resolver was called to generate a document that was then inserted into the result document by the XSLT transformation. Note that this is completely independent of the XML file you transform, as the URI is resolved from within the stylesheet context:

>>> result = transform(normal_doc)
Resolved url hoi:test as prefix hoi
>>> print(str(result)[:-1])
<?xml version="1.0"?>
<test>hoi-TEST</test>

It may be seen as a matter of taste what resolvers the generated document inherits. For XSLT, the output document inherits the resolvers of the input document and not those of the stylesheet. Therefore, the last result does not inherit any resolvers at all.

I/O access control in XSLT

By default, XSLT supports all extension functions from libxslt and libexslt as well as Python regular expressions through EXSLT. Some extensions enable style sheets to read and write files on the local file system.

XSLT has a mechanism to control the access to certain I/O operations during the transformation process. This is most interesting where XSL scripts come from potentially insecure sources and must be prevented from modifying the local file system. Note, however, that there is no way to keep them from eating up your precious CPU time, so this should not stop you from thinking about what XSLT you execute.

Access control is configured using the XSLTAccessControl class. It can be called with a number of keyword arguments that allow or deny specific operations:

>>> transform = etree.XSLT(honk_doc)
Resolved url honk:test as prefix honk
>>> result = transform(normal_doc)
Resolved url hoi:test as prefix hoi

>>> ac = etree.XSLTAccessControl(read_network=False, read_file=False)
>>> transform = etree.XSLT(honk_doc, access_control=ac)
Resolved url honk:test as prefix honk
>>> result = transform(normal_doc)
Traceback (most recent call last):
  ...
lxml.etree.XSLTApplyError: xsltLoadDocument: read rights for hoi:test denied
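
To check what a given configuration actually allows, the following sketch inspects an access control object. It assumes that XSLTAccessControl exposes an options property and the predefined DENY_ALL and DENY_WRITE instances, as the lxml API reference describes them. Neither is mentioned in the text above, so treat this as an illustration to verify against your lxml version.

```python
from lxml import etree

# Deny reading local files; every other operation keeps its default.
ac = etree.XSLTAccessControl(read_file=False)

# Assumption: "options" reports the configuration as a dict of flags.
for name, allowed in sorted(ac.options.items()):
    print("%-14s %s" % (name, allowed))

# Assumption: predefined configurations for two common cases.
deny_all = etree.XSLTAccessControl.DENY_ALL
deny_write = etree.XSLTAccessControl.DENY_WRITE
print(deny_all.options)
print(deny_write.options)
```

Such an object is passed to etree.XSLT() through the access_control keyword argument, as in the example above.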

There are a few things to keep in mind:

XSL parsing (xsl:import, etc.) is not affected by this mechanism.
read_file=False does not imply write_file=False; all controls are independent.
read_file only applies to files in the file system. Any other scheme for URLs is controlled by the *_network keywords.
If you need more fine-grained control than switching access on and off, you should consider writing a custom document loader that returns empty documents or raises exceptions if access is denied.
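
The last point could look roughly like the following sketch. The class name AllowListResolver and the allowed: and secret: URI schemes are invented for this example. Instead of a truly empty document, the resolver hands back a placeholder document without content for anything outside the allowed prefix. Raising an exception at that point would be the stricter variant.

```python
from io import StringIO
from lxml import etree


class AllowListResolver(etree.Resolver):
    """Hypothetical loader: serves one URI prefix, blanks out the rest."""

    def __init__(self, allowed_prefix):
        self.allowed_prefix = allowed_prefix
        self.denied = []

    def resolve(self, url, id, context):
        if url is not None and url.startswith(self.allowed_prefix):
            return self.resolve_string('<data>hello</data>', context)
        # Denied: remember the attempt and return a contentless placeholder.
        self.denied.append(url)
        return self.resolve_string('<denied/>', context)


resolver = AllowListResolver("allowed:")
parser = etree.XMLParser()
parser.resolvers.add(resolver)

# The resolver belongs to the parser of the stylesheet, because
# document() URIs are resolved from within the stylesheet context.
stylesheet = etree.parse(StringIO('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out>
      <ok><xsl:value-of select="document('allowed:greeting')/data"/></ok>
      <blocked><xsl:value-of select="document('secret:notes')/data"/></blocked>
    </out>
  </xsl:template>
</xsl:stylesheet>
'''), parser)

transform = etree.XSLT(stylesheet)
root = transform(etree.XML('<doc/>')).getroot()

print(etree.tostring(root).decode())
print("denied:", resolver.denied)
```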
