Author: Ian Bicking
Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
The main API is based on the lxml.etree API, and thus, on the ElementTree API.
There are several functions available to parse HTML:
The parse() function parses the named file or URL, or if the object has a .read() method, parses from that.
If you give a URL, or if the object has a .geturl() method (as file-like objects from urllib.urlopen() have), then that URL is used as the base URL. You can also provide an explicit base_url keyword argument.
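As a minimal sketch of the base URL handling described above, the following parses from an in-memory file-like object and passes base_url explicitly. The HTML snippet and the example.com URL are made up for illustration.

```python
import io
import lxml.html

# Hypothetical document content; any object with a .read() method works.
source = io.BytesIO(b"<html><body><a href='page2.html'>next</a></body></html>")

# parse() returns an ElementTree; the explicit base_url is recorded on it.
tree = lxml.html.parse(source, base_url="http://example.com/docs/index.html")
root = tree.getroot()

print(root.tag)          # html
print(tree.docinfo.URL)  # the base URL recorded for the document
```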
The normal HTML parser is capable of handling broken HTML, but for pages that are far enough from HTML to call them 'tag soup', it may still fail to parse the page in a useful way. A way to deal with this is ElementSoup, which deploys the well-known BeautifulSoup parser to build an lxml HTML tree.
However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.
HTML elements have all the methods that come with ElementTree, but also include some extra methods:
One of the interesting modules in the lxml.html package deals with doctests. It can be hard to compare two HTML pages for equality, as whitespace differences aren't meaningful and the structural formatting can differ. This is even more of a problem in doctests, where output is tested for equality and small differences in whitespace or the order of attributes can make a test fail. And given the verbosity of tag-based languages, it may take more than a quick look to find the actual differences in the doctest output.
Luckily, lxml provides the lxml.doctestcompare module that supports relaxed comparison of XML and HTML pages and provides a readable diff in the output when a test fails. The HTML comparison is most easily used by importing the usedoctest module in a doctest:
>>> import lxml.html.usedoctest
Now, if you have an HTML document and want to compare it to an expected result document in a doctest, you can do the following:
>>> import lxml.html
>>> html = lxml.html.fromstring('''\
...    <html><body onload="" color="white">
...      <p>Hi !</p>
...    </body></html>
... ''')
>>> print(lxml.html.tostring(html))
<html><body onload="" color="white"><p>Hi !</p></body></html>
>>> print(lxml.html.tostring(html))
<html> <body color="white" onload=""> <p>Hi !</p> </body> </html>
>>> print(lxml.html.tostring(html))
<html>
  <body color="white" onload="">
    <p>Hi !</p>
  </body>
</html>
In documentation, you would likely prefer the pretty printed HTML output, as it is the most readable. However, the three documents are equivalent from the point of view of an HTML tool, so the doctest will silently accept any of the above. This allows you to concentrate on readability in your doctests, even if the real output is a straight ugly HTML one-liner.
Note that there is also an lxml.usedoctest module which you can import for XML comparisons. The HTML parser notably ignores namespaces and some other XMLisms.
lxml.html comes with a predefined HTML vocabulary for the E-factory, originally written by Fredrik Lundh. This allows you to quickly generate HTML pages and fragments:
>>> import lxml.html
>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
...   E.HEAD(
...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
...     E.TITLE("Best Page Ever")
...   ),
...   E.BODY(
...     E.H1(E.CLASS("heading"), "Top News"),
...     E.P("World News only on this page", style="font-size: 200%"),
...     "Ah, and here's some more text, by the way.",
...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
...   )
... )
>>> print(lxml.html.tostring(html))
<html>
  <head>
    <link href="great.css" rel="stylesheet" type="text/css">
    <title>Best Page Ever</title>
  </head>
  <body>
    <h1 class="heading">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here's some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
  </body>
</html>
Note that you should use lxml.html.tostring and not lxml.etree.tostring. lxml.etree.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
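The difference can be seen with a small sketch that serializes the same tree both ways. The fragment and the file name app.js are placeholders.

```python
import lxml.html
from lxml import etree

# Hypothetical fragment with an empty <script> element.
doc = lxml.html.fromstring('<div><script src="app.js"></script></div>')

# HTML serialization keeps the explicit closing tag.
as_html = lxml.html.tostring(doc, encoding="unicode")

# XML serialization collapses the empty element into a self-closing tag.
as_xml = etree.tostring(doc, encoding="unicode")

print(as_html)
print(as_xml)
```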
A handy method for viewing your HTML: lxml.html.open_in_browser(lxml_doc) will write the document to disk and open it in a browser (with the webbrowser module).
There are several methods on elements that allow you to see and modifythe links in a document.
The iterlinks() method yields (element, attribute, link, pos) for every link in the document. attribute may be None if the link is in the text (as will be the case with a <style> tag with @import).
This finds any link in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute. It also searches style attributes for url(link), and <style> tags for @import and url().
This function does not pay attention to <base href>.
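A short sketch of iterating over links with iterlinks(), using an invented fragment with one href and one src attribute:

```python
import lxml.html

# Hypothetical fragment containing two link-bearing attributes.
doc = lxml.html.fromstring(
    '<div><a href="/about">About</a><img src="logo.png"></div>')

# Each item is (element, attribute, link, pos); pos is the offset of the
# link within the attribute value or text.
links = [(el.tag, attr, link) for el, attr, link, pos in doc.iterlinks()]

for tag, attr, link in links:
    print(tag, attr, link)
```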
The make_links_absolute() method makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html, that will be rewritten as http://localhost/foo/baz.html.
If resolve_base_href is true, then any <base href> tag will be taken into account (just calling self.resolve_base_href()).
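The rewriting described above can be tried out with a minimal sketch. The fragment is made up, and the document URL is passed positionally.

```python
import lxml.html

# Hypothetical fragment with a relative link.
doc = lxml.html.fromstring('<div><a href="baz.html">baz</a></div>')

# Resolve every link against the URL the document was (notionally) loaded from.
doc.make_links_absolute("http://localhost/foo/bar.html")

href = doc.find(".//a").get("href")
print(href)  # http://localhost/foo/baz.html
```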
The rewrite_links() method rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.
For each link, link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com".
Form fields can be filled in by assigning to the fields attribute of a form element:
>>> from lxml.html import fromstring, tostring
>>> form_page = fromstring('''<html><body><form>
...   Your name: <input type="text" name="name"> <br>
...   Your phone: <input type="text" name="phone"> <br>
...   Your favorite pets: <br>
...   Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
...   Cats: <input type="checkbox" name="interest" value="cats"> <br>
...   Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
...   <input type="submit"></form></body></html>''')
>>> form = form_page.forms[0]
>>> form.fields = dict(
...     name='John Smith',
...     phone='555-555-3949',
...     interest=set(['cats', 'llamas']))
>>> print(tostring(form))
<html>
  <body>
    <form>
    Your name:
      <input name="name" type="text" value="John Smith">
      <br>Your phone:
      <input name="phone" type="text" value="555-555-3949">
      <br>Your favorite pets:
      <br>Dogs:
      <input name="interest" type="checkbox" value="dogs">
      <br>Cats:
      <input checked name="interest" type="checkbox" value="cats">
      <br>Llamas:
      <input checked name="interest" type="checkbox" value="llamas">
      <br>
      <input type="submit"></form>
  </body>
</html>
You can submit a form with lxml.html.submit_form(form_element). This will return a file-like object (the result of urllib.urlopen()).
If you have extra input values you want to pass, you can use the keyword argument extra_values, like extra_values={'submit': 'Yes!'}. This is the only way to get submit values into the form, as there is no state of "submitted" for these elements.
You can pass in an alternate opener with the open_http keyword argument, which is a function with the signature open_http(method, url, values).
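The following sketch shows how extra_values and open_http fit together without touching the network. The form markup, the example.com action URL and the fake_open_http helper are all hypothetical; the helper only records what it is given instead of performing a request.

```python
import lxml.html
from lxml.html import submit_form

# Hypothetical page with a single GET form.
page = lxml.html.fromstring('''<html><body>
<form action="http://example.com/search" method="get">
  <input type="text" name="q" value="lxml">
  <input type="submit" name="go" value="Search">
</form></body></html>''')
form = page.forms[0]

captured = {}

def fake_open_http(method, url, values):
    # Stand-in opener (an assumption for this sketch): record the request
    # instead of sending it.
    captured["method"] = method
    captured["url"] = url
    captured["values"] = list(values)
    return None

# Submit buttons are not part of the form values, so pass them explicitly.
submit_form(form, extra_values={"go": "Search"}, open_http=fake_open_http)

print(captured["method"], captured["url"])
print(captured["values"])
```

A real opener would perform the HTTP request and return a file-like object, which can then be handed to parse().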
Example:
>>> from lxml.html import parse, submit_form
>>> page = parse('http://tinyurl.com').getroot()
>>> page.forms[0].fields['url'] = 'http://lxml.de/'
>>> result = parse(submit_form(page.forms[0])).getroot()
>>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
The module lxml.html.diff offers some ways to visualize differences in HTML documents. These differences are content oriented. That is, changes in markup are largely ignored; only changes in the content itself are highlighted.
There are two ways to view differences: htmldiff and html_annotate. One shows differences with <ins> and <del>, while the other annotates a set of changes similar to svn blame. Both these functions operate on text, and work best with content fragments (only what goes in <body>), not complete documents.
Example of htmldiff:
>>> from lxml.html.diff import htmldiff, html_annotate
>>> doc1 = '''<p>Here is some text.</p>'''
>>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
>>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>'''
>>> print(htmldiff(doc1, doc2))
<p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>
>>> print(html_annotate([(doc1, 'author1'), (doc2, 'author2'),
...                      (doc3, 'author3')]))
<p><span title="author1">Here is</span>
   <b><span title="author2">a</span>
   <span title="author3">little</span></b>
   <i><span title="author2">text</span></i>
   <span title="author2">.</span></p>
As you can see, it is imperfect, as such things tend to be. On larger tracts of text with larger edits it will generally do better.
The html_annotate function can also take an optional second argument, markup. This is a function like markup(text, version) that returns the given text marked up with the given version. The default version, the output of which you see in the example, looks like:
def default_markup(text, version):
    return '<span title="%s">%s</span>' % (
        cgi.escape(unicode(version), 1), text)
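A custom markup function might look like the sketch below. The class_markup name, the "by-" class prefix and the two sample revisions are invented for illustration; html.escape from the standard library stands in for the escaping done by the default.

```python
from html import escape
from lxml.html.diff import html_annotate

def class_markup(text, version):
    # Hypothetical alternative to the default: tag each run of text with a
    # CSS class derived from the version label instead of a title attribute.
    return '<span class="by-%s">%s</span>' % (
        escape(str(version), quote=True), text)

# Two made-up revisions of the same fragment, oldest first.
revisions = [
    ("<p>Hello world</p>", "alice"),
    ("<p>Hello brave world</p>", "bob"),
]

annotated = html_annotate(revisions, markup=class_markup)
print(annotated)
```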
This example parses the hCard microformat.
First we get the page:
>>> import urllib
>>> from lxml.html import fromstring
>>> url = 'http://microformats.org/'
>>> content = urllib.urlopen(url).read()
>>> doc = fromstring(content)
>>> doc.make_links_absolute(url)
Then we create some objects to put the information in:
>>> class Card(object):
...     def __init__(self, **kw):
...         for name, value in kw.items():
...             setattr(self, name, value)
>>> class Phone(object):
...     def __init__(self, phone, types=()):
...         self.phone, self.types = phone, types
And some generally handy functions for microformats:
>>> def get_text(el, class_name):
...     els = el.find_class(class_name)
...     if els:
...         return els[0].text_content()
...     else:
...         return ''
>>> def get_value(el):
...     return get_text(el, 'value') or el.text_content()
>>> def get_all_texts(el, class_name):
...     return [e.text_content() for e in el.find_class(class_name)]
>>> def parse_addresses(el):
...     # Ideally this would parse street, etc.
...     return el.find_class('adr')
Then the parsing:
>>> for el in doc.find_class('hcard'):
...     card = Card()
...     card.el = el
...     card.fn = get_text(el, 'fn')
...     card.tels = []
...     for tel_el in el.find_class('tel'):
...         card.tels.append(Phone(get_value(tel_el),
...                                get_all_texts(tel_el, 'type')))
...     card.addresses = parse_addresses(el)