
lxml.html

Author: Ian Bicking

Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.

The main API is based on the lxml.etree API, and thus, on the ElementTree API.

Parsing HTML

Parsing HTML fragments

There are several functions available to parse HTML:

parse(filename_url_or_file):

Parses the named file or URL, or if the object has a .read() method, parses from that.

If you give a URL, or if the object has a .geturl() method (as file-like objects from urllib.request.urlopen() have), then that URL is used as the base URL. You can also provide an explicit base_url keyword argument.

document_fromstring(string):
Parses a document from the given string. This always creates a correct HTML document, which means the parent node is <html>, and there is a body and possibly a head.
fragment_fromstring(string, create_parent=False):
Returns an HTML fragment from a string. The fragment must contain just a single element, unless create_parent is given; e.g., fragment_fromstring(string, create_parent='div') will wrap the element in a <div>.
fragments_fromstring(string):
Returns a list of the elements found in the fragment.
fromstring(string):
Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document, or just a fragment.
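
As a minimal sketch of how these entry points differ, the following parses small made-up snippets with each function. The sample markup is an assumption chosen for illustration only.

```python
from lxml import html

# A complete document: the root is always <html>, with a <body> inside.
doc = html.document_fromstring("<p>Hello</p>")
has_body = doc.find("body") is not None

# A fragment holding a single element is returned as that element.
para = html.fragment_fromstring("<p>Hello</p>")

# Several top-level elements need a wrapper, requested via create_parent.
wrapped = html.fragment_fromstring("<p>One</p><p>Two</p>", create_parent="div")

# fragments_fromstring returns the top-level elements as a list.
parts = html.fragments_fromstring("<p>One</p><p>Two</p>")

# fromstring chooses document or fragment parsing based on the input.
guessed = html.fromstring("<p>Hello</p>")
```

Here the single-paragraph input is treated as a fragment by fromstring, while document_fromstring wraps the same input in a full html/body structure.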

Really broken pages

The normal HTML parser is capable of handling broken HTML, but for pages that are far enough from HTML to call them 'tag soup', it may still fail to parse the page in a useful way. A way to deal with this is ElementSoup, which deploys the well-known BeautifulSoup parser to build an lxml HTML tree.

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

HTML Element Methods

HTML elements have all the methods that come with ElementTree, but also include some extra methods:

.drop_tree():
Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
.drop_tag():
Drops the tag, but keeps its children and text.
.find_class(class_name):
Returns a list of all the elements with the given CSS class name. Note that class names are space separated in HTML, so doc.find_class('highlight') will find an element like <div class="sidebar highlight">. Class names are case sensitive.
.find_rel_links(rel):
Returns a list of all the <a rel="{rel}"> elements. E.g., doc.find_rel_links('tag') returns all the links marked as tags.
.get_element_by_id(id, default=None):
Returns the element with the given id, or the default if none is found. If there are multiple elements with the same id (which there shouldn't be, but there often are), this returns only the first.
.text_content():
Returns the text content of the element, including the text content of its children, with no markup.
.cssselect(expr):
Select elements from this element and its children, using a CSS selector expression. (Note that .xpath(expr) is also available as on all lxml elements.)
.label:
Returns the corresponding <label> element for this element, if any exists (None if there is none). Label elements have a label.for_element attribute that points back to the element.
.base_url:
The base URL for this element, if one was saved from the parsing. This attribute is not settable. It is None when no base URL was saved.
.classes:
Returns a set-like object that allows accessing and modifying the names in the 'class' attribute of the element. (New in lxml 3.5).
.set(key, value=None):
Sets an HTML attribute. If no value is given, or if the value is None, it creates a boolean attribute like <form novalidate></form> or <div custom-attribute></div>. In XML, attributes must have at least the empty string as their value like <form novalidate=""></form>, but HTML boolean attributes can also be just present or absent from an element without having a value.
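
A short sketch of several of these methods applied to a made-up fragment. The markup, the id, and the class names are illustrative assumptions, not anything lxml defines.

```python
from lxml import html

root = html.fragment_fromstring(
    '<div><p id="intro" class="sidebar highlight">'
    'Hello <b>bold</b> world <i>gone</i> tail</p></div>'
)

# Look up by id and by one of several space-separated class names.
para = root.get_element_by_id("intro")
found = root.find_class("highlight")

# text_content() flattens the element and its children to plain text.
text_before = para.text_content()

# drop_tag() removes only the <b> wrapper and keeps its text;
# drop_tree() removes <i> with its content but keeps the tail text after it.
para.find("b").drop_tag()
para.find("i").drop_tree()
text_after = para.text_content()

# .classes behaves like a set backed by the class attribute.
para.classes.add("active")
para.classes.discard("highlight")

# set() without a value creates a boolean attribute.
para.set("hidden")
serialized = html.tostring(para, encoding="unicode")
```

After the two drop calls the paragraph contains plain text only: the word inside <b> survives, the word inside <i> is gone, and the tail text that followed <i> is still present.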

Running HTML doctests

One of the interesting modules in the lxml.html package deals with doctests. It can be hard to compare two HTML pages for equality, as whitespace differences aren't meaningful and the structural formatting can differ. This is even more of a problem in doctests, where output is tested for equality and small differences in whitespace or the order of attributes can make a test fail. And given the verbosity of tag-based languages, it may take more than a quick look to find the actual differences in the doctest output.

Luckily, lxml provides the lxml.doctestcompare module that supports relaxed comparison of XML and HTML pages and provides a readable diff in the output when a test fails. The HTML comparison is most easily used by importing the usedoctest module in a doctest:

>>> import lxml.html.usedoctest

Now, if you have an HTML document and want to compare it to an expected result document in a doctest, you can do the following:

>>> import lxml.html
>>> html = lxml.html.fromstring('''\
...    <html><body onload="" color="white">
...      <p>Hi  !</p>
...    </body></html>
... ''')

>>> print(lxml.html.tostring(html, encoding="unicode"))
<html><body onload="" color="white"><p>Hi !</p></body></html>

>>> print(lxml.html.tostring(html, encoding="unicode"))
<html> <body color="white" onload=""> <p>Hi    !</p> </body> </html>

>>> print(lxml.html.tostring(html, encoding="unicode"))
<html>
  <body color="white" onload="">
    <p>Hi !</p>
  </body>
</html>

In documentation, you would likely prefer the pretty printed HTML output, as it is the most readable. However, the three documents are equivalent from the point of view of an HTML tool, so the doctest will silently accept any of the above. This allows you to concentrate on readability in your doctests, even if the real output is a straight ugly HTML one-liner.

Note that there is also an lxml.usedoctest module which you can import for XML comparisons. The HTML parser notably ignores namespaces and some other XMLisms.
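
The relaxed comparison can also be tried outside a doctest. The following is a minimal sketch that calls the HTML output checker from lxml.doctestcompare directly; using the checker this way, rather than through usedoctest, is an assumption made for illustration, and the sample strings mirror the example above.

```python
from lxml.doctestcompare import LHTMLOutputChecker

checker = LHTMLOutputChecker()

# Pretty-printed expectation, as you might write it in documentation.
want = (
    '<html>\n'
    '  <body color="white" onload="">\n'
    '    <p>Hi !</p>\n'
    '  </body>\n'
    '</html>'
)

# Actual output: one line, different attribute order, extra whitespace.
got = '<html><body onload="" color="white"><p>Hi  !</p></body></html>'

# The third argument is the doctest option flag set; 0 means no flags.
same = checker.check_output(want, got, 0)
different = checker.check_output(want, got.replace("Hi", "Bye"), 0)
```

The first comparison succeeds despite the formatting differences, while a change in the text content is still reported as a mismatch.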

Creating HTML with the E-factory

lxml.html comes with a predefined HTML vocabulary for the E-factory, originally written by Fredrik Lundh. This allows you to quickly generate HTML pages and fragments:

>>> import lxml.html
>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
...   E.HEAD(
...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
...     E.TITLE("Best Page Ever")
...   ),
...   E.BODY(
...     E.H1(E.CLASS("heading"), "Top News"),
...     E.P("World News only on this page", style="font-size: 200%"),
...     "Ah, and here's some more text, by the way.",
...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
...   )
... )

>>> print(lxml.html.tostring(html, encoding="unicode"))
<html>
  <head>
    <link href="great.css" rel="stylesheet" type="text/css">
    <title>Best Page Ever</title>
  </head>
  <body>
    <h1 class="heading">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here's some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
  </body>
</html>

Note that you should use lxml.html.tostring and not lxml.etree.tostring. lxml.etree.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.

Viewing your HTML

A handy method for viewing your HTML: lxml.html.open_in_browser(lxml_doc) will write the document to disk and open it in a browser (with the webbrowser module).

Working with links

There are several methods on elements that allow you to see and modify the links in a document.

.iterlinks():

This yields (element, attribute, link, pos) for every link in the document. attribute may be None if the link is in the text (as will be the case with a <style> tag with @import).

This finds any link in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute. It also searches style attributes for url(link), and <style> tags for @import and url().

This function does not pay attention to <base href>.

.resolve_base_href():
This function will modify the document in-place to take account of <base href> if the document contains that tag. In the process it will also remove that tag from the document.
.make_links_absolute(base_href, resolve_base_href=True):

This makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html, that link will be rewritten as http://localhost/foo/baz.html.

If resolve_base_href is true, then any <base href> tag will be taken into account (just calling self.resolve_base_href()).
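
As a minimal sketch of iterlinks() and make_links_absolute() together, the following lists the links of a made-up fragment before and after making them absolute. The markup and file names are assumptions; the base URL is the one used in the description above.

```python
from lxml import html

doc = html.fromstring(
    '<div><a href="baz.html">relative</a> '
    '<img src="/img/logo.png"> '
    '<a href="#top">anchor</a></div>'
)

# Each item is (element, attribute, link, pos); keep the readable parts.
before = [(el.tag, attr, link) for el, attr, link, pos in doc.iterlinks()]

# Rewrite every link in place relative to the document's own URL.
doc.make_links_absolute("http://localhost/foo/bar.html")
after = [link for el, attr, link, pos in doc.iterlinks()]
```

Note that the purely internal "#top" link is processed too, so it ends up joined with the document URL.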

.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None):

This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.

For each link, link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or [...]

[...]

>>> from lxml.html import fromstring, tostring
>>> form_page = fromstring('''<html><body><form>
...   Your name: <input type="text" name="name"> <br>
...   Your phone: <input type="text" name="phone"> <br>
...   Your favorite pets: <br>
...   Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
...   Cats: <input type="checkbox" name="interest" value="cats"> <br>
...   Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
...   <input type="submit"></form></body></html>''')
>>> form = form_page.forms[0]
>>> form.fields = dict(
...     name='John Smith',
...     phone='555-555-3949',
...     interest=set(['cats', 'llamas']))
>>> print(tostring(form, encoding="unicode"))
<html>
  <body>
    <form>
    Your name: <input name="name" type="text" value="John Smith"> <br>
    Your phone: <input name="phone" type="text" value="555-555-3949"> <br>
    Your favorite pets: <br>
    Dogs: <input name="interest" type="checkbox" value="dogs"> <br>
    Cats: <input checked name="interest" type="checkbox" value="cats"> <br>
    Llamas: <input checked name="interest" type="checkbox" value="llamas"> <br>
    <input type="submit">
    </form>
  </body>
</html>
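
Once a form has been filled in as in the example above, the values that would be sent can be read back with form_values(). The following is a small sketch on a made-up form; the action URL and field names are assumptions for illustration.

```python
from lxml import html

page = html.fromstring(
    '<html><body><form action="/signup" method="post">'
    '<input type="text" name="name">'
    '<input type="checkbox" name="interest" value="dogs">'
    '<input type="checkbox" name="interest" value="cats">'
    '<input type="submit" name="go" value="Send">'
    '</form></body></html>'
)

form = page.forms[0]

# Text inputs take a string; a group of checkboxes takes a set of values.
form.fields["name"] = "John Smith"
form.fields["interest"] = {"cats"}

# form_values() lists the (name, value) pairs a browser would submit.
# Unchecked checkboxes and submit buttons are left out.
values = form.form_values()
method = form.method
```

The submit button is missing from the result, which is why submit values have to be supplied separately when a form is submitted.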

Form Submission

You can submit a form with lxml.html.submit_form(form_element). This will return a file-like object (the result of urllib.request.urlopen()).

If you have extra input values you want to pass, you can use the keyword argument extra_values, like extra_values={'submit': 'Yes!'}. This is the only way to get submit values into the form, as there is no state of "submitted" for these elements.

You can pass in an alternate opener with the open_http keyword argument, which is a function with the signature open_http(method, url, values).

Example:

>>> from lxml.html import parse, submit_form
>>> page = parse('http://tinyurl.com').getroot()
>>> page.forms[0].fields['url'] = 'http://lxml.de/'
>>> result = parse(submit_form(page.forms[0])).getroot()
>>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
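
The example above needs network access. To see what submit_form() passes to an opener without sending anything, a stub can be supplied through open_http. This is a sketch only: fake_open_http, the form markup, and the /signup URL are hypothetical names introduced for illustration.

```python
from lxml import html
from lxml.html import submit_form

page = html.fromstring(
    '<html><body><form action="/signup" method="post">'
    '<input type="text" name="name" value="John Smith">'
    '<input type="submit" name="submit" value="Yes!">'
    '</form></body></html>'
)

calls = []

def fake_open_http(method, url, values):
    # Hypothetical opener: record the request instead of sending it.
    calls.append((method, url, list(values)))
    return None

# The submit value is not part of the form state, so pass it explicitly.
submit_form(
    page.forms[0],
    extra_values={"submit": "Yes!"},
    open_http=fake_open_http,
)
```

The recorded call shows the form method, the action URL, and the regular form values followed by the extra values.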

HTML Diff

The module lxml.html.diff offers some ways to visualize differences in HTML documents. These differences are content oriented. That is, changes in markup are largely ignored; only changes in the content itself are highlighted.

There are two ways to view differences: htmldiff and html_annotate. One shows differences with <ins> and <del>, while the other annotates a set of changes similar to svn blame. Both these functions operate on text, and work best with content fragments (only what goes in <body>), not complete documents.

Example of htmldiff:

>>> from lxml.html.diff import htmldiff, html_annotate
>>> doc1 = '''<p>Here is some text.</p>'''
>>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
>>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>'''
>>> print(htmldiff(doc1, doc2))
<p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>
>>> print(html_annotate([(doc1, 'author1'), (doc2, 'author2'),
...                      (doc3, 'author3')]))
<p><span title="author1">Here is</span>
   <b><span title="author2">a</span>
   <span title="author3">little</span></b>
   <i><span title="author2">text</span></i>
   <span title="author2">.</span></p>

As you can see, it is imperfect as such things tend to be. On larger tracts of text with larger edits it will generally do better.

The html_annotate function can also take an optional second argument, markup. This is a function like markup(text, version) that returns the given text marked up with the given version. The default version, the output of which you see in the example, looks like:

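import html

def default_markup(text, version):
    # html.escape(str(...), quote=True) is the Python 3 spelling of
    # cgi.escape(unicode(...), 1).
    return '<span title="%s">%s</span>' % (
        html.escape(str(version), quote=True), text)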
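
A custom markup function can be swapped in the same way. The sketch below marks each run of text with a CSS class instead of a title attribute; class_markup and the "by-" class prefix are hypothetical names chosen for illustration.

```python
from html import escape
from lxml.html.diff import htmldiff, html_annotate

doc1 = '<p>Here is some text.</p>'
doc2 = '<p>Here is <b>a lot</b> of <i>text</i>.</p>'

# Insertions and deletions are wrapped in <ins> and <del>.
diff = htmldiff(doc1, doc2)

def class_markup(text, version):
    # Hypothetical alternative to the default markup function.
    return '<span class="by-%s">%s</span>' % (
        escape(str(version), quote=True), text)

annotated = html_annotate(
    [(doc1, 'author1'), (doc2, 'author2')],
    markup=class_markup,
)
```

The exact grouping of words in the output depends on the diff algorithm, so it is safer to check for the presence of the markers than for a literal string.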

Examples

Microformat Example

This example parses the hCard microformat.

First we get the page:

>>> import urllib.request
>>> from lxml.html import fromstring
>>> url = 'http://microformats.org/'
>>> content = urllib.request.urlopen(url).read()
>>> doc = fromstring(content)
>>> doc.make_links_absolute(url)

Then we create some objects to put the information in:

>>> class Card(object):
...     def __init__(self, **kw):
...         # Iterate over key/value pairs, not just the keys.
...         for name, value in kw.items():
...             setattr(self, name, value)
>>> class Phone(object):
...     def __init__(self, phone, types=()):
...         self.phone, self.types = phone, types

And some generally handy functions for microformats:

>>> def get_text(el, class_name):
...     els = el.find_class(class_name)
...     if els:
...         return els[0].text_content()
...     else:
...         return ''
>>> def get_value(el):
...     return get_text(el, 'value') or el.text_content()
>>> def get_all_texts(el, class_name):
...     return [e.text_content() for e in el.find_class(class_name)]
>>> def parse_addresses(el):
...     # Ideally this would parse street, etc.
...     return el.find_class('adr')

Then the parsing:

>>> for el in doc.find_class('hcard'):
...     card = Card()
...     card.el = el
...     card.fn = get_text(el, 'fn')
...     card.tels = []
...     # Search the parsed element; the Card object has no find_class().
...     for tel_el in el.find_class('tel'):
...         card.tels.append(Phone(get_value(tel_el),
...                                get_all_texts(tel_el, 'type')))
...     card.addresses = parse_addresses(el)
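
The same parsing steps can be tried without fetching a page. The following is a compact offline sketch that uses an inline sample instead of the live site and plain dictionaries instead of the Card and Phone classes. The sample markup, the name, and the phone numbers are made up for illustration.

```python
from lxml.html import fromstring

# Made-up sample using the class names from the example above.
SAMPLE = '''
<div class="hcard">
  <span class="fn">Jane Doe</span>
  <span class="tel"><span class="type">home</span> <span class="value">555-0100</span></span>
  <span class="tel">555-0199</span>
</div>
'''

def get_text(el, class_name):
    # Text of the first descendant with the given class, or '' if none.
    els = el.find_class(class_name)
    return els[0].text_content() if els else ''

def get_value(el):
    # Prefer an explicit "value" child; fall back to the whole text.
    return get_text(el, 'value') or el.text_content()

doc = fromstring(SAMPLE)
cards = []
for el in doc.find_class('hcard'):
    tels = [
        (get_value(tel_el), [t.text_content() for t in tel_el.find_class('type')])
        for tel_el in el.find_class('tel')
    ]
    cards.append({'fn': get_text(el, 'fn'), 'tels': tels})
```

Each card ends up with its formatted name and a list of (number, types) pairs, where a phone entry without a "value" child falls back to its full text.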

Generated on: 2024-08-10.
