Standards-compliant library for parsing and serializing HTML documents and fragments in Python
codereverser/html5lib-python
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Simple usage follows this pattern:

    import html5lib

    with open("mydocument.html", "rb") as f:
        document = html5lib.parse(f)

or:

    import html5lib

    document = html5lib.parse("<p>Hello World!")

By default, the `document` will be an `xml.etree` element instance. Whenever possible, html5lib chooses the accelerated `ElementTree` implementation (i.e. `xml.etree.cElementTree` on Python 2.x).
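
As a rough illustration of what the default tree looks like, the sketch below parses the same fragment and looks up the paragraph. It assumes the default settings, under which element tags are qualified with the XHTML namespace; the `XHTML` constant and variable names are only for this example.

```python
import html5lib

# The parser repairs the markup the way a browser would, so the missing
# <html>, <head> and <body> elements are created automatically.
document = html5lib.parse("<p>Hello World!")

# With default settings, tags carry the XHTML namespace as a prefix.
XHTML = "{http://www.w3.org/1999/xhtml}"

root_tag = document.tag                       # tag of the returned root element
paragraph = document.find(".//" + XHTML + "p")
paragraph_text = paragraph.text

print(root_tag)
print(paragraph_text)
```

The returned object is the root `html` element, so the usual `ElementTree` methods such as `find` and `iter` apply to it.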
Two other tree types are supported: `xml.dom.minidom` and `lxml.etree`. To use an alternative format, specify the name of a treebuilder:

    import html5lib

    with open("mydocument.html", "rb") as f:
        lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

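
The `xml.dom.minidom` format can be selected the same way and needs no third-party package. The following is a minimal sketch, assuming the treebuilder name `"dom"` and an inline string in place of a file; the variable names are illustrative.

```python
import html5lib

# Ask for an xml.dom.minidom Document instead of the default etree element.
dom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")

# Standard DOM methods are available on the result.
paragraphs = dom_document.getElementsByTagName("p")
first_text = paragraphs[0].firstChild.data

print(len(paragraphs), first_text)
```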
When using html5lib with `urllib2` (Python 2), the charset from the HTTP response should be passed into html5lib as follows:

    from contextlib import closing
    from urllib2 import urlopen
    import html5lib

    with closing(urlopen("http://example.com/")) as f:
        document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with `urllib.request` (Python 3), the charset from the HTTP response should be passed into html5lib as follows:

    from urllib.request import urlopen
    import html5lib

    with urlopen("http://example.com/") as f:
        document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

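
The effect of `transport_encoding` can also be tried without a network connection. The sketch below is an offline stand-in for the HTTP case: the byte string plays the role of a response body, and the charset value is supplied by hand rather than read from a header. Both are assumptions made for illustration.

```python
import html5lib

# A response body encoded as UTF-8, with no BOM and no <meta charset>,
# so the bytes themselves do not declare their encoding.
raw_bytes = "<p>caf\u00e9".encode("utf-8")

# The charset that would normally come from the Content-Type header.
document = html5lib.parse(raw_bytes, transport_encoding="utf-8")

text = document.find(".//{http://www.w3.org/1999/xhtml}p").text
print(text)
```

Without the hint, the parser has to fall back on detection or a default encoding, which may decode the non-ASCII character incorrectly.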
To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

    import html5lib

    with open("mydocument.html", "rb") as f:
        parser = html5lib.HTMLParser(strict=True)
        document = parser.parse(f)

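
To make the difference between the two modes concrete, here is a small sketch that feeds the same input to a strict and a non-strict parser. It assumes that the exception raised in strict mode is `html5lib.html5parser.ParseError` and that a non-strict parser collects its problems in an `errors` list; the input lacks a doctype, which counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, so the parser reports an error

# Strict mode: the first parse error is raised as an exception.
strict_parser = html5lib.HTMLParser(strict=True)
try:
    strict_parser.parse(markup)
    raised = False
except ParseError as exc:
    raised = True
    print("strict parser rejected the input:", exc)

# Default mode: parsing succeeds and the errors are recorded instead.
lenient_parser = html5lib.HTMLParser()
lenient_parser.parse(markup)
error_count = len(lenient_parser.errors)
print("errors recorded by the lenient parser:", error_count)
```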
When you're instantiating parser objects explicitly, pass a treebuilder class as the `tree` keyword argument to use an alternative document format:

    import html5lib

    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    minidom_document = parser.parse("<p>Hello World!")

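
An explicitly created parser can be reused for several inputs. The sketch below pairs the `"etree"` treebuilder with the `namespaceHTMLElements=False` argument of `HTMLParser`, which is not covered above and is used here on the assumption that it leaves tag names unqualified; the sample markup is invented for the example.

```python
import html5lib

# One parser object, configured once and used for two documents.
# namespaceHTMLElements=False keeps tag names plain ("p", "li").
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("etree"),
    namespaceHTMLElements=False,
)

first = parser.parse("<p>Hello World!")
second = parser.parse("<ul><li>one<li>two</ul>")

first_text = first.find(".//p").text
items = [li.text for li in second.iter("li")]

print(first_text, items)
```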
More documentation is available at https://html5lib.readthedocs.io/.
html5lib works on CPython 2.6+, CPython 3.3+ and PyPy. To install it, use:

    $ pip install html5lib

The following third-party libraries may be used for additional functionality:
- `datrie` can be used under CPython to improve parsing performance (though in almost all cases the improvement is marginal);
- `lxml` is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
- `genshi` has a treewalker (but not builder); and
- `chardet` can be used as a fallback when character encoding cannot be determined.
Please report any bugs on the issue tracker.
Unit tests require the `pytest` and `mock` libraries and can be run using the `py.test` command in the root directory; `ordereddict` is required under Python 2.6. All should pass.
Test data are contained in a separate html5lib-tests repository and included as a submodule, so for git checkouts the submodule must be initialized:

    $ git submodule init
    $ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the `tox` utility, which can be found on PyPI.
There's a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.