Standards-compliant library for parsing and serializing HTML documents and fragments in Python
twm/html5lib-python
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Simple usage follows this pattern:
    import html5lib

    with open("mydocument.html", "rb") as f:
        document = html5lib.parse(f)
or:
    import html5lib

    document = html5lib.parse("<p>Hello World!")
By default, the `document` will be an `xml.etree` element instance. Whenever possible, html5lib chooses the accelerated `ElementTree` implementation (i.e. `xml.etree.cElementTree` on Python 2.x).
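As a minimal sketch of what that default result looks like, the snippet below parses the same fragment and reads the paragraph text back out with the standard `ElementTree` API. It assumes the default behaviour of placing HTML elements in the XHTML namespace; the `XHTML` constant is only a local helper name for this example.

```python
import html5lib

document = html5lib.parse("<p>Hello World!")

# With the default settings, element names carry the XHTML namespace,
# so lookups need the namespace prefix in Clark notation.
XHTML = "{http://www.w3.org/1999/xhtml}"

# The parser supplies the implied <html>, <head> and <body> elements,
# so the paragraph is found below the root rather than being the root.
paragraph = document.find(".//" + XHTML + "p")

print(document.tag)
print(paragraph.text)
```

If the namespace prefix is inconvenient, passing `namespaceHTMLElements=False` to `html5lib.parse` should yield plain tag names such as `p`.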
Two other tree types are supported: `xml.dom.minidom` and `lxml.etree`. To use an alternative format, specify the name of a treebuilder:
    import html5lib

    with open("mydocument.html", "rb") as f:
        lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When using html5lib with `urllib2` (Python 2), the charset from the HTTP response should be passed into html5lib as follows:
    from contextlib import closing
    from urllib2 import urlopen
    import html5lib

    with closing(urlopen("http://example.com/")) as f:
        document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))
When using html5lib with `urllib.request` (Python 3), the charset from the HTTP response should be passed into html5lib as follows:
    from urllib.request import urlopen
    import html5lib

    with urlopen("http://example.com/") as f:
        document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
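The effect of `transport_encoding` can also be seen without a network connection. The sketch below is an illustration only: the byte string stands in for a response body, and the `"utf-8"` value stands in for a charset that would normally come from the HTTP `Content-Type` header.

```python
import html5lib

# Stand-in for a response body; a real program would read this from the network.
body = "<p>café".encode("utf-8")

# Stand-in for the charset reported by the HTTP Content-Type header.
charset = "utf-8"

document = html5lib.parse(
    body,
    transport_encoding=charset,
    namespaceHTMLElements=False,  # plain tag names keep the lookup short
)

text = document.find(".//p").text
print(text)
```

Because the input contains no byte order mark or `<meta charset>` declaration, the transport-level charset is what tells the parser how to decode the bytes here.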
To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:
    import html5lib

    with open("mydocument.html", "rb") as f:
        parser = html5lib.HTMLParser(strict=True)
        document = parser.parse(f)
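To make the difference between strict and default parsing concrete, here is a small sketch that feeds the same fragment to both kinds of parser. It assumes the exception class is importable as `html5lib.html5parser.ParseError` and that a non-strict parser records problems in its `errors` list; the variable names are illustrative.

```python
import html5lib
from html5lib.html5parser import ParseError

fragment = "<p>Hello World!"  # no doctype, which counts as a parse error

# Strict mode: the first parse error is raised as an exception.
strict_failed = False
try:
    html5lib.HTMLParser(strict=True).parse(fragment)
except ParseError as error:
    strict_failed = True
    print("strict parser raised:", error)

# Default mode: parsing recovers as a browser would, and the
# errors are collected on the parser object instead.
lenient = html5lib.HTMLParser()
document = lenient.parse(fragment)
print("errors recorded by the lenient parser:", len(lenient.errors))
```

Strict mode suits validation-style checks, while the default mode is usually the better fit for real-world markup that browsers would accept.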
When you're instantiating parser objects explicitly, pass a treebuilder class as the `tree` keyword argument to use an alternative document format:
    import html5lib

    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    minidom_document = parser.parse("<p>Hello World!")
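As a short follow-up sketch, the resulting object can be queried with the usual `xml.dom.minidom` methods. The `paragraph` name below is only for illustration.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# The result is a DOM document, so standard DOM lookups apply.
paragraph = minidom_document.getElementsByTagName("p")[0]

# The paragraph's only child is a text node holding the content.
print(paragraph.tagName)
print(paragraph.firstChild.data)
```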
More documentation is available at https://html5lib.readthedocs.io/.
html5lib works on CPython 2.6+, CPython 3.3+ and PyPy. To install it, use:

    $ pip install html5lib
The following third-party libraries may be used for additional functionality:

- `datrie` can be used under CPython to improve parsing performance (though in almost all cases the improvement is marginal);
- `lxml` is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
- `genshi` has a treewalker (but not builder); and
- `chardet` can be used as a fallback when character encoding cannot be determined.
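To see which of these optional libraries are present in a given environment, a quick check with the standard library is enough. This is a sketch rather than anything html5lib provides: `available_optional_dependencies` is a hypothetical helper name, and it only tests whether each package can be found, not whether html5lib will use it.

```python
import importlib.util

# Import names of the optional libraries listed above.
OPTIONAL = ("datrie", "lxml", "genshi", "chardet")


def available_optional_dependencies(names=OPTIONAL):
    """Map each top-level package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = available_optional_dependencies()
for name, found in status.items():
    print(name, "installed" if found else "not installed")
```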
Please report any bugs on the issue tracker.
Unit tests require the `pytest` and `mock` libraries and can be run using the `py.test` command in the root directory; `ordereddict` is required under Python 2.6. All should pass.
Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:
    $ git submodule init
    $ git submodule update
If you have all compatible Python implementations available on your system, you can run tests on all of them using the `tox` utility, which can be found on PyPI.
There's a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.