html5lib
========

+ .. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
+   :target: https://travis-ci.org/html5lib/html5lib-python
+
html5lib is a pure-python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.


- Requirements
- ------------
+ Usage
+ -----

- Python 2.6 and above as well as Python 3.0 and above are
- supported. Implementations known to work are CPython (as the reference
- implementation) and PyPy. Jython is known *not* to work due to various
- bugs in its implementation of the language. Others such as IronPython
- may or may not work; if you wish to try, you are strongly encouraged
- to run the testsuite and report back!
+ Simple usage follows this pattern:

- The only required library dependency is ``six``, this can be found
- packaged in PyPI.
+ .. code-block:: python

- Optionally:
+   import html5lib
+   with open("mydocument.html", "rb") as f:
+       document = html5lib.parse(f)

- - ``datrie`` can be used to improve parsing performance (though in
-   almost all cases the improvement is marginal);
+ or:

- - ``lxml`` is supported as a tree format (for both building and
-   walking) under CPython (but *not* PyPy where it is known to cause
-   segfaults);
+ .. code-block:: python

- - ``genshi`` has a treewalker (but not builder); and
+   import html5lib
+   document = html5lib.parse("<p>Hello World!")

- - ``charade`` can be used as a fallback when character encoding cannot
-   be determined; ``chardet``, from which it was forked, can also be used
-   on Python 2.
+ By default, the ``document`` will be an ``xml.etree`` element instance.
+ Whenever possible, html5lib chooses the accelerated ``ElementTree``
+ implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
+
+ Two other tree types are supported: ``xml.dom.minidom`` and
+ ``lxml.etree``. To use an alternative format, specify the name of
+ a treebuilder:
+
+ .. code-block:: python
+
+   import html5lib
+   with open("mydocument.html", "rb") as f:
+       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
+
+ To have more control over the parser, create a parser object explicitly.
+ For instance, to make the parser raise exceptions on parse errors, use:
+
+ .. code-block:: python
+
+   import html5lib
+   with open("mydocument.html", "rb") as f:
+       parser = html5lib.HTMLParser(strict=True)
+       document = parser.parse(f)
+
+ When you're instantiating parser objects explicitly, pass a treebuilder
+ class as the ``tree`` keyword argument to use an alternative document
+ format:
+
+ .. code-block:: python
+
+   import html5lib
+   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+   minidom_document = parser.parse("<p>Hello World!")
+
+ More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

- html5lib is packaged with distutils. To install it use::
+ html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
+ use:

-   $ python setup.py install
+ .. code-block:: bash

+   $ pip install html5lib

- Usage
- -----

- Simple usage follows this pattern::
+ Optional Dependencies
+ ---------------------

-   import html5lib
-   with open("mydocument.html", "r") as fp:
-       document = html5lib.parse(f)
+ The following third-party libraries may be used for additional
+ functionality:

- or::
+ - ``datrie`` can be used to improve parsing performance (though in
+   almost all cases the improvement is marginal);

-   import html5lib
-   document = html5lib.parse("<p>Hello World!")
+ - ``lxml`` is supported as a tree format (for both building and
+   walking) under CPython (but *not* PyPy where it is known to cause
+   segfaults);

- More documentation is available in the docstrings.
+ - ``genshi`` has a treewalker (but not builder); and
+
+ - ``charade`` can be used as a fallback when character encoding cannot
+   be determined; ``chardet``, from which it was forked, can also be used
+   on Python 2.


Bugs
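
As a rough illustration of the usage text added in this hunk, the sketch below parses a fragment with the default ``xml.etree`` tree and then retries it with a strict parser. The helper names ``paragraph_texts`` and ``parses_cleanly`` are hypothetical and not part of html5lib. It also assumes the default namespaced element tags and that ``ParseError`` is importable from ``html5lib.html5parser``, which may differ between releases.

```python
import html5lib
from html5lib.html5parser import ParseError

# Assumption: with default settings, html5lib puts HTML elements in the
# XHTML namespace, so ElementTree tags carry this prefix.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def paragraph_texts(markup):
    """Hypothetical helper: return the leading text of every <p> element."""
    # html5lib.parse returns the root <html> element of an xml.etree tree.
    document = html5lib.parse(markup)
    return [p.text for p in document.iter(XHTML_NS + "p")]


def parses_cleanly(markup):
    """Hypothetical helper: True if a strict parser reports no parse error."""
    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parse(markup)
    except ParseError:
        # In strict mode the first parse error is raised as an exception.
        return False
    return True


print(paragraph_texts("<p>Hello World!"))
print(parses_cleanly("<p>Hello World!"))
```

The fragment from the README has no doctype, so the lenient default parser still builds a tree from it while the strict parser is expected to reject it.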
@@ -70,28 +105,21 @@ Please report any bugs on the `issue tracker
Tests
-----

- These are contained in the html5lib-tests repository and included as a
- submodule, thus for git checkouts they must be initialized (for
- release tarballs this is unneeded)::
+ Unit tests require the ``nose`` library and can be run using the
+ ``nosetests`` command in the root directory. All should pass.
+
+ Test data are contained in a separate `html5lib-tests
+ <https://github.com/html5lib/html5lib-tests>`_ repository and included
+ as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

- And then they can be run, with ``nose`` installed, using the
- ``nosetests`` command in the root directory. All should pass.
+ This is unneeded for release tarballs.

If you have all compatible Python implementations available on your
- system, you can run tests on all of them by using tox::
-
-   $ pip install tox
-   $ tox
-   ...
-   _______________________ summary ______________________
-   py26: commands succeeded
-   py27: commands succeeded
-   py32: commands succeeded
-   py33: commands succeeded
-   congratulations :)
+ system, you can run tests on all of them using the ``tox`` utility,
+ which can be found on PyPI.


Contributing
@@ -121,5 +149,5 @@ Questions?

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
- though you may have more success (and get a far quicker response)
- asking on IRC in #whatwg on irc.freenode.net.
+ though you may get a quicker response asking on IRC in #whatwg on
+ irc.freenode.net.