@@ -4,22 +4,25 @@ The moving parts
44html5lib consists of a number of components, which are responsible for
55handling its features.
66
7+ Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
8+ Several tree representations are supported, as are translations to other formats via *tree adapters*.
9+ The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
10+ The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
711
812Tree builders
913-------------
1014
1115The parser reads HTML by tokenizing the content and building a tree that
12- the user can later access. There are three main types of trees that
13- html5lib can build:
16+ the user can later access. html5lib can build three types of trees:
1417
15- * ``etree`` - this is the default; builds a tree based on ``xml.etree``,
18+ * ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
1619 which can be found in the standard library. Whenever possible, the
1720 accelerated ``ElementTree`` implementation (i.e.
1821 ``xml.etree.cElementTree`` on Python 2.x) is used.
1922
20- * ``dom`` - builds a tree based on ``xml.dom.minidom``.
23+ * ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
2124
22- * ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
25+ * ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
2326 API. The performance gains are relatively small compared to using the
2427 accelerated ``ElementTree`` module.
2528
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
3134  with open("mydocument.html", "rb") as f:
3235      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
3336
34- When instantiating a parser object, you have to pass a tree builder
35- class in the ``tree`` keyword attribute:
37+ To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
3638
37- .. code-block:: python
38-
39-   import html5lib
40-   parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
41-   document = parser.parse("<p>Hello World!")
42-
43- To get a builder class by name, use the ``getTreeBuilder`` function:
39+ When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
4440
4541.. code-block:: python
4642
4743  import html5lib
48-   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
44+   TreeBuilder = html5lib.getTreeBuilder("dom")
45+   parser = html5lib.HTMLParser(tree=TreeBuilder)
4946  minidom_document = parser.parse("<p>Hello World!")
5047
5148 The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,16 @@ The implementation of builders can be found in `html5lib/treebuilders/
5552Tree walkers
5653------------
5754
58- Once a tree is ready, you can work on it either manually, or using
59- a tree walker, which provides a streaming view of the tree. html5lib
60- provides walkers for all three supported types of trees (``etree``,
61- ``dom`` and ``lxml``).
55+ In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
56+ html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
6257
6358The implementation of walkers can be found in `html5lib/treewalkers/
5964<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
6560
66- Walkers make consuming HTML easier. html5lib uses them to provide you
67- with has a couple of handy tools.
61+ html5lib provides a few tools for consuming token streams:
6862
63+ * :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
64+ * filters, to manipulate the token stream.
6965
7066HTMLSerializer
7167~~~~~~~~~~~~~~
@@ -90,15 +86,14 @@ The serializer lets you write HTML back as a stream of bytes.
9086 '>'
9187 'Witam wszystkich'
9288
93- You can customize the serializer behaviour in a variety of ways, consult
94- the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
95- documentation.
89+ You can customize the serializer behaviour in a variety of ways. Consult
90+ the :class:`~html5lib.serializer.HTMLSerializer` documentation.
9691
9792
9893Filters
9994~~~~~~~
10095
101- You can alter the stream content with filters provided by html5lib:
96+ html5lib provides several filters:
10297
10398* :class:`alphabeticalattributes.Filter
10499 <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
110105 the document
111106
112107* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
113- ``LintError`` exceptions on invalid tag and attribute names, invalid
108+ :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
114109 PCDATA, etc.
115110
116111* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
117- removes tags from the stream which are not necessary to produce valid
112+ removes tags from the token stream which are not necessary to produce valid
118113 HTML
119114
120115* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +120,9 @@ You can alter the stream content with filters provided by html5lib:
125120
126121* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
127122 collapses all whitespace characters to single spaces unless they're in
128- ``<pre/>`` or ``textarea`` tags.
123+ ``<pre/>`` or ``<textarea/>`` tags.
129124
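Conceptually, a filter is an iterable that wraps another token stream and yields tokens that may have been modified. The following is a minimal standard-library sketch of that idea, not html5lib's own implementation: ``collapse_whitespace`` is a hypothetical generator, and the token dictionaries (with ``type``, ``name`` and ``data`` keys) are simplified stand-ins assumed here for illustration.

```python
import re

# Runs of ASCII whitespace that should collapse to a single space.
SPACES = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(tokens, preserve=("pre", "textarea")):
    """Yield tokens, collapsing whitespace outside whitespace-preserving elements.

    Hypothetical helper: the token shape is an assumption for this sketch.
    """
    depth = 0  # nesting depth inside a preserved element
    for token in tokens:
        kind = token["type"]
        if kind == "StartTag" and (depth or token["name"] in preserve):
            depth += 1
        elif kind == "EndTag" and depth:
            depth -= 1
        elif kind == "Characters" and not depth:
            # Copy the token instead of mutating the caller's data.
            token = dict(token, data=SPACES.sub(" ", token["data"]))
        yield token


stream = [
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "Hello   \n world"},
    {"type": "EndTag", "name": "p"},
    {"type": "StartTag", "name": "pre"},
    {"type": "Characters", "data": "keep   this"},
    {"type": "EndTag", "name": "pre"},
]
filtered = list(collapse_whitespace(stream))
```

Because the wrapper is itself an iterable of tokens, wrappers of this kind can be chained, each consuming the output of the previous one.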
130- To use a filter, simply wrap it around a stream:
125+ To use a filter, simply wrap it around a token stream:
131126
132127.. code-block:: python
133128
@@ -142,9 +137,11 @@ To use a filter, simply wrap it around a stream:
142137Tree adapters
143138-------------
144139
145- Used to translate one type of tree to another. More documentation
146- pending, sorry.
140+ Tree adapters can be used to translate between tree formats.
141+ Two adapters are provided by html5lib:
147142
143+ * :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
144+ * :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
148145
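To illustrate what "calling a SAX handler based on the tree" means, here is a minimal standard-library sketch that walks an ``xml.dom.minidom`` tree and drives a ``ContentHandler``. It is an assumption-laden illustration, not html5lib's adapter: ``dom_to_sax`` and ``EventRecorder`` are hypothetical names, and the sketch ignores namespaces, comments and doctypes.

```python
from xml.dom import minidom
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class EventRecorder(ContentHandler):
    """Hypothetical handler that records the SAX events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def dom_to_sax(node, handler):
    """Recursively translate a minidom node into SAX handler calls."""
    if node.nodeType == node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            dom_to_sax(child, handler)
        handler.endDocument()
    elif node.nodeType == node.ELEMENT_NODE:
        attrs = AttributesImpl(dict(node.attributes.items()))
        handler.startElement(node.tagName, attrs)
        for child in node.childNodes:
            dom_to_sax(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        handler.characters(node.data)


document = minidom.parseString("<p class='x'>Hello <b>World</b>!</p>")
recorder = EventRecorder()
dom_to_sax(document, recorder)
```

Any object implementing the ``ContentHandler`` interface can take the place of the recorder, which is what makes SAX a convenient interchange point between tree formats.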
149146Encoding discovery
150147------------------
@@ -156,14 +153,14 @@ the following way:
156153* The encoding may be explicitly specified by passing the name of the
157154 encoding as the encoding parameter to the
158155 :meth:`~html5lib.html5parser.HTMLParser.parse` method on
159- ``HTMLParser`` objects.
156+ :class:`~html5lib.html5parser.HTMLParser` objects.
160157
161158* If no encoding is specified, the parser will attempt to detect the
162159 encoding from a ``<meta>`` element in the first 512 bytes of the
163160 document (this is only a partial implementation of the current HTML
164- 5 specification).
161+ specification).
165162
166- * If no encoding can be found and the chardet library is available, an
163+ * If no encoding can be found and the :mod:`chardet` library is available, an
167164 attempt will be made to sniff the encoding from the byte pattern.
168165
169166* If all else fails, the default encoding will be used. This is usually