@@ -4,22 +4,25 @@ The moving parts
44html5lib consists of a number of components, which are responsible for
55handling its features.
66
7+ Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
8+ Several tree representations are supported, as are translations to other formats via *tree adapters*.
9+ The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
10+ The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
711
812Tree builders
913-------------
1014
1115The parser reads HTML by tokenizing the content and building a tree that
12- the user can later access. There are three main types of trees that
13- html5lib can build:
16+ the user can later access. html5lib can build three types of trees:
1417
15- * ``etree`` - this is the default; builds a tree based on ``xml.etree``,
18+ * ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
1619 which can be found in the standard library. Whenever possible, the
1720 accelerated ``ElementTree`` implementation (i.e.
1821 ``xml.etree.cElementTree`` on Python 2.x) is used.
1922
20- * ``dom`` - builds a tree based on ``xml.dom.minidom``.
23+ * ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
2124
22- * ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
25+ * ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
2326 API. The performance gains are relatively small compared to using the
2427 accelerated ``ElementTree`` module.
2528
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
3134with open("mydocument.html", "rb") as f:
3235    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
3336
34- When instantiating a parser object, you have to pass a tree builder
35- class in the ``tree`` keyword attribute:
37+ To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
3638
37- .. code-block:: python
38-
39- import html5lib
40- parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
41- document = parser.parse("<p>Hello World!")
42-
43- To get a builder class by name, use the ``getTreeBuilder`` function:
39+ When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
4440
4541.. code-block:: python
4642
4743import html5lib
48- parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
44+ TreeBuilder = html5lib.getTreeBuilder("dom")
45+ parser = html5lib.HTMLParser(tree=TreeBuilder)
4946 minidom_document = parser.parse("<p>Hello World!")
5047
5148 The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
5552Tree walkers
5653------------
5754
58- Once a tree is ready, you can work on it either manually, or using
59- a tree walker, which provides a streaming view of the tree. html5lib
60- provides walkers for all three supported types of trees (``etree``,
61- ``dom`` and ``lxml``).
55+ In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
56+ html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
6257
6358The implementation of walkers can be found in `html5lib/treewalkers/
6459<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
6560
66- Walkers make consuming HTML easier. html5lib uses them to provide you
67- with has a couple of handy tools.
68-
61+ html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.
6962
7063HTMLSerializer
7164~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
9083 '>'
9184 'Witam wszystkich'
9285
93- You can customize the serializer behaviour in a variety of ways, consult
94- the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
95- documentation.
86+ You can customize the serializer behaviour in a variety of ways. Consult
87+ the :class:`~html5lib.serializer.HTMLSerializer` documentation.
9688
9789
9890Filters
9991~~~~~~~
10092
101- You can alter the stream content with filters provided by html5lib:
93+ html5lib provides several filters:
10294
10395* :class:`alphabeticalattributes.Filter
10496 <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
110102 the document
111103
112104* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
113- ``LintError`` exceptions on invalid tag and attribute names, invalid
105+ :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
114106 PCDATA, etc.
115107
116108* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
117- removes tags from the stream which are not necessary to produce valid
109+ removes tags from the token stream which are not necessary to produce valid
118110 HTML
119111
120112* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:
125117
126118* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
127119 collapses all whitespace characters to single spaces unless they're in
128- ``<pre/>`` or ``textarea`` tags.
120+ ``<pre/>`` or ``<textarea/>`` tags.
129121
130- To use a filter, simply wrap it around a stream:
122+ To use a filter, simply wrap it around a token stream:
131123
132124.. code-block:: python
133125
@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
142134 Tree adapters
143135-------------
144136
145- Used to translate one type of tree to another. More documentation
146- pending, sorry.
137+ Tree adapters can be used to translate between tree formats.
138+ Two adapters are provided by html5lib:
147139
140+ * :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
141+ * :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
148142
149143Encoding discovery
150144------------------
@@ -156,54 +150,16 @@ the following way:
156150* The encoding may be explicitly specified by passing the name of the
157151 encoding as the encoding parameter to the
158152:meth:`~html5lib.html5parser.HTMLParser.parse` method on
159- ``HTMLParser`` objects.
153+ :class:`~html5lib.html5parser.HTMLParser` objects.
160154
161155* If no encoding is specified, the parser will attempt to detect the
162156 encoding from a ``<meta>`` element in the first 512 bytes of the
163157 document (this is only a partial implementation of the current HTML
164- 5 specification).
158+ specification).
165159
166- * If no encoding can be found and the chardet library is available, an
160+ * If no encoding can be found and the :mod:`chardet` library is available, an
167161 attempt will be made to sniff the encoding from the byte pattern.
168162
169163* If all else fails, the default encoding will be used. This is usually
170164 `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
171165 a common fallback used by Web browsers.
172-
173-
174- Tokenizers
175- ----------
176-
177- The part of the parser responsible for translating a raw input stream
178- into meaningful tokens is the tokenizer. Currently html5lib provides
179- two.
180-
181- To set up a tokenizer, simply pass it when instantiating
182- a :class:`~html5lib.html5parser.HTMLParser`:
183-
184- .. code-block:: python
185-
186- import html5lib
187- from html5lib import sanitizer
188-
189- p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
190- p.parse("<p>Surprise!<script>alert('Boo!');</script>")
191-
192- HTMLTokenizer
193- ~~~~~~~~~~~~~
194-
195- This is the default tokenizer, the heart of html5lib. The implementation
196- can be found in `html5lib/tokenizer.py
197- <https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
198-
199- HTMLSanitizer
200- ~~~~~~~~~~~~~
201-
202- This is a tokenizer that removes unsafe markup and CSS styles from the
203- input. Elements that are known to be safe are passed through and the
204- rest is converted to visible text. The default configuration of the
205- sanitizer follows the `WHATWG Sanitization Rules
206- <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
207-
208- The implementation can be found in `html5lib/sanitizer.py
209- <https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
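
The "Encoding discovery" list kept by this change describes a fallback order: an explicit encoding, then a ``<meta>`` element in the first 512 bytes, then ``chardet`` if it is installed, then a default. The sketch below only illustrates that order using the standard library; it is not html5lib's implementation. The names ``sniff_meta`` and ``discover_encoding`` and the regular expression are hypothetical, and the pattern is far looser than the HTML specification's prescan algorithm.

```python
import re

# Hypothetical, simplified pattern: finds "charset=..." inside a <meta> tag.
# The real prescan algorithm in the HTML specification is stricter.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta(data, limit=512):
    """Return a charset declared in a <meta> element within the first bytes."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def discover_encoding(data, override=None, default="windows-1252"):
    """Apply the documented order: explicit name, <meta>, chardet, default."""
    if override:
        return override  # 1. explicitly specified by the caller
    found = sniff_meta(data)
    if found:
        return found  # 2. declared in a <meta> element
    try:
        import chardet  # 3. optional third-party library, may be absent
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(data).get("encoding")
        if guess:
            return guess.lower()
    return default  # 4. common browser fallback
```

Calling ``discover_encoding(raw_bytes, override="utf-8")`` short-circuits at the first step, which mirrors passing an encoding name to the parser explicitly.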