@@ -4,22 +4,25 @@ The moving parts
  html5lib consists of a number of components, which are responsible for
  handling its features.

+ Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
+ Several tree representations are supported, as are translations to other formats via *tree adapters*.
+ The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
+ The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

  Tree builders
  -------------

  The parser reads HTML by tokenizing the content and building a tree that
- the user can later access. There are three main types of trees that
- html5lib can build:
+ the user can later access. html5lib can build three types of trees:

- * ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+ * ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
    which can be found in the standard library. Whenever possible, the
    accelerated ``ElementTree`` implementation (i.e.
    ``xml.etree.cElementTree`` on Python 2.x) is used.

- * ``dom`` - builds a tree based on ``xml.dom.minidom``.
+ * ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

- * ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+ * ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
    API. The performance gains are relatively small compared to using the
    accelerated ``ElementTree`` module.

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
    with open("mydocument.html", "rb") as f:
        lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

- When instantiating a parser object, you have to pass a tree builder
- class in the ``tree`` keyword attribute:
+ To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

- .. code-block:: python
-
-   import html5lib
-   parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
-   document = parser.parse("<p>Hello World!")
-
- To get a builder class by name, use the ``getTreeBuilder`` function:
+ When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:

  .. code-block:: python

    import html5lib
-   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+   TreeBuilder = html5lib.getTreeBuilder("dom")
+   parser = html5lib.HTMLParser(tree=TreeBuilder)
    minidom_document = parser.parse("<p>Hello World!")

  The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
  Tree walkers
  ------------

- Once a tree is ready, you can work on it either manually, or using
- a tree walker, which provides a streaming view of the tree. html5lib
- provides walkers for all three supported types of trees (``etree``,
- ``dom`` and ``lxml``).
+ In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
+ html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

  The implementation of walkers can be found in `html5lib/treewalkers/
  <https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

- Walkers make consuming HTML easier. html5lib uses them to provide you
- with has a couple of handy tools.
-
+ html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.

  HTMLSerializer
  ~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
    '>'
    'Witam wszystkich'

- You can customize the serializer behaviour in a variety of ways, consult
- the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
- documentation.
+ You can customize the serializer behaviour in a variety of ways. Consult
+ the :class:`~html5lib.serializer.HTMLSerializer` documentation.


  Filters
  ~~~~~~~

- You can alter the stream content with filters provided by html5lib:
+ html5lib provides several filters:

  * :class:`alphabeticalattributes.Filter
    <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
    the document

  * :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
-   ``LintError`` exceptions on invalid tag and attribute names, invalid
+   :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
    PCDATA, etc.

  * :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
-   removes tags from the stream which are not necessary to produce valid
+   removes tags from the token stream which are not necessary to produce valid
    HTML

  * :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:

  * :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
    collapses all whitespace characters to single spaces unless they're in
-   ``<pre/>`` or ``textarea`` tags.
+   ``<pre/>`` or ``<textarea/>`` tags.

- To use a filter, simply wrap it around a stream:
+ To use a filter, simply wrap it around a token stream:

  .. code-block:: python

@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
  Tree adapters
  -------------

- Used to translate one type of tree to another. More documentation
- pending, sorry.
+ Tree adapters can be used to translate between tree formats.
+ Two adapters are provided by html5lib:

+ * :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
+ * :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.

  Encoding discovery
  ------------------
@@ -156,54 +150,16 @@ the following way:
  * The encoding may be explicitly specified by passing the name of the
    encoding as the encoding parameter to the
    :meth:`~html5lib.html5parser.HTMLParser.parse` method on
-   ``HTMLParser`` objects.
+   :class:`~html5lib.html5parser.HTMLParser` objects.

  * If no encoding is specified, the parser will attempt to detect the
    encoding from a ``<meta>`` element in the first 512 bytes of the
    document (this is only a partial implementation of the current HTML
-   5 specification).
+   specification).

- * If no encoding can be found and the chardet library is available, an
+ * If no encoding can be found and the :mod:`chardet` library is available, an
    attempt will be made to sniff the encoding from the byte pattern.

  * If all else fails, the default encoding will be used. This is usually
    `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
    a common fallback used by Web browsers.
-
-
- Tokenizers
- ----------
-
- The part of the parser responsible for translating a raw input stream
- into meaningful tokens is the tokenizer. Currently html5lib provides
- two.
-
- To set up a tokenizer, simply pass it when instantiating
- a :class:`~html5lib.html5parser.HTMLParser`:
-
- .. code-block:: python
-
-   import html5lib
-   from html5lib import sanitizer
-
-   p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
-   p.parse("<p>Surprise!<script>alert('Boo!');</script>")
-
- HTMLTokenizer
- ~~~~~~~~~~~~~
-
- This is the default tokenizer, the heart of html5lib. The implementation
- can be found in `html5lib/tokenizer.py
- <https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
-
- HTMLSanitizer
- ~~~~~~~~~~~~~
-
- This is a tokenizer that removes unsafe markup and CSS styles from the
- input. Elements that are known to be safe are passed through and the
- rest is converted to visible text. The default configuration of the
- sanitizer follows the `WHATWG Sanitization Rules
- <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
-
- The implementation can be found in `html5lib/sanitizer.py
- <https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
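
The lookup order described under "Encoding discovery" (explicit encoding, ``<meta>`` prescan of the first 512 bytes, optional chardet sniffing, Windows-1252 fallback) can be sketched roughly as follows. This is only an illustration of the documented order, not html5lib's actual implementation: ``guess_encoding``, its parameters, and the simplified ``META_CHARSET`` regular expression are all hypothetical names introduced here, and a real prescan handles many more cases than this pattern does.

```python
import re

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
# The real HTML prescan algorithm is considerably more involved.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def guess_encoding(data, explicit=None, default="windows-1252", use_chardet=True):
    """Illustrative helper (not part of html5lib) mirroring the documented order."""
    # 1. An explicitly supplied encoding always wins.
    if explicit:
        return explicit

    # 2. Look for a <meta> charset declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. Sniff the byte pattern, but only if chardet happens to be installed.
    if use_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            detected = chardet.detect(data).get("encoding")
            if detected:
                return detected.lower()

    # 4. Fall back to the default encoding.
    return default
```

For instance, ``guess_encoding(b'<meta charset="iso-8859-2"><p>x')`` picks up the declared charset, while a declaration that appears only after the first 512 bytes is ignored by step 2 and the result then depends on whether chardet is available.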