Commit 637826f

Update and expand "moving parts" doc
1 parent c8fca0e, commit 637826f

1 file changed, +31 -34 lines changed

doc/movingparts.rst

Lines changed: 31 additions & 34 deletions
@@ -4,22 +4,25 @@ The moving parts
44
html5lib consists of a number of components, which are responsible for
55
handling its features.
66

7+
Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
8+
Several tree representations are supported, as are translations to other formats via *tree adapters*.
9+
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
10+
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
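
As a rough illustration of how these parts fit together, the sketch below parses a fragment, walks the resulting tree, passes the token stream through the whitespace filter, and serializes the result. It assumes a recent html5lib release is installed; the input string and variable names are illustrative only.

```python
import html5lib
from html5lib.filters import whitespace
from html5lib.serializer import HTMLSerializer

# Tree builder: parse() uses the default ``etree`` builder.
tree = html5lib.parse("<p>Hello     <b>World</b>!</p>")

# Tree walker: turn the tree into a stream of tokens.
walker = html5lib.getTreeWalker("etree")
stream = walker(tree)

# Filter: collapse runs of whitespace in the token stream.
stream = whitespace.Filter(stream)

# Serializer: render() joins the serialized chunks into one string.
html = HTMLSerializer().render(stream)
print(html)
```

Each stage only consumes the output of the previous one, so a filter can be added or dropped without touching the parsing or serializing code.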
711

812
Tree builders
913
-------------
1014

1115
The parser reads HTML by tokenizing the content and building a tree that
12-
the user can later access. There are three main types of trees that
13-
html5lib can build:
16+
the user can later access. html5lib can build three types of trees:
1417

15-
* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
18+
* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
1619
which can be found in the standard library. Whenever possible, the
1720
accelerated ``ElementTree`` implementation (i.e.
1821
``xml.etree.cElementTree`` on Python 2.x) is used.
1922

20-
* ``dom`` - builds a tree based on ``xml.dom.minidom``.
23+
* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
2124

22-
* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
25+
* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
2326
API. The performance gains are relatively small compared to using the
2427
accelerated ``ElementTree`` module.
2528

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
3134
with open("mydocument.html", "rb") as f:
3235
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
3336
34-
When instantiating a parser object, you have to pass a tree builder
35-
class in the ``tree`` keyword attribute:
37+
To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
3638

37-
.. code-block:: python
38-
39-
import html5lib
40-
parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
41-
document = parser.parse("<p>Hello World!")
42-
43-
To get a builder class by name, use the ``getTreeBuilder`` function:
39+
When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword argument:
4440

4541
.. code-block:: python
4642
4743
import html5lib
48-
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
44+
TreeBuilder = html5lib.getTreeBuilder("dom")
45+
parser = html5lib.HTMLParser(tree=TreeBuilder)
4946
minidom_document = parser.parse("<p>Hello World!")
5047
5148
The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,16 @@ The implementation of builders can be found in `html5lib/treebuilders/
5552
Tree walkers
5653
------------
5754

58-
Once a tree is ready, you can work on it either manually, or using
59-
a tree walker, which provides a streaming view of the tree. html5lib
60-
provides walkers for all three supported types of trees (``etree``,
61-
``dom`` and ``lxml``).
55+
In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
56+
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
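
To make the "streaming view" concrete, the following sketch collects the tokens a walker yields for a small document. It assumes the default ``etree`` tree and a recent html5lib release; the sample markup is made up for illustration.

```python
import html5lib

document = html5lib.parse("<p>Hello <b>World</b>!")

# getTreeWalker returns a walker class; instantiate it with the tree.
walker = html5lib.getTreeWalker("etree")
tokens = list(walker(document))

# Each token is a dict whose "type" key is e.g. StartTag, EndTag,
# Characters or SpaceCharacters.
for token in tokens:
    print(token["type"], token.get("name", token.get("data")))
```

Because the walker is a plain iterable of dictionaries, anything that accepts an iterable (a filter, the serializer, or your own loop) can consume it.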
6257

6358
The implementation of walkers can be found in `html5lib/treewalkers/
6459
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
6560

66-
Walkers make consuming HTML easier. html5lib uses them to provide you
67-
with has a couple of handy tools.
61+
html5lib provides a few tools for consuming token streams:
6862

63+
* :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
64+
* filters, to manipulate the token stream.
6965

7066
HTMLSerializer
7167
~~~~~~~~~~~~~~
@@ -90,15 +86,14 @@ The serializer lets you write HTML back as a stream of bytes.
9086
'>'
9187
'Witam wszystkich'
9288
93-
You can customize the serializer behaviour in a variety of ways, consult
94-
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
95-
documentation.
89+
You can customize the serializer behaviour in a variety of ways. Consult
90+
the :class:`~html5lib.serializer.HTMLSerializer` documentation.
9691

9792

9893
Filters
9994
~~~~~~~
10095

101-
You can alter the stream content with filters provided by html5lib:
96+
html5lib provides several filters:
10297

10398
* :class:`alphabeticalattributes.Filter
10499
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
110105
the document
111106

112107
* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
113-
``LintError`` exceptions on invalid tag and attribute names, invalid
108+
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
114109
PCDATA, etc.
115110

116111
* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
117-
removes tags from the stream which are not necessary to produce valid
112+
removes tags from the token stream which are not necessary to produce valid
118113
HTML
119114

120115
* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +120,9 @@ You can alter the stream content with filters provided by html5lib:
125120

126121
* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
127122
collapses all whitespace characters to single spaces unless they're in
128-
``<pre/>`` or ``textarea`` tags.
123+
``<pre/>`` or ``<textarea/>`` tags.
129124

130-
To use a filter, simply wrap it around a stream:
125+
To use a filter, simply wrap it around a token stream:
131126

132127
.. code-block:: python
133128
@@ -142,9 +137,11 @@ To use a filter, simply wrap it around a stream:
142137
Tree adapters
143138
-------------
144139

145-
Used to translate one type of tree to another. More documentation
146-
pending, sorry.
140+
Tree adapters can be used to translate between tree formats.
141+
Two adapters are provided by html5lib:
147142

143+
* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
144+
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
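
A minimal sketch of the SAX adapter follows. It assumes that ``to_sax`` takes a tree walker and a ``xml.sax`` content handler, as in recent html5lib releases; the ``TagCollector`` handler is a hypothetical helper written only for this example.

```python
from xml.sax.handler import ContentHandler

import html5lib
from html5lib.treeadapters import sax


class TagCollector(ContentHandler):
    """Hypothetical handler that records element names and text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace, local name) pair.
        self.tags.append(name[1])

    def characters(self, content):
        self.text.append(content)


document = html5lib.parse("<p>Hello <b>World</b>!")
walker = html5lib.getTreeWalker("etree")

handler = TagCollector()
sax.to_sax(walker(document), handler)

print(handler.tags)
print("".join(handler.text))
```

The adapter drives the handler from the walker's token stream, so the same handler works regardless of which tree type was built.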
148145

149146
Encoding discovery
150147
------------------
@@ -156,14 +153,14 @@ the following way:
156153
* The encoding may be explicitly specified by passing the name of the
157154
encoding as the encoding parameter to the
158155
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
159-
``HTMLParser`` objects.
156+
:class:`~html5lib.html5parser.HTMLParser` objects.
160157

161158
* If no encoding is specified, the parser will attempt to detect the
162159
encoding from a ``<meta>`` element in the first 512 bytes of the
163160
document (this is only a partial implementation of the current HTML
164-
5 specification).
161+
specification).
165162

166-
* If no encoding can be found and the chardet library is available, an
163+
* If no encoding can be found and the :mod:`chardet` library is available, an
167164
attempt will be made to sniff the encoding from the byte pattern.
168165

169166
* If all else fails, the default encoding will be used. This is usually
