Commit 69606e5

Merge pull request html5lib#332 from twm/update-docs
Update docs
2 parents 9f9dfdb + deb98bb, commit 69606e5

8 files changed, 82 additions and 94 deletions

‎AUTHORS.rst

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -45,3 +45,4 @@ Patches and suggestions
4545
- Jon Dufresne
4646
- Ville Skyttä
4747
- Jonathan Vanasco
48+
- Tom Most

‎CHANGES.rst

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ Released on July 14, 2016
3232

3333
* Cease supporting DATrie under PyPy.
3434

35-
* **Remove ``PullDOM`` support, as this hasn't ever been properly
35+
* **Remove PullDOM support, as this hasn't ever been properly
3636
tested, doesn't entirely work, and as far as I can tell is
3737
completely unused by anyone.**
3838

@@ -70,7 +70,7 @@ Released on July 14, 2016
7070
to clarify their status as public.**
7171

7272
* **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
73-
sanitizer.htmlsanitizer module and move that to saniziter. This means
73+
sanitizer.htmlsanitizer module and move that to sanitizer. This means
7474
anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
7575
code changes.**
7676

‎doc/html5lib.rst

Lines changed: 4 additions & 8 deletions
@@ -1,13 +1,8 @@
11
html5lib Package
22
================
33

4-
:mod:`html5lib` Package
5-
-----------------------
6-
7-
.. automodule:: html5lib.__init__
8-
:members:
9-
:undoc-members:
10-
:show-inheritance:
4+
.. automodule:: html5lib
5+
:members: __version__
116

127
:mod:`constants` Module
138
-----------------------
@@ -26,7 +21,7 @@ html5lib Package
2621
:show-inheritance:
2722

2823
:mod:`serializer` Module
29-
----------------------
24+
------------------------
3025

3126
.. automodule:: html5lib.serializer
3227
:members:
@@ -41,4 +36,5 @@ Subpackages
4136
html5lib.filters
4237
html5lib.treebuilders
4338
html5lib.treewalkers
39+
html5lib.treeadapters
4440

‎doc/html5lib.treeadapters.rst

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
treebuilders Package
2+
====================
3+
4+
:mod:`~html5lib.treeadapters` Package
5+
-------------------------------------
6+
7+
.. automodule:: html5lib.treeadapters
8+
:members:
9+
:undoc-members:
10+
:show-inheritance:
11+
12+
.. automodule:: html5lib.treeadapters.genshi
13+
:members:
14+
:undoc-members:
15+
:show-inheritance:
16+
17+
.. automodule:: html5lib.treeadapters.sax
18+
:members:
19+
:undoc-members:
20+
:show-inheritance:

‎doc/html5lib.treewalkers.rst

Lines changed: 4 additions & 4 deletions
@@ -10,7 +10,7 @@ treewalkers Package
1010
:show-inheritance:
1111

1212
:mod:`base` Module
13-
-------------------
13+
------------------
1414

1515
.. automodule:: html5lib.treewalkers.base
1616
:members:
@@ -34,7 +34,7 @@ treewalkers Package
3434
:show-inheritance:
3535

3636
:mod:`etree_lxml` Module
37-
-----------------------
37+
------------------------
3838

3939
.. automodule:: html5lib.treewalkers.etree_lxml
4040
:members:
@@ -43,9 +43,9 @@ treewalkers Package
4343

4444

4545
:mod:`genshi` Module
46-
--------------------------
46+
--------------------
4747

4848
.. automodule:: html5lib.treewalkers.genshi
4949
:members:
5050
:undoc-members:
51-
:show-inheritance:
51+
:show-inheritance:

‎doc/movingparts.rst

Lines changed: 29 additions & 73 deletions
@@ -4,22 +4,25 @@ The moving parts
44
html5lib consists of a number of components, which are responsible for
55
handling its features.
66

7+
Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
8+
Several tree representations are supported, as are translations to other formats via *tree adapters*.
9+
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
10+
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
711

812
Tree builders
913
-------------
1014

1115
The parser reads HTML by tokenizing the content and building a tree that
12-
the user can later access. There are three main types of trees that
13-
html5lib can build:
16+
the user can later access. html5lib can build three types of trees:
1417

15-
* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
18+
* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
1619
which can be found in the standard library. Whenever possible, the
1720
accelerated ``ElementTree`` implementation (i.e.
1821
``xml.etree.cElementTree`` on Python 2.x) is used.
1922

20-
* ``dom`` - builds a tree based on ``xml.dom.minidom``.
23+
* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
2124

22-
* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
25+
* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
2326
API. The performance gains are relatively small compared to using the
2427
accelerated ``ElementTree`` module.
2528

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
3134
    with open("mydocument.html", "rb") as f:
3235
        lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
3336
34-
When instantiating a parser object, you have to pass a tree builder
35-
class in the ``tree`` keyword attribute:
37+
To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
3638

37-
.. code-block:: python
38-
39-
    import html5lib
40-
    parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
41-
    document = parser.parse("<p>Hello World!")
42-
43-
To get a builder class by name, use the ``getTreeBuilder`` function:
39+
When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
4440

4541
.. code-block:: python
4642
4743
    import html5lib
48-
    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
44+
    TreeBuilder = html5lib.getTreeBuilder("dom")
45+
    parser = html5lib.HTMLParser(tree=TreeBuilder)
4946
    minidom_document = parser.parse("<p>Hello World!")
5047
5148
The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
5552
Tree walkers
5653
------------
5754

58-
Once a tree is ready, you can work on it either manually, or using
59-
a tree walker, which provides a streaming view of the tree. html5lib
60-
provides walkers for all three supported types of trees (``etree``,
61-
``dom`` and ``lxml``).
55+
In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
56+
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
6257

6358
The implementation of walkers can be found in `html5lib/treewalkers/
6459
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
6560

66-
Walkers make consuming HTML easier. html5lib uses them to provide you
67-
with has a couple of handy tools.
68-
61+
html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.
6962

7063
HTMLSerializer
7164
~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
9083
'>'
9184
'Witam wszystkich'
9285
93-
You can customize the serializer behaviour in a variety of ways, consult
94-
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
95-
documentation.
86+
You can customize the serializer behaviour in a variety of ways. Consult
87+
the :class:`~html5lib.serializer.HTMLSerializer` documentation.
9688

9789

9890
Filters
9991
~~~~~~~
10092

101-
You can alter the stream content with filters provided by html5lib:
93+
html5lib provides several filters:
10294

10395
* :class:`alphabeticalattributes.Filter
10496
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
110102
the document
111103

112104
* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
113-
``LintError`` exceptions on invalid tag and attribute names, invalid
105+
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
114106
PCDATA, etc.
115107

116108
* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
117-
removes tags from the stream which are not necessary to produce valid
109+
removes tags from the token stream which are not necessary to produce valid
118110
HTML
119111

120112
* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:
125117

126118
* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
127119
collapses all whitespace characters to single spaces unless they're in
128-
``<pre/>`` or ``textarea`` tags.
120+
``<pre/>`` or ``<textarea/>`` tags.
129121

130-
To use a filter, simply wrap it around a stream:
122+
To use a filter, simply wrap it around a token stream:
131123

132124
.. code-block:: python
133125
@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
142134
Tree adapters
143135
-------------
144136

145-
Used to translate one type of tree to another. More documentation
146-
pending, sorry.
137+
Tree adapters can be used to translate between tree formats.
138+
Two adapters are provided by html5lib:
147139

140+
* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
141+
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
148142

149143
Encoding discovery
150144
------------------
@@ -156,54 +150,16 @@ the following way:
156150
* The encoding may be explicitly specified by passing the name of the
157151
encoding as the encoding parameter to the
158152
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
159-
``HTMLParser`` objects.
153+
:class:`~html5lib.html5parser.HTMLParser` objects.
160154

161155
* If no encoding is specified, the parser will attempt to detect the
162156
encoding from a ``<meta>`` element in the first 512 bytes of the
163157
document (this is only a partial implementation of the current HTML
164-
5 specification).
158+
specification).
165159

166-
* If no encoding can be found and the chardet library is available, an
160+
* If no encoding can be found and the :mod:`chardet` library is available, an
167161
attempt will be made to sniff the encoding from the byte pattern.
168162

169163
* If all else fails, the default encoding will be used. This is usually
170164
`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
171165
a common fallback used by Web browsers.
172-
173-
174-
Tokenizers
175-
----------
176-
177-
The part of the parser responsible for translating a raw input stream
178-
into meaningful tokens is the tokenizer. Currently html5lib provides
179-
two.
180-
181-
To set up a tokenizer, simply pass it when instantiating
182-
a :class:`~html5lib.html5parser.HTMLParser`:
183-
184-
.. code-block:: python
185-
186-
    import html5lib
187-
    from html5lib import sanitizer
188-
189-
    p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
190-
    p.parse("<p>Surprise!<script>alert('Boo!');</script>")
191-
192-
HTMLTokenizer
193-
~~~~~~~~~~~~~
194-
195-
This is the default tokenizer, the heart of html5lib. The implementation
196-
can be found in `html5lib/tokenizer.py
197-
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
198-
199-
HTMLSanitizer
200-
~~~~~~~~~~~~~
201-
202-
This is a tokenizer that removes unsafe markup and CSS styles from the
203-
input. Elements that are known to be safe are passed through and the
204-
rest is converted to visible text. The default configuration of the
205-
sanitizer follows the `WHATWG Sanitization Rules
206-
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
207-
208-
The implementation can be found in `html5lib/sanitizer.py
209-
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
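
Review note: the fallback order listed under "Encoding discovery" in doc/movingparts.rst can be sketched in plain Python. This is only an illustration under stated assumptions, not html5lib's implementation: ``pick_encoding``, ``known``, and the ``sniff`` callback (a stand-in for chardet) are hypothetical names, and the ``<meta>`` pattern is much simpler than a real prescan.

```python
import codecs
import re

# Simplified stand-in for a <meta> prescan (assumption: the real rules are stricter).
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_:.-]+)", re.IGNORECASE
)


def known(name):
    """Return name if Python has a codec registered for it, else None."""
    try:
        codecs.lookup(name)
    except LookupError:
        return None
    return name


def pick_encoding(data, explicit=None, sniff=None, default="windows-1252"):
    """Choose an encoding for the bytes in data, in the documented order."""
    # 1. An explicitly specified encoding wins.
    if explicit and known(explicit):
        return explicit
    # 2. Otherwise look for a <meta> declaration in the first 512 bytes.
    match = META_CHARSET.search(data[:512])
    if match:
        name = known(match.group(1).decode("ascii"))
        if name:
            return name
    # 3. Otherwise ask an optional sniffer (hypothetical stand-in for chardet).
    if sniff is not None:
        guess = sniff(data)
        if guess and known(guess):
            return guess
    # 4. If all else fails, use the default.
    return default
```

Passing the sniffer as a callback keeps the sketch free of third-party imports, mirroring the "if the library is available" condition in the docs.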

‎html5lib/__init__.py

Lines changed: 17 additions & 7 deletions
@@ -1,14 +1,23 @@
11
"""
2-
HTML parsing library based on the WHATWG "HTML5"
3-
specification. The parser is designed to be compatible with existing
4-
HTML found in the wild and implements well-defined error recovery that
2+
HTML parsing library based on the `WHATWG HTML specification
3+
<https://whatwg.org/html>`_. The parser is designed to be compatible with
4+
existing HTML found in the wild and implements well-defined error recovery that
55
is largely compatible with modern desktop web browsers.
66
7-
Example usage:
7+
Example usage::
88
9-
import html5lib
10-
f = open("my_document.html")
11-
tree = html5lib.parse(f)
9+
    import html5lib
10+
    with open("my_document.html", "rb") as f:
11+
        tree = html5lib.parse(f)
12+
13+
For convenience, this module re-exports the following names:
14+
15+
* :func:`~.html5parser.parse`
16+
* :func:`~.html5parser.parseFragment`
17+
* :class:`~.html5parser.HTMLParser`
18+
* :func:`~.treebuilders.getTreeBuilder`
19+
* :func:`~.treewalkers.getTreeWalker`
20+
* :func:`~.serializer.serialize`
1221
"""
1322

1423
from __future__ import absolute_import, division, unicode_literals
@@ -22,4 +31,5 @@
2231
"getTreeWalker", "serialize"]
2332

2433
# this has to be at the top level, see how setup.py parses this
34+
#: Distribution version number.
2535
__version__ = "0.9999999999-dev"
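
Review note: a short sketch of how two of the names re-exported by the new docstring might be used together. It assumes html5lib is installed; ``round_trip`` is a hypothetical helper, and the import guard exists only so the snippet still loads where the package is missing.

```python
try:
    import html5lib  # third-party package
except ImportError:  # keep the sketch importable without html5lib
    html5lib = None


def round_trip(markup):
    """Parse markup and serialize it back; returns None without html5lib."""
    if html5lib is None:
        return None
    # parse() uses the default "etree" tree builder.
    tree = html5lib.parse(markup)
    # serialize() walks the tree and renders the resulting token stream.
    return html5lib.serialize(tree, tree="etree")


html = round_trip("<p>Hello World!")
```

The ``tree="etree"`` argument names the tree walker and has to match the builder that produced the tree.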

‎tox.ini

Lines changed: 5 additions & 0 deletions
@@ -11,7 +11,12 @@ deps =
1111
base: webencodings
1212
py26-base: ordereddict
1313
optional: -r{toxinidir}/requirements-optional.txt
14+
doc: Sphinx
1415

1516
commands =
1617
{envbindir}/py.test {posargs}
1718
{toxinidir}/flake8-run.sh
19+
20+
[testenv:doc]
21+
changedir = doc
22+
commands = sphinx-build -b html . _build
