Update docs #332

Merged
willkg merged 22 commits into html5lib:master from twm:update-docs
Nov 6, 2017

22 commits
224d9f4 Fix formatting of docstring example (twm, Apr 15, 2017)
3fb6af3 Use with, it's idiomatic (twm, Apr 15, 2017)
ba63e09 Fix typo in changelog (twm, Apr 15, 2017)
6b99d52 Export and document html5lib.__version__ (twm, Apr 15, 2017)
323d736 Add a documentation env to tox.ini (twm, Apr 15, 2017)
964d0e1 Clean up html5lib module documentation (twm, Apr 15, 2017)
abf6224 Remove docs for HTMLTokenizer and HTMLSanitizer (twm, Apr 15, 2017)
8554098 Fix Sphinx title underline warnings (twm, Apr 15, 2017)
c8fca0e Open in binary mode for Python 3 (twm, Apr 15, 2017)
637826f Update and expand "moving parts" doc (twm, Apr 15, 2017)
254fc90 Add treeadapters package doc (twm, Apr 15, 2017)
deb4206 Remove duplicate header (twm, Apr 15, 2017)
2909867 Link to the spec (twm, Apr 15, 2017)
739dcf0 Add myself to AUTHORS (twm, Apr 15, 2017)
cbaf304 Merge branch 'master' into update-docs (willkg, Oct 31, 2017)
fc69044 Merge branch 'master' into update-docs (willkg, Oct 31, 2017)
f25d7c0 Add missing colon (twm, Nov 2, 2017)
5eb89cc Rework token stream intro (twm, Nov 2, 2017)
d270666 Merge remote-tracking branch 'upstream/master' into update-docs (twm, Nov 2, 2017)
cb2702c Remove textual backticks in changelog (twm, Nov 2, 2017)
1084ed0 Asymptote no more (twm, Nov 2, 2017)
deb98bb Remove __version__ from __all__ (twm, Nov 4, 2017)
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -45,3 +45,4 @@ Patches and suggestions
- Jon Dufresne
- Ville Skyttä
- Jonathan Vanasco
- Tom Most
4 changes: 2 additions & 2 deletions CHANGES.rst
@@ -32,7 +32,7 @@ Released on July 14, 2016

* Cease supporting DATrie under PyPy.

* **Remove ``PullDOM`` support, as this hasn't ever been properly
* **Remove PullDOM support, as this hasn't ever been properly
tested, doesn't entirely work, and as far as I can tell is
completely unused by anyone.**

@@ -70,7 +70,7 @@ Released on July 14, 2016
to clarify their status as public.**

* **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
sanitizer.htmlsanitizer module and move that to saniziter. This means
sanitizer.htmlsanitizer module and move that to sanitizer. This means
anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
code changes.**

12 changes: 4 additions & 8 deletions doc/html5lib.rst
@@ -1,13 +1,8 @@
html5lib Package
================

:mod:`html5lib` Package
-----------------------
Contributor:

Why take the header out here?

twm (Contributor, Author), Nov 2, 2017, edited:

Otherwise there are two headers in a row with exactly the same text.


.. automodule:: html5lib.__init__
    :members:
    :undoc-members:
    :show-inheritance:
.. automodule:: html5lib
    :members: __version__

:mod:`constants` Module
-----------------------
@@ -26,7 +21,7 @@ html5lib Package
:show-inheritance:

:mod:`serializer` Module
----------------------
------------------------

.. automodule:: html5lib.serializer
    :members:
@@ -41,4 +36,5 @@ Subpackages
    html5lib.filters
    html5lib.treebuilders
    html5lib.treewalkers
    html5lib.treeadapters

20 changes: 20 additions & 0 deletions doc/html5lib.treeadapters.rst
@@ -0,0 +1,20 @@
treebuilders Package
====================

:mod:`~html5lib.treeadapters` Package
-------------------------------------

.. automodule:: html5lib.treeadapters
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: html5lib.treeadapters.genshi
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: html5lib.treeadapters.sax
    :members:
    :undoc-members:
    :show-inheritance:
8 changes: 4 additions & 4 deletions doc/html5lib.treewalkers.rst
@@ -10,7 +10,7 @@ treewalkers Package
:show-inheritance:

:mod:`base` Module
-------------------
------------------

.. automodule:: html5lib.treewalkers.base
    :members:
@@ -34,7 +34,7 @@ treewalkers Package
:show-inheritance:

:mod:`etree_lxml` Module
-----------------------
------------------------

.. automodule:: html5lib.treewalkers.etree_lxml
    :members:
@@ -43,9 +43,9 @@ treewalkers Package


:mod:`genshi` Module
--------------------------
--------------------

.. automodule:: html5lib.treewalkers.genshi
    :members:
    :undoc-members:
    :show-inheritance:
    :show-inheritance:
102 changes: 29 additions & 73 deletions doc/movingparts.rst
@@ -4,22 +4,25 @@ The moving parts
html5lib consists of a number of components, which are responsible for
handling its features.

Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
Several tree representations are supported, as are translations to other formats via *tree adapters*.
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
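
As a rough end-to-end sketch of the flow described above: the sample markup, the choice of the ``dom`` tree, and the use of the whitespace filter are illustrative assumptions and are not part of this change.

```python
import html5lib
from html5lib.filters import whitespace
from html5lib.serializer import HTMLSerializer

# Tree builder: parse markup into an in-memory tree (xml.dom.minidom here).
document = html5lib.parse("<p>Hello   <b>World</b>!", treebuilder="dom")

# Tree walker: translate the tree into a stream of tokens.
walker = html5lib.getTreeWalker("dom")
stream = walker(document)

# Filter: wrap the token stream to transform it on the fly.
stream = whitespace.Filter(stream)

# Serializer: turn the token stream back into markup.
# render() joins the serialized chunks into one string.
html = HTMLSerializer(omit_optional_tags=False).render(stream)
print(html)
```

Each stage only consumes the output of the previous one, so a different tree type or filter can be swapped in without touching the rest of the chain.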

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:
the user can later access. html5lib can build three types of trees:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
which can be found in the standard library. Whenever possible, the
accelerated ``ElementTree`` implementation (i.e.
``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.
* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
API. The performance gains are relatively small compared to using the
accelerated ``ElementTree`` module.

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword attribute:
To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:
When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  TreeBuilder = html5lib.getTreeBuilder("dom")
  parser = html5lib.HTMLParser(tree=TreeBuilder)
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).
In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with has a couple of handy tools.

html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.

HTMLSerializer
~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
'>'
'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways, consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation.
You can customize the serializer behaviour in a variety of ways. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:
html5lib provides several filters:

* :class:`alphabeticalattributes.Filter
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
``LintError`` exceptions on invalid tag and attribute names, invalid
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
Contributor:

Is it really an AssertionError? If so, we should write up an issue to change that.

Contributor, Author:

Yeah, the implementation is basically all assert statements:

    assert namespace is None or isinstance(namespace, text_type)
    assert namespace != ""
    assert isinstance(name, text_type)
    assert name != ""
    assert isinstance(token["data"], dict)
    if (not namespace or namespace == namespaces["html"]) and name in voidElements:
        assert type == "EmptyTag"
    else:
        assert type == "StartTag"
    if type == "StartTag" and self.require_matching_tags:
        open_elements.append((namespace, name))
    for (namespace, name), value in token["data"].items():
        assert namespace is None or isinstance(namespace, text_type)
        assert namespace != ""
        assert isinstance(name, text_type)
        assert name != ""
        assert isinstance(value, text_type)
elif type == "EndTag":
    namespace = token["namespace"]
    name = token["name"]
    assert namespace is None or isinstance(namespace, text_type)
    assert namespace != ""
    assert isinstance(name, text_type)
    assert name != ""
    if (not namespace or namespace == namespaces["html"]) and name in voidElements:
        assert False, "Void element reported as EndTag token: %(tag)s" % {"tag": name}
    elif self.require_matching_tags:
        start = open_elements.pop()
        assert start == (namespace, name)
elif type == "Comment":
    data = token["data"]
    assert isinstance(data, text_type)
elif type in ("Characters", "SpaceCharacters"):
    data = token["data"]
    assert isinstance(data, text_type)
    assert data != ""
    if type == "SpaceCharacters":
        assert data.strip(spaceCharacters) == ""
elif type == "Doctype":
    name = token["name"]
    assert name is None or isinstance(name, text_type)
    assert token["publicId"] is None or isinstance(name, text_type)
    assert token["systemId"] is None or isinstance(name, text_type)
elif type == "Entity":
    assert isinstance(token["name"], text_type)
elif type == "SerializerError":
    assert isinstance(token["data"], text_type)
else:
    assert False, "Unknown token type: %(type)s" % {"type": type}

PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
removes tags from the stream which are not necessary to produce valid
removes tags from the token stream which are not necessary to produce valid
HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
collapses all whitespace characters to single spaces unless they're in
``<pre/>`` or ``textarea`` tags.
``<pre/>`` or ``<textarea/>`` tags.
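
To illustrate the idea behind these filters, here is a minimal standard-library sketch of a filter as an iterable that wraps another iterable of tokens. The ``collapse_whitespace`` function and the hand-written token list are hypothetical. html5lib's own filters are classes, and its whitespace filter also preserves ``<pre>`` and ``<textarea>`` content, which this sketch does not attempt.

```python
import re


def collapse_whitespace(stream):
    """Hypothetical filter: yield tokens with runs of whitespace collapsed."""
    for token in stream:
        if token["type"] in ("Characters", "SpaceCharacters"):
            # Copy the token so the caller's original is left untouched.
            token = dict(token, data=re.sub(r"\s+", " ", token["data"]))
        yield token


# A hand-written token stream; the keys mirror those checked by the lint
# filter (type, name, namespace, data), but the values are made up.
tokens = [
    {"type": "StartTag", "name": "p", "namespace": None, "data": {}},
    {"type": "Characters", "data": "Hello \n  World"},
    {"type": "EndTag", "name": "p", "namespace": None},
]

filtered = list(collapse_whitespace(tokens))
```

Because a filter is itself iterable, filters can be stacked by passing one as the source of the next.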

To use a filter, simply wrap it around a stream:
To use a filter, simply wrap it around a token stream:

.. code-block:: python

@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
Tree adapters
-------------

Used to translate one type of tree to another. More documentation
pending, sorry.
Tree adapters can be used to translate between tree formats.
Two adapters are provided by html5lib:

* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
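
A small sketch of the SAX adapter in use, assuming the default ``etree`` tree. The ``NameCollector`` handler and the sample markup are hypothetical and only record the local names of the elements reported to the handler.

```python
from xml.sax.handler import ContentHandler

import html5lib
from html5lib.treeadapters import sax


class NameCollector(ContentHandler):
    """Hypothetical SAX handler that records element local names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace, localname) pair.
        self.names.append(name[1])


document = html5lib.parse("<p>Hello <b>World</b>!")
walker = html5lib.getTreeWalker("etree")

handler = NameCollector()
# to_sax() consumes the walker's token stream and calls the handler.
sax.to_sax(walker(document), handler)
print(handler.names)
```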

Encoding discovery
------------------
@@ -156,54 +150,16 @@ the following way:
* The encoding may be explicitly specified by passing the name of the
encoding as the encoding parameter to the
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
``HTMLParser`` objects.
:class:`~html5lib.html5parser.HTMLParser` objects.

* If no encoding is specified, the parser will attempt to detect the
encoding from a ``<meta>`` element in the first 512 bytes of the
document (this is only a partial implementation of the current HTML
5 specification).
specification).

* If no encoding can be found and the chardet library is available, an
* If no encoding can be found and the :mod:`chardet` library is available, an
attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
a common fallback used by Web browsers.
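
The fallback order in the list above can be sketched with the standard library alone. This is a simplified illustration, not html5lib's implementation: ``guess_encoding`` and its regular expression are hypothetical, and the ``<meta>`` scan here is far cruder than the one the parser performs.

```python
import re

# Crude pattern for a charset declaration inside a <meta> tag (assumption:
# good enough for illustration, not a faithful prescan).
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def guess_encoding(data, explicit=None, default="windows-1252"):
    """Hypothetical helper: pick an encoding name for the bytes in data."""
    # 1. An explicitly specified encoding always wins.
    if explicit:
        return explicit

    # 2. Look for a <meta> declaration in the first 512 bytes.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. If chardet is installed, sniff the byte pattern.
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        detected = chardet.detect(data).get("encoding")
        if detected:
            return detected

    # 4. Otherwise fall back to the default.
    return default
```

Each step returns as soon as it finds an answer, so later and less reliable steps run only when the earlier ones come up empty.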


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

import html5lib
from html5lib import sanitizer

p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
24 changes: 17 additions & 7 deletions html5lib/__init__.py
@@ -1,14 +1,23 @@
"""
HTML parsing library based on the WHATWG "HTML5"
specification. The parser is designed to be compatible with existing
HTML found in the wild and implements well-defined error recovery that
HTML parsing library based on the `WHATWG HTML specification
<https://whatwg.org/html>`_. The parser is designed to be compatible with
existing HTML found in the wild and implements well-defined error recovery that
is largely compatible with modern desktop web browsers.

Example usage:
Example usage::

import html5lib
f = open("my_document.html")
tree = html5lib.parse(f)
    import html5lib
    with open("my_document.html", "rb") as f:
        tree = html5lib.parse(f)

For convenience, this module re-exports the following names:

* :func:`~.html5parser.parse`
* :func:`~.html5parser.parseFragment`
* :class:`~.html5parser.HTMLParser`
* :func:`~.treebuilders.getTreeBuilder`
* :func:`~.treewalkers.getTreeWalker`
* :func:`~.serializer.serialize`
"""

from __future__ import absolute_import, division, unicode_literals
@@ -22,4 +31,5 @@
"getTreeWalker", "serialize"]

# this has to be at the top level, see how setup.py parses this
#: Distribution version number.
__version__ = "0.9999999999-dev"
5 changes: 5 additions & 0 deletions tox.ini
@@ -11,7 +11,12 @@ deps =
  base: webencodings
  py26-base: ordereddict
  optional: -r{toxinidir}/requirements-optional.txt
  doc: Sphinx

commands =
  {envbindir}/py.test {posargs}
  {toxinidir}/flake8-run.sh

[testenv:doc]
changedir = doc
commands = sphinx-build -b html . _build
