Nov 6, 2017 · Apr 15, 2017 · Apr 15, 2017 · Apr 15, 2017 · Apr 15, 2017 · Apr 15, 2017
diff --git a/AUTHORS.rst b/AUTHORS.rst
 - Jon Dufresne
 - Ville Skyttä
 - Jonathan Vanasco
 - Tom Most
diff --git a/CHANGES.rst b/CHANGES.rst

 * Cease supporting DATrie under PyPy.

 * **Remove ``PullDOM`` support, as this hasn't ever been properly
 * **Remove PullDOM support, as this hasn't ever been properly
  tested, doesn't entirely work, and as far as I can tell is
  completely unused by anyone.**

  to clarify their status as public.**

 * **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
  sanitizer.htmlsanitizer module and move that to saniziter. This means
  sanitizer.htmlsanitizer module and move that to sanitizer. This means
  anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
  code changes.**

diff --git a/doc/html5lib.rst b/doc/html5lib.rst
 html5lib Package
 ================

 :mod:`html5lib` Package
 -----------------------

 .. automodule:: html5lib.__init__
     :members:
     :undoc-members:
     :show-inheritance:
 .. automodule:: html5lib
     :members: __version__

 :mod:`constants` Module
 -----------------------
 :show-inheritance:

 :mod:`serializer` Module
 ----------------------
 ------------------------

 .. automodule:: html5lib.serializer
     :members:
 html5lib.filters
 html5lib.treebuilders
 html5lib.treewalkers
 html5lib.treeadapters

diff --git a/doc/html5lib.treeadapters.rst b/doc/html5lib.treeadapters.rst
 treebuilders Package
 ====================

 :mod:`~html5lib.treeadapters` Package
 -------------------------------------

 .. automodule:: html5lib.treeadapters
     :members:
     :undoc-members:
     :show-inheritance:

 .. automodule:: html5lib.treeadapters.genshi
     :members:
     :undoc-members:
     :show-inheritance:

 .. automodule:: html5lib.treeadapters.sax
     :members:
     :undoc-members:
     :show-inheritance:
diff --git a/doc/html5lib.treewalkers.rst b/doc/html5lib.treewalkers.rst
 :show-inheritance:

 :mod:`base` Module
 -------------------
 ------------------

 .. automodule:: html5lib.treewalkers.base
     :members:
     :show-inheritance:

 :mod:`etree_lxml` Module
 -----------------------
 ------------------------

 .. automodule:: html5lib.treewalkers.etree_lxml
     :members:


 :mod:`genshi` Module
 --------------------------
 --------------------

 .. automodule:: html5lib.treewalkers.genshi
     :members:
     :undoc-members:
     :show-inheritance:
     :show-inheritance:
diff --git a/doc/movingparts.rst b/doc/movingparts.rst
 html5lib consists of a number of components, which are responsible for
 handling its features.

 Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
 Several tree representations are supported, as are translations to other formats via *tree adapters*.
 The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
 The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

 Tree builders
 -------------

 The parser reads HTML by tokenizing the content and building a tree that
 the user can later access. There are three main types of trees that
 html5lib can build:
 the user can later access. html5lib can build three types of trees:

 * ``etree`` - this is the default; builds a tree based on ``xml.etree``,
 * ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

 * ``dom`` - builds a tree based on ``xml.dom.minidom``.
 * ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

 * ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
 * ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
  API.  The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

   with open("mydocument.html", "rb") as f:
       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

 When instantiating a parser object, you have to pass a tree builder
 class in the ``tree`` keyword attribute:
 To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

 .. code-block:: python

   import html5lib
   parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
   document = parser.parse("<p>Hello World!")

 To get a builder class by name, use the ``getTreeBuilder`` function:
 When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:

 .. code-block:: python

   import html5lib
   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
   TreeBuilder = html5lib.getTreeBuilder("dom")
   parser = html5lib.HTMLParser(tree=TreeBuilder)
   minidom_document = parser.parse("<p>Hello World!")

 The implementation of builders can be found in `html5lib/treebuilders/
 Tree walkers
 ------------

 Once a tree is ready, you can work on it either manually, or using
 a tree walker, which provides a streaming view of the tree. html5lib
 provides walkers for all three supported types of trees (``etree``,
 ``dom`` and ``lxml``).
 In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
 html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

 The implementation of walkers can be found in `html5lib/treewalkers/
 <https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
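
To make the idea of a "streaming view" concrete, here is a minimal sketch of a walker over a standard-library ``xml.etree.ElementTree`` element. It is not html5lib's implementation: the ``walk`` function and the simplified token dictionaries (only ``type``, ``name`` and ``data`` keys) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def walk(element):
    """Yield a flat stream of token dicts for an element and its subtree.

    Hypothetical helper for illustration; the token shape is simplified.
    """
    yield {"type": "StartTag", "name": element.tag, "data": dict(element.attrib)}
    if element.text:
        # Text that appears before the first child element.
        yield {"type": "Characters", "data": element.text}
    for child in element:
        for token in walk(child):
            yield token
        if child.tail:
            # Text that follows the child's end tag, still inside the parent.
            yield {"type": "Characters", "data": child.tail}
    yield {"type": "EndTag", "name": element.tag}


root = ET.fromstring("<p>Hello <b>World</b>!</p>")
tokens = list(walk(root))
for token in tokens:
    print(token)
```

The consumer never touches the tree itself; it only sees tokens in document order, which is what lets serializers and filters work the same way regardless of the tree type underneath.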

 Walkers make consuming HTML easier. html5lib uses them to provide you
 with a couple of handy tools.

 html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.

 HTMLSerializer
 ~~~~~~~~~~~~~~
  '>'
  'Witam wszystkich'

 You can customize the serializer behaviour in a variety of ways, consult
 the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
 documentation.
 You can customize the serializer behaviour in a variety of ways. Consult
 the :class:`~html5lib.serializer.HTMLSerializer` documentation.


 Filters
 ~~~~~~~

 You can alter the stream content with filters provided by html5lib:
 html5lib provides several filters:

 * :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  the document

 * :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  ``LintError`` exceptions on invalid tag and attribute names, invalid
  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

 * :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the stream which are not necessary to produce valid
  removes tags from the token stream which are not necessary to produce valid
  HTML

 * :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes

 * :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre/>`` or ``textarea`` tags.
  ``<pre/>`` or ``<textarea/>`` tags.
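
As a rough illustration of what a filter does, the sketch below collapses whitespace in a token stream while leaving ``pre`` and ``textarea`` content alone. It is a simplified stand-in, not the code of ``whitespace.Filter``: the ``collapse_whitespace`` generator and the reduced token dictionaries are assumptions.

```python
import re

# Runs of HTML whitespace characters.
SPACE_RUN = re.compile(r"[ \t\n\r\f]+")
# Elements whose text must be kept verbatim.
PRESERVE = frozenset(["pre", "textarea"])


def collapse_whitespace(tokens):
    """Wrap a token stream and yield tokens with whitespace runs collapsed.

    Hypothetical helper; tokens are plain dicts with "type", "name", "data".
    """
    depth = 0  # how many preserving elements we are currently inside
    for token in tokens:
        if token["type"] == "StartTag" and token["name"] in PRESERVE:
            depth += 1
        elif token["type"] == "EndTag" and token["name"] in PRESERVE and depth:
            depth -= 1
        elif token["type"] == "Characters" and not depth:
            # Copy the token rather than mutating the caller's dict.
            token = dict(token, data=SPACE_RUN.sub(" ", token["data"]))
        yield token


stream = [
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "a  \n b"},
    {"type": "EndTag", "name": "p"},
    {"type": "StartTag", "name": "pre"},
    {"type": "Characters", "data": "x  y"},
    {"type": "EndTag", "name": "pre"},
]
filtered = list(collapse_whitespace(stream))
print(filtered[1]["data"], "|", filtered[4]["data"])
```

Because the filter is itself an iterable of tokens, several filters can be stacked by wrapping one around another before handing the result to a serializer.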

 To use a filter, simply wrap it around a stream:
 To use a filter, simply wrap it around a token stream:

 .. code-block:: python

 Tree adapters
 -------------

 Used to translate one type of tree to another. More documentation
 pending, sorry.
 Tree adapters can be used to translate between tree formats.
 Two adapters are provided by html5lib:

 * :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
 * :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
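
The phrase "calls a SAX handler based on the tree" can be pictured with the standard library alone. The sketch below replays an ``xml.etree.ElementTree`` element into an ``xml.sax`` content handler. It does not reproduce ``to_sax()``: the ``replay`` function and the ``Recorder`` handler are hypothetical names, and namespaces are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class Recorder(ContentHandler):
    """Hypothetical handler that records the SAX events it receives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def replay(element, handler):
    """Call SAX handler methods in document order for an element subtree."""
    handler.startElement(element.tag, AttributesImpl(dict(element.attrib)))
    if element.text:
        handler.characters(element.text)
    for child in element:
        replay(child, handler)
        if child.tail:
            handler.characters(child.tail)
    handler.endElement(element.tag)


handler = Recorder()
handler.startDocument()
replay(ET.fromstring("<p>Hi <b>there</b></p>"), handler)
handler.endDocument()
print(handler.events)
```

Any existing SAX consumer can be driven this way without re-parsing the document text.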

 Encoding discovery
 ------------------
 * The encoding may be explicitly specified by passing the name of the
  encoding as the encoding parameter to the
 :meth:`~html5lib.html5parser.HTMLParser.parse` method on
 ``HTMLParser`` objects.
 :class:`~html5lib.html5parser.HTMLParser` objects.

 * If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>``  element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
  5 specification).
  specification).

 * If no encoding can be found and the chardet library is available, an
 * If no encoding can be found and the :mod:`chardet` library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

 * If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
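
The fallback order described above can be sketched as a single function. This is a simplified illustration, not html5lib's detection code: ``guess_encoding``, its parameters and the regular expression are assumptions, and the ``<meta>`` scan is far cruder than the algorithm in the HTML specification.

```python
import re

# Very rough match for charset=... inside a <meta> tag (illustrative only).
META_CHARSET = re.compile(
    br"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def guess_encoding(data, override=None, sniff=True, default="windows-1252"):
    """Return an encoding name for a byte string, trying each step in order."""
    # 1. An explicitly specified encoding always wins.
    if override:
        return override
    # 2. Look for a <meta> declaration in the first 512 bytes.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. If chardet is installed, let it sniff the byte pattern.
    if sniff:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            detected = chardet.detect(data).get("encoding")
            if detected:
                return detected.lower()
    # 4. Otherwise fall back to the default.
    return default


print(guess_encoding(b'<meta charset="utf-8"><p>x'))
print(guess_encoding(b"<p>x", override="iso-8859-2"))
print(guess_encoding(b"<p>x", sniff=False))
```

The ``sniff`` flag exists only so the last step can be exercised on machines where chardet happens to be installed.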


 Tokenizers
 ----------

 The part of the parser responsible for translating a raw input stream
 into meaningful tokens is the tokenizer. Currently html5lib provides
 two.

 To set up a tokenizer, simply pass it when instantiating
 a :class:`~html5lib.html5parser.HTMLParser`:

 .. code-block:: python

   import html5lib
   from html5lib import sanitizer

   p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

 HTMLTokenizer
 ~~~~~~~~~~~~~

 This is the default tokenizer, the heart of html5lib. The implementation
 can be found in `html5lib/tokenizer.py
 <https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

 HTMLSanitizer
 ~~~~~~~~~~~~~

 This is a tokenizer that removes unsafe markup and CSS styles from the
 input. Elements that are known to be safe are passed through and the
 rest is converted to visible text. The default configuration of the
 sanitizer follows the `WHATWG Sanitization Rules
 <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

 The implementation can be found in `html5lib/sanitizer.py
 <https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
diff --git a/html5lib/__init__.py b/html5lib/__init__.py
 """
 HTML parsing library based on the WHATWG "HTML5"
 specification. The parser is designed to be compatible with existing
 HTML found in the wild and implements well-defined error recovery that
 HTML parsing library based on the `WHATWG HTML specification
 <https://whatwg.org/html>`_. The parser is designed to be compatible with
 existing HTML found in the wild and implements well-defined error recovery that
 is largely compatible with modern desktop web browsers.

 Example usage:
 Example usage::

 import html5lib
 f = open("my_document.html")
 tree = html5lib.parse(f)
    import html5lib
    with open("my_document.html", "rb") as f:
        tree = html5lib.parse(f)

 For convenience, this module re-exports the following names:

 * :func:`~.html5parser.parse`
 * :func:`~.html5parser.parseFragment`
 * :class:`~.html5parser.HTMLParser`
 * :func:`~.treebuilders.getTreeBuilder`
 * :func:`~.treewalkers.getTreeWalker`
 * :func:`~.serializer.serialize`
 """

 from __future__ import absolute_import, division, unicode_literals
 "getTreeWalker", "serialize"]

 # this has to be at the top level, see how setup.py parses this
 #: Distribution version number.
 __version__ = "0.9999999999-dev"
diff --git a/tox.ini b/tox.ini
  base: webencodings
  py26-base: ordereddict
  optional: -r{toxinidir}/requirements-optional.txt
  doc: Sphinx

 commands =
  {envbindir}/py.test {posargs}
  {toxinidir}/flake8-run.sh

 [testenv:doc]
 changedir = doc
 commands = sphinx-build -b html . _build
Review comment on doc/html5lib.rst, at the removed ``:mod:`html5lib` Package`` header:

willkg (Contributor), Oct 31, 2017: Why take the header out here?

twm (Contributor, Author), Nov 2, 2017 (edited): Otherwise there are two headers in a row with exactly the same text.
