html5lib/html5lib-python


Commit 69606e5

Merge pull request #332 from twm/update-docs

Update docs

2 parents 9f9dfdb + deb98bb, commit 69606e5

8 files changed: 82 additions, 94 deletions


`‎AUTHORS.rst‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -45,3 +45,4 @@ Patches and suggestions`
`45`	`45`	`- Jon Dufresne`
`46`	`46`	`- Ville Skyttä`
`47`	`47`	`- Jonathan Vanasco`
	`48`	`+- Tom Most`

`‎CHANGES.rst‎`

Lines changed: 2 additions & 2 deletions

Original file line number	Diff line number	Diff line change
`@@ -32,7 +32,7 @@ Released on July 14, 2016`
`32`	`32`
`33`	`33`	`* Cease supporting DATrie under PyPy.`
`34`	`34`
`35`		-* **Remove ``PullDOM`` support, as this hasn't ever been properly
	`35`	`+* **Remove PullDOM support, as this hasn't ever been properly`
`36`	`36`	`tested, doesn't entirely work, and as far as I can tell is`
`37`	`37`	`completely unused by anyone.**`
`38`	`38`
`@@ -70,7 +70,7 @@ Released on July 14, 2016`
`70`	`70`	`to clarify their status as public.**`
`71`	`71`
`72`	`72`	`* **Get rid of the sanitizer package. Merge sanitizer.sanitize into the`
`73`		`- sanitizer.htmlsanitizer module and move that to saniziter. This means`
	`73`	`+ sanitizer.htmlsanitizer module and move that to sanitizer. This means`
`74`	`74`	`anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no`
`75`	`75`	`code changes.**`
`76`	`76`

`‎doc/html5lib.rst‎`

Lines changed: 4 additions & 8 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,13 +1,8 @@`
`1`	`1`	`html5lib Package`
`2`	`2`	`================`
`3`	`3`
`4`		-:mod:`html5lib` Package
`5`		`------------------------`
`6`		`-`
`7`		`-.. automodule:: html5lib.__init__`
`8`		`-:members:`
`9`		`-:undoc-members:`
`10`		`-:show-inheritance:`
	`4`	`+.. automodule:: html5lib`
	`5`	`+:members: __version__`
`11`	`6`
`12`	`7`	:mod:`constants` Module
`13`	`8`	`-----------------------`
`@@ -26,7 +21,7 @@ html5lib Package`
`26`	`21`	`:show-inheritance:`
`27`	`22`
`28`	`23`	:mod:`serializer` Module
`29`		`-----------------------`
	`24`	`+------------------------`
`30`	`25`
`31`	`26`	`.. automodule:: html5lib.serializer`
`32`	`27`	`:members:`
`@@ -41,4 +36,5 @@ Subpackages`
`41`	`36`	`html5lib.filters`
`42`	`37`	`html5lib.treebuilders`
`43`	`38`	`html5lib.treewalkers`
	`39`	`+html5lib.treeadapters`
`44`	`40`

`‎doc/html5lib.treeadapters.rst‎`

Lines changed: 20 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,20 @@`
	`1`	`+treebuilders Package`
	`2`	`+====================`
	`3`	`+`
	`4`	+:mod:`~html5lib.treeadapters` Package
	`5`	`+-------------------------------------`
	`6`	`+`
	`7`	`+.. automodule:: html5lib.treeadapters`
	`8`	`+:members:`
	`9`	`+:undoc-members:`
	`10`	`+:show-inheritance:`
	`11`	`+`
	`12`	`+.. automodule:: html5lib.treeadapters.genshi`
	`13`	`+:members:`
	`14`	`+:undoc-members:`
	`15`	`+:show-inheritance:`
	`16`	`+`
	`17`	`+.. automodule:: html5lib.treeadapters.sax`
	`18`	`+:members:`
	`19`	`+:undoc-members:`
	`20`	`+:show-inheritance:`

`‎doc/html5lib.treewalkers.rst‎`

Lines changed: 4 additions & 4 deletions

Original file line number	Diff line number	Diff line change
`@@ -10,7 +10,7 @@ treewalkers Package`
`10`	`10`	`:show-inheritance:`
`11`	`11`
`12`	`12`	:mod:`base` Module
`13`		`--------------------`
	`13`	`+------------------`
`14`	`14`
`15`	`15`	`.. automodule:: html5lib.treewalkers.base`
`16`	`16`	`:members:`
`@@ -34,7 +34,7 @@ treewalkers Package`
`34`	`34`	`:show-inheritance:`
`35`	`35`
`36`	`36`	:mod:`etree_lxml` Module
`37`		`------------------------`
	`37`	`+------------------------`
`38`	`38`
`39`	`39`	`.. automodule:: html5lib.treewalkers.etree_lxml`
`40`	`40`	`:members:`
`@@ -43,9 +43,9 @@ treewalkers Package`
`43`	`43`
`44`	`44`
`45`	`45`	:mod:`genshi` Module
`46`		`---------------------------`
	`46`	`+--------------------`
`47`	`47`
`48`	`48`	`.. automodule:: html5lib.treewalkers.genshi`
`49`	`49`	`:members:`
`50`	`50`	`:undoc-members:`
`51`		`-:show-inheritance:`
	`51`	`+:show-inheritance:`

`‎doc/movingparts.rst‎`

Lines changed: 29 additions & 73 deletions

Original file line number	Diff line number	Diff line change
`@@ -4,22 +4,25 @@ The moving parts`
`4`	`4`	`html5lib consists of a number of components, which are responsible for`
`5`	`5`	`handling its features.`
`6`	`6`
	`7`	`+Parsing uses a tree builder to generate a tree, the in-memory representation of the document.`
	`8`	`+Several tree representations are supported, as are translations to other formats via tree adapters.`
	`9`	+The tree may be translated to a token stream with a tree walker, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
	`10`	`+The token stream may also be transformed by use of filters to accomplish tasks like sanitization.`
`7`	`11`
`8`	`12`	`Tree builders`
`9`	`13`	`-------------`
`10`	`14`
`11`	`15`	`The parser reads HTML by tokenizing the content and building a tree that`
`12`		`-the user can later access. There are three main types of trees that`
`13`		`-html5lib can build:`
	`16`	`+the user can later access. html5lib can build three types of trees:`
`14`	`17`
`15`		-* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
	`18`	+* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
`16`	`19`	`which can be found in the standard library. Whenever possible, the`
`17`	`20`	accelerated ``ElementTree`` implementation (i.e.
`18`	`21`	``xml.etree.cElementTree`` on Python 2.x) is used.
`19`	`22`
`20`		-* ``dom`` - builds a tree based on ``xml.dom.minidom``.
	`23`	+* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
`21`	`24`
`22`		-* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
	`25`	+* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
`23`	`26`	`API. The performance gains are relatively small compared to using the`
`24`	`27`	accelerated ``ElementTree`` module.
`25`	`28`
`@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:`
`31`	`34`	`with open("mydocument.html", "rb") as f:`
`32`	`35`	`    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")`
`33`	`36`
`34`		`-When instantiating a parser object, you have to pass a tree builder`
`35`		-class in the ``tree`` keyword attribute:
	`37`	+To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
`36`	`38`
`37`		`-.. code-block:: python`
`38`		`-`
`39`		`-import html5lib`
`40`		`- parser = html5lib.HTMLParser(tree=SomeTreeBuilder)`
`41`		`- document = parser.parse("<p>Hello World!")`
`42`		`-`
`43`		-To get a builder class by name, use the ``getTreeBuilder`` function:
	`39`	+When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
`44`	`40`
`45`	`41`	`.. code-block:: python`
`46`	`42`
`47`	`43`	`import html5lib`
`48`		`- parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))`
	`44`	`+ TreeBuilder = html5lib.getTreeBuilder("dom")`
	`45`	`+ parser = html5lib.HTMLParser(tree=TreeBuilder)`
`49`	`46`	`minidom_document = parser.parse("<p>Hello World!")`
`50`	`47`
`51`	`48`	The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
`55`	`52`	`Tree walkers`
`56`	`53`	`------------`
`57`	`54`
`58`		`-Once a tree is ready, you can work on it either manually, or using`
`59`		`-a tree walker, which provides a streaming view of the tree. html5lib`
`60`		-provides walkers for all three supported types of trees (``etree``,
`61`		-``dom`` and ``lxml``).
	`55`	`+In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.`
	`56`	+html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
`62`	`57`
`63`	`58`	The implementation of walkers can be found in `html5lib/treewalkers/
`64`	`59`	<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
`65`	`60`
`66`		`-Walkers make consuming HTML easier. html5lib uses them to provide you`
`67`		`-with has a couple of handy tools.`
`68`		`-`
	`61`	+html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.
`69`	`62`
`70`	`63`	`HTMLSerializer`
`71`	`64`	`~~~~~~~~~~~~~~`
`@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.`
`90`	`83`	`'>'`
`91`	`84`	`'Witam wszystkich'`
`92`	`85`
`93`		`-You can customize the serializer behaviour in a variety of ways, consult`
`94`		-the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
`95`		`-documentation.`
	`86`	`+You can customize the serializer behaviour in a variety of ways. Consult`
	`87`	+the :class:`~html5lib.serializer.HTMLSerializer` documentation.
`96`	`88`
`97`	`89`
`98`	`90`	`Filters`
`99`	`91`	`~~~~~~~`
`100`	`92`
`101`		`-You can alter the stream content with filters provided by html5lib:`
	`93`	`+html5lib provides several filters:`
`102`	`94`
`103`	`95`	* :class:`alphabeticalattributes.Filter
`104`	`96`	<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
`@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:`
`110`	`102`	`the document`
`111`	`103`
`112`	`104`	* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
`113`		-``LintError`` exceptions on invalid tag and attribute names, invalid
	`105`	+:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
`114`	`106`	`PCDATA, etc.`
`115`	`107`
`116`	`108`	* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
`117`		`- removes tags from the stream which are not necessary to produce valid`
	`109`	`+ removes tags from the token stream which are not necessary to produce valid`
`118`	`110`	`HTML`
`119`	`111`
`120`	`112`	* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
`@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:`
`125`	`117`
`126`	`118`	* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
`127`	`119`	`collapses all whitespace characters to single spaces unless they're in`
`128`		- ``<pre/>`` or ``textarea`` tags.
	`120`	+ ``<pre/>`` or ``<textarea/>`` tags.
`129`	`121`
`130`		`-To use a filter, simply wrap it around a stream:`
	`122`	`+To use a filter, simply wrap it around a token stream:`
`131`	`123`
`132`	`124`	`.. code-block:: python`
`133`	`125`
`@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:`
`142`	`134`	`Tree adapters`
`143`	`135`	`-------------`
`144`	`136`
`145`		`-Used to translate one type of tree to another. More documentation`
`146`		`-pending, sorry.`
	`137`	`+Tree adapters can be used to translate between tree formats.`
	`138`	`+Two adapters are provided by html5lib:`
`147`	`139`
	`140`	+* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
	`141`	+* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
`148`	`142`
`149`	`143`	`Encoding discovery`
`150`	`144`	`------------------`
`@@ -156,54 +150,16 @@ the following way:`
`156`	`150`	`* The encoding may be explicitly specified by passing the name of the`
`157`	`151`	`encoding as the encoding parameter to the`
`158`	`152`	:meth:`~html5lib.html5parser.HTMLParser.parse` method on
`159`		-``HTMLParser`` objects.
	`153`	+:class:`~html5lib.html5parser.HTMLParser` objects.
`160`	`154`
`161`	`155`	`* If no encoding is specified, the parser will attempt to detect the`
`162`	`156`	encoding from a ``<meta>`` element in the first 512 bytes of the
`163`	`157`	`document (this is only a partial implementation of the current HTML`
`164`		`- 5 specification).`
	`158`	`+ specification).`
`165`	`159`
`166`		`-* If no encoding can be found and the chardet library is available, an`
	`160`	+* If no encoding can be found and the :mod:`chardet` library is available, an
`167`	`161`	`attempt will be made to sniff the encoding from the byte pattern.`
`168`	`162`
`169`	`163`	`* If all else fails, the default encoding will be used. This is usually`
`170`	`164`	`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
`171`	`165`	`a common fallback used by Web browsers.`
`172`		`-`
`173`		`-`
`174`		`-Tokenizers`
`175`		`-----------`
`176`		`-`
`177`		`-The part of the parser responsible for translating a raw input stream`
`178`		`-into meaningful tokens is the tokenizer. Currently html5lib provides`
`179`		`-two.`
`180`		`-`
`181`		`-To set up a tokenizer, simply pass it when instantiating`
`182`		-a :class:`~html5lib.html5parser.HTMLParser`:
`183`		`-`
`184`		`-.. code-block:: python`
`185`		`-`
`186`		`-import html5lib`
`187`		`-from html5lib import sanitizer`
`188`		`-`
`189`		`- p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)`
`190`		`- p.parse("<p>Surprise!<script>alert('Boo!');</script>")`
`191`		`-`
`192`		`-HTMLTokenizer`
`193`		`-~~~~~~~~~~~~~`
`194`		`-`
`195`		`-This is the default tokenizer, the heart of html5lib. The implementation`
`196`		-can be found in `html5lib/tokenizer.py
`197`		-<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
`198`		`-`
`199`		`-HTMLSanitizer`
`200`		`-~~~~~~~~~~~~~`
`201`		`-`
`202`		`-This is a tokenizer that removes unsafe markup and CSS styles from the`
`203`		`-input. Elements that are known to be safe are passed through and the`
`204`		`-rest is converted to visible text. The default configuration of the`
`205`		-sanitizer follows the `WHATWG Sanitization Rules
`206`		-<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
`207`		`-`
`208`		-The implementation can be found in `html5lib/sanitizer.py
`209`		-<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
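
The fallback order described in the "Encoding discovery" section of doc/movingparts.rst can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not html5lib's implementation: the function `discover_encoding`, its parameters (including `use_chardet`), and the `META_CHARSET` pattern are all hypothetical names introduced here, and the `<meta>` matching is far cruder than what a real parser does.

```python
import re

# Hypothetical, simplified pattern for a charset declaration in a <meta> tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_:.-]+)""", re.IGNORECASE
)


def discover_encoding(data, explicit=None, default="windows-1252", use_chardet=True):
    """Return an encoding name for ``data`` (bytes), trying each step in order."""
    # 1. An explicitly specified encoding wins.
    if explicit:
        return explicit
    # 2. Look for a <meta> declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Sniff the byte pattern, but only if chardet is installed.
    if use_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            guess = chardet.detect(data).get("encoding")
            if guess:
                return guess.lower()
    # 4. Fall back to the default encoding.
    return default
```

Passing `use_chardet=False` skips the optional sniffing step, which makes the fallback to the default easy to observe regardless of what is installed.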

`‎html5lib/__init__.py‎`

Lines changed: 17 additions & 7 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,14 +1,23 @@`
`1`	`1`	`"""`
`2`		`-HTML parsing library based on the WHATWG "HTML5"`
`3`		`-specification. The parser is designed to be compatible with existing`
`4`		`-HTML found in the wild and implements well-defined error recovery that`
	`2`	+HTML parsing library based on the `WHATWG HTML specification
	`3`	+<https://whatwg.org/html>`_. The parser is designed to be compatible with
	`4`	`+existing HTML found in the wild and implements well-defined error recovery that`
`5`	`5`	`is largely compatible with modern desktop web browsers.`
`6`	`6`
`7`		`-Example usage:`
	`7`	`+Example usage::`
`8`	`8`
`9`		`-import html5lib`
`10`		`-f = open("my_document.html")`
`11`		`-tree = html5lib.parse(f)`
	`9`	`+ import html5lib`
	`10`	`+ with open("my_document.html", "rb") as f:`
	`11`	`+ tree = html5lib.parse(f)`
	`12`	`+`
	`13`	`+For convenience, this module re-exports the following names:`
	`14`	`+`
	`15`	+* :func:`~.html5parser.parse`
	`16`	+* :func:`~.html5parser.parseFragment`
	`17`	+* :class:`~.html5parser.HTMLParser`
	`18`	+* :func:`~.treebuilders.getTreeBuilder`
	`19`	+* :func:`~.treewalkers.getTreeWalker`
	`20`	+* :func:`~.serializer.serialize`
`12`	`21`	`"""`
`13`	`22`
`14`	`23`	`from __future__ import absolute_import, division, unicode_literals`
`@@ -22,4 +31,5 @@`
`22`	`31`	`"getTreeWalker", "serialize"]`
`23`	`32`
`24`	`33`	`# this has to be at the top level, see how setup.py parses this`
	`34`	`+#: Distribution version number.`
`25`	`35`	`__version__ = "0.9999999999-dev"`
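
As a rough illustration of the re-exported names listed in the new docstring, the sketch below checks that the top-level names are the same objects as those in the submodules, then parses a fragment and serializes it back. It assumes html5lib is installed; the sample markup and variable names are illustrative only, and the exact serialized output depends on serializer options, so none is shown.

```python
import html5lib
from html5lib import html5parser, serializer, treebuilders, treewalkers

# The top-level names are the same objects as those defined in the submodules.
assert html5lib.parse is html5parser.parse
assert html5lib.getTreeBuilder is treebuilders.getTreeBuilder
assert html5lib.getTreeWalker is treewalkers.getTreeWalker
assert html5lib.serialize is serializer.serialize

# Parse with the "dom" tree builder, then serialize using the matching tree type.
document = html5lib.parse("<p>Hello World!", treebuilder="dom")
html = html5lib.serialize(document, tree="dom")
```

The `tree` argument to `serialize` names the tree walker to use, so it should match the tree builder that produced the document.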

`‎tox.ini‎`

Lines changed: 5 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -11,7 +11,12 @@ deps =`
`11`	`11`	`base: webencodings`
`12`	`12`	`py26-base: ordereddict`
`13`	`13`	`optional: -r{toxinidir}/requirements-optional.txt`
	`14`	`+ doc: Sphinx`
`14`	`15`
`15`	`16`	`commands =`
`16`	`17`	`{envbindir}/py.test {posargs}`
`17`	`18`	`{toxinidir}/flake8-run.sh`
	`19`	`+`
	`20`	`+[testenv:doc]`
	`21`	`+changedir = doc`
	`22`	`+commands = sphinx-build -b html . _build`

0 commit comments
