Standards-compliant library for parsing and serializing HTML documents and fragments in Python


html5lib/html5lib-python


html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
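As a minimal sketch of what the default tree looks like, the snippet below parses the same fragment and inspects the result with the standard ElementTree API. It assumes the default settings, under which element names are placed in the XHTML namespace; the `XHTML` constant is a helper introduced here for illustration.

```python
import html5lib

document = html5lib.parse("<p>Hello World!")

# The parser builds a complete document, so the root element is <html>
# even though the input only contained a <p>.
print(document.tag)

# With default settings, tag names carry the XHTML namespace, so lookups
# need the namespace prefix.
XHTML = "{http://www.w3.org/1999/xhtml}"  # helper constant for this sketch
paragraph = document.find(".//" + XHTML + "p")
print(paragraph.text)
```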

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
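The same keyword selects the minidom tree, which needs nothing beyond the standard library. The following is a small sketch, assuming the treebuilder name "dom" maps to xml.dom.minidom; the variable names are illustrative.

```python
import html5lib

# Ask for a minidom tree instead of the default ElementTree one.
minidom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")

# The result is a DOM document, so the usual DOM lookups apply.
paragraphs = minidom_document.getElementsByTagName("p")
first_text = paragraphs[0].firstChild.data
print(first_text)
```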

When using html5lib with urllib2 (Python 2), the charset from HTTP should be passed into html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib
with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), the charset from HTTP should be passed into html5lib as follows:

from urllib.request import urlopen
import html5lib
with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
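To see the effect of transport_encoding without a network request, here is a rough offline sketch. The `body` bytes and the `charset` value are made up for illustration and stand in for a response body and the charset taken from its Content-Type header.

```python
import html5lib

# Hypothetical response body: UTF-8 bytes with no <meta charset> declaration.
body = "<p>caf\u00e9".encode("utf-8")

# In a real request this value would come from the HTTP Content-Type header.
charset = "utf-8"

document = html5lib.parse(body, transport_encoding=charset)

# Elements are namespaced by default, hence the XHTML prefix in the path.
paragraph = document.find(".//{http://www.w3.org/1999/xhtml}p")
print(paragraph.text)
```

Without the hint, the parser has to guess the encoding of a byte stream, and the guess can be wrong for input like this that does not declare its own charset.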

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
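The difference between the two modes can be sketched with a string that lacks a doctype, which counts as a parse error. This example assumes that a non-strict parser records errors on its `errors` attribute and that strict mode raises `ParseError` from the `html5lib.html5parser` module; the `raised` flag is just bookkeeping for the sketch.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, so the input is not fully conforming

# Default mode: the parser recovers and keeps a record of what went wrong.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
print(len(lenient.errors), "parse error(s) recorded")

# Strict mode: the first parse error is raised as an exception.
strict = html5lib.HTMLParser(strict=True)
raised = False
try:
    strict.parse(markup)
except ParseError as exc:
    raised = True
    print("strict parser raised:", exc)
```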

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.

Installation

html5lib works on CPython 2.7+, CPython 3.3+ and PyPy. To install it, use:

$ pip install html5lib

Optional Dependencies

The following third-party libraries may be used for additional functionality:

  • datrie can be used under CPython to improve parsing performance (though in almost all cases the improvement is marginal);
  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
  • genshi has a treewalker (but not builder); and
  • chardet can be used as a fallback when character encoding cannot be determined.
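To check which of these optional libraries are present in the current environment, a sketch along the following lines may help. It uses only the standard library; the function name `available_optional_dependencies` is hypothetical and not part of html5lib.

```python
import importlib.util

# Import names of the optional libraries listed above.
OPTIONAL = ["datrie", "lxml", "genshi", "chardet"]


def available_optional_dependencies(names=OPTIONAL):
    """Map each import name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


result = available_optional_dependencies()
for name, present in sorted(result.items()):
    print(name, "installed" if present else "missing")
```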

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the pytest and mock libraries and can be run using the py.test command in the root directory.

Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

There's a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.
