html5lib/html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

License

MIT license


html5lib

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib

with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib

document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
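As a rough illustration of what that default return value looks like, the sketch below parses the same fragment and inspects the resulting element. It assumes html5lib's default behaviour of placing HTML elements in the XHTML namespace; the `XHTML` constant and the variable names are illustrative only.

```python
import html5lib

# Namespace prefix that html5lib puts on HTML element tags by default.
XHTML = "{http://www.w3.org/1999/xhtml}"

document = html5lib.parse("<p>Hello World!")

# The parser supplies the <html>, <head> and <body> elements the input left out.
root_tag = document.tag
child_tags = [child.tag for child in document]

# Look the paragraph up anywhere below the root element.
paragraph = document.find(".//" + XHTML + "p")

print(root_tag)
print(child_tags)
print(paragraph.text)
```

Because the tags are namespace-qualified, lookups with the standard ElementTree API need the namespace prefix as shown.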

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib

with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
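The same keyword also selects the minidom treebuilder, which needs no third-party package. The following is a minimal sketch, assuming the treebuilder name "dom" (the name used elsewhere in this README for xml.dom.minidom); the variable names are illustrative.

```python
import html5lib

# "dom" selects the xml.dom.minidom treebuilder from the standard library.
dom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")

# The result is a minidom document, so the usual DOM methods apply.
paragraphs = dom_document.getElementsByTagName("p")
text = paragraphs[0].firstChild.data

print(len(paragraphs), text)
```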

When using html5lib with urllib2 (Python 2), the charset from HTTP should be passed into html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib

with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), the charset from HTTP should be passed into html5lib as follows:

from urllib.request import urlopen
import html5lib

with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
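On Python 3, the object returned by `f.info()` derives from `email.message.Message`, so `get_content_charset()` simply reads the charset parameter of the Content-Type header. The sketch below shows the values it produces without touching the network; the header values are made up for illustration.

```python
from email.message import Message

# Stand-in for the headers that f.info() would return for an HTTP response.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# get_content_charset() returns the charset parameter in lower case.
charset = headers.get_content_charset()

# When the server sends no charset parameter, the result is None.
bare_headers = Message()
bare_headers["Content-Type"] = "text/html"
missing = bare_headers.get_content_charset()

print(charset, missing)
```

A result of None means the HTTP layer gave no hint about the encoding.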

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib

with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
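To make the difference concrete, here is a small sketch contrasting a default parser with a strict one on the same input. It assumes the exception class is `ParseError` from `html5lib.html5parser` and that a non-strict parser records problems in its `errors` attribute; the variable names are illustrative.

```python
import html5lib
from html5lib.html5parser import ParseError

# No doctype, which the HTML specification treats as a parse error.
markup = "<p>Hello World!"

# A default parser recovers and records the problems it encountered.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
error_count = len(lenient.errors)

# A strict parser raises on the first parse error instead.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(markup)
    raised = False
except ParseError:
    raised = True

print(error_count, raised)
```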

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

More documentation is available at http://html5lib.readthedocs.org/.

Installation

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:

$ pip install html5lib

Optional Dependencies

The following third-party libraries may be used for additional functionality:

datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
genshi has a treewalker (but not a builder);
charade can be used as a fallback when the character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2; and
ordereddict can be used under Python 2.6 (collections.OrderedDict is used instead on later versions) to serialize attributes in alphabetical order.
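To see which of these optional libraries are present in a given environment, something like the following sketch may help. It is not part of html5lib: the helper name `available_optional_modules` is hypothetical, the module names are assumed to match the library names listed above, and it relies on `importlib.util.find_spec`, which exists only on Python 3.4 and later.

```python
import importlib.util

# Module names assumed to match the library names in the list above.
OPTIONAL_MODULES = ["datrie", "lxml", "genshi", "charade", "chardet"]


def available_optional_modules(names):
    """Return a dict mapping each module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = available_optional_modules(OPTIONAL_MODULES)
for name, present in sorted(status.items()):
    print(name, "available" if present else "missing")
```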

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.

Test data are contained in a separate html5lib-tests repository and included as a submodule, so for git checkouts the submodule must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

There's a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.
