Standards-compliant library for parsing and serializing HTML documents and fragments in Python

Python-Repository-Hub/html5lib-python

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the HTML specification, as is implemented by all major web browsers.

Requirements

Python 2.6 and above as well as Python 3.0 and above are supported. Implementations known to work are CPython (as the reference implementation) and PyPy. Jython is known not to work due to various bugs in its implementation of the language. Others such as IronPython may or may not work; if you wish to try, you are strongly encouraged to run the test suite and report back!

The only required library dependency is six, which is packaged on PyPI.

Optionally:

  • datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
  • genshi has a treewalker (but not builder); and
  • chardet can be used as a fallback when character encoding cannot be determined (note currently this is only packaged on PyPI for Python 2, though several package managers include unofficial ports to Python 3).
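
To see which of the packages above are present in a given environment, a small check along these lines may help. This is only a sketch, not part of html5lib: the helper name `available_packages` is made up for illustration, and it assumes Python 3.4 or later, where `importlib.util.find_spec` is available.

```python
import importlib.util
import platform

# Optional packages named in the list above; "six" is the one required dependency.
OPTIONAL_PACKAGES = ("datrie", "lxml", "genshi", "chardet")


def available_packages(names):
    """Map each top-level package name to True if it can be imported.

    Hypothetical helper for illustration; html5lib does not provide it.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report the interpreter (for example CPython or PyPy) and each package's status.
    print("Implementation:", platform.python_implementation())
    for name, found in available_packages(("six",) + OPTIONAL_PACKAGES).items():
        print("{:8} {}".format(name, "found" if found else "missing"))
```

The check only asks whether a package can be located; it does not import it, so it stays safe to run on interpreters where one of the packages is known to misbehave.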

Installation

html5lib is packaged with distutils. To install it use:

$ python setup.py install

Usage

Simple usage follows this pattern:

import html5lib
# Binary mode lets html5lib work out the character encoding itself.
with open("mydocument.html", "rb") as fp:
    document = html5lib.parse(fp)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

More documentation is available in the docstrings.
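
As a rough illustration of what the parsed result looks like, the sketch below parses the same string and reads the paragraph back. It assumes the default tree format, in which `html5lib.parse` returns an `xml.etree.ElementTree` element, and it passes `namespaceHTMLElements=False` purely to keep the tag names short.

```python
import html5lib

# namespaceHTMLElements=False keeps tag names plain ("p" rather than
# "{http://www.w3.org/1999/xhtml}p"), which makes ElementTree lookups shorter.
document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)

# The parser supplies the <html>, <head>, and <body> elements the input omitted.
print([child.tag for child in document])

# Standard ElementTree path lookup, relative to the root <html> element.
paragraph = document.find("body/p")
print(paragraph.text)
```

Leaving `namespaceHTMLElements` at its default gives namespaced tag names instead, so any lookups would need to include the XHTML namespace.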

Bugs

Please report any bugs on the issue tracker.

Tests

These are contained in the html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized (for release tarballs this is unneeded):

$ git submodule init
$ git submodule update

And then they can be run, with nose installed, using the nosetests command in the root directory. All should pass.

Contributing

Pull requests are more than welcome, both to the library and to the documentation. Some useful information:

  • We aim to follow PEP 8 in the library, except that we ignore the 79-character-per-line limit and follow a soft limit of 99 instead, allowing longer lines where that is the more readable choice.
  • We keep pyflakes reporting no errors or warnings at all times.
  • We keep the master branch passing all tests at all times on all supported versions.

Travis CI is run against all pull requests and should enforce all of the above.

We also use an external code-review tool, which uses your GitHub login to authenticate. You'll get emails for changes on the review.

Questions?

There's a mailing list available for support on Google Groups, html5lib-discuss, though you may have more success (and get a far quicker response) asking on IRC in #whatwg on irc.freenode.net.
