Standards-compliant library for parsing and serializing HTML documents and fragments in Python
awesome-python/html5lib-python
html5lib is a pure-python library for parsing HTML. It is designed to conform to the HTML specification, as is implemented by all major web browsers.
Python 2.6 and above as well as Python 3.0 and above are supported. Implementations known to work are CPython (as the reference implementation) and PyPy. Jython is known not to work due to various bugs in its implementation of the language. Others such as IronPython may or may not work; if you wish to try, you are strongly encouraged to run the testsuite and report back!
The only required library dependency is six, which is packaged on PyPI.
Optionally:
- datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
- lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
- genshi has a treewalker (but not builder); and
- chardet can be used as a fallback when character encoding cannot be determined (note that currently this is only packaged on PyPI for Python 2, though several package managers include unofficial ports to Python 3).
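Because lxml is optional, code that prefers it may want a fallback. The following is a minimal sketch, assuming html5lib is installed and that the `treebuilder` keyword of `html5lib.parse` accepts the names "lxml" and "etree"; the variable name `builder` is only illustrative.

```python
import html5lib

# Prefer the optional lxml tree format when it is importable; otherwise
# fall back to the standard-library ElementTree builder.
try:
    import lxml.etree  # noqa: F401 (imported only to test availability)
    builder = "lxml"
except ImportError:
    builder = "etree"

document = html5lib.parse("<p>Hello World!", treebuilder=builder)
print("parsed with the %s tree builder" % builder)
```

On PyPy, where lxml is reported above to cause segfaults, it is probably safer to pass "etree" explicitly rather than rely on the import check.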
html5lib is packaged with distutils. To install it use:
$ python setup.py install
Simple usage follows this pattern:
import html5lib
with open("mydocument.html", "r") as fp:
    document = html5lib.parse(fp)
or:
import html5lib
document = html5lib.parse("<p>Hello World!")
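To see what the parser returns, the result can be inspected with the usual ElementTree methods. This is a sketch that assumes the default "etree" tree builder; it passes `namespaceHTMLElements=False` only so that tag names appear without the XHTML namespace prefix.

```python
import html5lib

# The input has no <html>, <head> or <body>; the parser adds them,
# as a browser would, and returns the root <html> element.
document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)

# Look up the paragraph by path and read its text content.
paragraph = document.find("body/p")
print(document.tag)
print(paragraph.text)
```

With the default setting (`namespaceHTMLElements=True`), element tags carry the XHTML namespace, so path lookups need the namespaced tag names instead.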
More documentation is available in the docstrings.
Please report any bugs on the issue tracker.
The tests are contained in the html5lib-tests repository and included as a submodule, so for git checkouts the submodule must be initialized (for release tarballs this is unneeded):
$ git submodule init
$ git submodule update
And then they can be run, with nose installed, using the nosetests command in the root directory. All should pass.
Pull requests are more than welcome, both to the library and to the documentation. Some useful information:
- We aim to follow PEP 8 in the library, but ignore the 79-character-per-line limit, instead following a soft limit of 99, while allowing longer lines where that is the more readable choice.
- We keep pyflakes reporting no errors or warnings at all times.
- We keep the master branch passing all tests at all times on all supported versions.
Travis CI is run against all pull requests and should enforce all of the above.
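For a quick local check of the soft line-length limit before opening a pull request, something like the following sketch could be used. It is not part of html5lib or its CI setup; the function name `long_lines` and the constant `SOFT_LIMIT` are hypothetical, and it only reports lines without failing on them, since the limit is soft.

```python
SOFT_LIMIT = 99  # soft limit taken from the contribution guidelines above


def long_lines(lines, limit=SOFT_LIMIT):
    """Return (line_number, length) pairs for lines longer than `limit`."""
    report = []
    for number, line in enumerate(lines, start=1):
        # Do not count the trailing newline towards the line length.
        length = len(line.rstrip("\r\n"))
        if length > limit:
            report.append((number, length))
    return report


# Example: the second line is 100 characters long and is reported.
sample = ["x = 1\n", "y" * 100 + "\n"]
for number, length in long_lines(sample):
    print("line %d: %d characters" % (number, length))
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.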
We also use an external code-review tool, which uses your GitHub login to authenticate. You'll get emails for changes on the review.
There's a mailing list available for support on Google Groups, html5lib-discuss, though you may have more success (and get a far quicker response) asking on IRC in #whatwg on irc.freenode.net.