This is just an FYI: I have been working on a modified version of html5lib that achieves the following goals:
- Preserves attribute order
- Optionally includes line and column number information when parsing
- Handles XML namespaces correctly, so that if you happen to parse an XHTML document with html5lib you don't lose all the namespace information
- Adds a new treebuilder for lxml
- Various performance improvements
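
To illustrate the line and column goal, here is a minimal sketch of one way to map a character offset in an in-memory document to a line and column. It is not the code from the patches; `LineIndex` and `position` are hypothetical names chosen for this example, and the zero-based column is an assumption.

```python
from bisect import bisect_right


class LineIndex:
    """Hypothetical helper: maps character offsets to (line, column)."""

    def __init__(self, text):
        # Offsets at which each line starts; line 1 starts at offset 0.
        self.line_starts = [0]
        for pos, ch in enumerate(text):
            if ch == "\n":
                self.line_starts.append(pos + 1)

    def position(self, offset):
        # The number of line starts <= offset is the 1-based line number.
        line = bisect_right(self.line_starts, offset)
        # Column is 0-based (an assumption made for this sketch).
        column = offset - self.line_starts[line - 1]
        return line, column


text = "<html>\n<body>\n  <p id='x'>hi</p>\n</body>\n</html>"
index = LineIndex(text)
offset = text.index("<p")
print(index.position(offset))
```

Building the index once keeps each lookup at O(log n), which matters when positions are requested for many tokens in the same document.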
With my new lxml treebuilder, parsing with line numbers and attribute order preservation is as fast as vanilla html5lib with its builtin treebuilder. The speed improvements come mainly from the new lxml builder and an optimized inputstream class for in-memory streams.
I make no claims as to the relevance of my work for html5lib. I am just sharing it with you as a way of giving back. You are welcome to use the patches or not. Feel free to ask if you need any clarification.
The code is in https://github.com/kovidgoyal/calibre/tree/master/src/html5lib (these are the changes to html5lib itself)
and the lxml builder is in:
https://github.com/kovidgoyal/calibre/blob/master/src/calibre/ebooks/oeb/polish/parsing.py