Various html5lib improvements #119

Open

Labels

enhancement, parser

Description

kovidgoyal opened on Nov 2, 2013

This is just an FYI: I have been working on a modified version of html5lib that achieves the following goals:

Preserves attribute order
Optionally includes line and column number information when parsing
Handles XML namespaces correctly, so that if you happen to parse an XHTML document with html5lib you don't lose all the namespace information
Adds a new treebuilder for lxml
Includes various performance improvements
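
To make the first two goals concrete, here is a minimal sketch of what "attribute order" and "line and column information" mean for a parsed element. It uses the standard library's html.parser, not html5lib and not the modified code linked in this issue; the PositionRecorder class and record_positions helper are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper: records each start tag with its attribute
    names in source order and the position where the tag begins."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order.
        # getpos() returns (line, offset): line is 1-based, offset 0-based.
        line, column = self.getpos()
        names = [name for name, _value in attrs]
        self.elements.append((tag, names, line, column))


def record_positions(markup):
    """Parse markup and return the recorded (tag, names, line, column) tuples."""
    parser = PositionRecorder()
    parser.feed(markup)
    parser.close()
    return parser.elements


sample = '<p id="a" class="b">\n  <img src="x.png" alt="x">\n</p>'
elements = record_positions(sample)
for tag, names, line, column in elements:
    print(tag, names, line, column)
```

A tree builder that keeps this information would attach the ordered attribute list and the position to each node, rather than discarding them once the token has been consumed.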

With my new lxml treebuilder, parsing with line numbers and attribute order preservation is as fast as vanilla html5lib with its built-in treebuilder. The speed improvements come mainly from the new lxml builder and an optimized inputstream class for in-memory streams.

I make no claims as to the relevance of my work for html5lib. I am just sharing it with you as a way of giving back. You are welcome to use the patches or not. Feel free to ask if you need any clarification.

The code is in https://github.com/kovidgoyal/calibre/tree/master/src/html5lib (these are the changes to html5lib itself)

and the lxml builder is in:
https://github.com/kovidgoyal/calibre/blob/master/src/calibre/ebooks/oeb/polish/parsing.py

