lxml doesn’t like control characters #96

Open

Description

SimonSapin opened on Jul 22, 2013

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
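For anyone writing a regression test, the affected code points can be enumerated directly from the ranges listed above. This is only a sketch: the name `C0_NON_WHITESPACE` is made up for illustration and is not part of html5lib or lxml.

```python
# Code points named in this issue: U+0001 to U+0008, U+000B, U+000C,
# U+000E to U+001F. Tab (U+0009), LF (U+000A) and CR (U+000D) are
# whitespace and are deliberately left out.
# C0_NON_WHITESPACE is a hypothetical name, not an html5lib constant.
C0_NON_WHITESPACE = (
    list(range(0x01, 0x09))      # U+0001 to U+0008
    + [0x0B, 0x0C]               # U+000B, U+000C
    + list(range(0x0E, 0x20))    # U+000E to U+001F
)

# One test input per affected character, e.g. '<p>\x01'.
TEST_DOCUMENTS = ["<p>" + chr(cp) for cp in C0_NON_WHITESPACE]
```

Each entry of `TEST_DOCUMENTS` could then be passed to `html5lib.parse(..., treebuilder='lxml')` to check whether it still raises.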

Each of these triggers the exception below:

html5lib.parse('<p>&#1;', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p>', treebuilder='lxml')
html5lib.parse('<p>', treebuilder='lxml')

Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

DataLossWarning: Text cannot contain U+000C

libxml2’s HTML parser simply drops these characters, which I slightly prefer. Either way, the same handling should probably apply to every character that lxml doesn’t like.
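A minimal sketch of the "replace with nothing" behaviour, assuming the text is filtered before it is assigned to an lxml element. The helper name `strip_c0_controls` is hypothetical and not part of html5lib, and the pattern covers only the characters listed in this issue, not everything lxml might reject.

```python
import re

# Matches only the characters listed in this issue:
# U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
# Tab, LF and CR are left alone.
_C0_CONTROLS = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f]")


def strip_c0_controls(text):
    """Return text with the listed control characters removed.

    Hypothetical helper for illustration; a real fix would presumably
    also emit a DataLossWarning, as is already done for U+000C.
    """
    return _C0_CONTROLS.sub("", text)
```

For example, `strip_c0_controls('<p>\x01')` returns `'<p>'`, which lxml accepts as element text.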

Assignees: gsnedders

Labels: bug, parser
