lxml doesn’t like control characters #96

Open
Assignee: gsnedders
Opened by @SimonSapin

Description

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
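For reference, the affected code points can be enumerated with a few lines of Python. This is only an illustrative sketch: `C0_NON_WHITESPACE` and `find_c0_controls` are hypothetical names invented here, not part of html5lib or lxml.

```python
# Code points listed in this issue: the C0 controls other than
# U+0009 (tab), U+000A (LF) and U+000D (CR). U+0000 is not covered here.
C0_NON_WHITESPACE = (
    [chr(c) for c in range(0x01, 0x09)]      # U+0001 to U+0008
    + ["\x0b", "\x0c"]                       # U+000B, U+000C
    + [chr(c) for c in range(0x0E, 0x20)]    # U+000E to U+001F
)


def find_c0_controls(text):
    """Return (index, 'U+XXXX') pairs for each affected character in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if ch in C0_NON_WHITESPACE
    ]


if __name__ == "__main__":
    print(len(C0_NON_WHITESPACE))
    print(find_c0_controls("<p>\x01 a\tb\x0c"))
```

A helper like this can be used to check whether a given input contains any of the characters before handing it to the lxml treebuilder.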

Each of these triggers the exception below:

html5lib.parse('<p>&#1;', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p>', treebuilder='lxml')
html5lib.parse('<p>', treebuilder='lxml')

Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

DataLossWarning: Text cannot contain U+000C

libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
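To make the two options concrete, here is a rough sketch of a text filter that either drops the characters (the libxml2-style behaviour described above) or substitutes U+0020, warning in both cases. It is not html5lib's actual code: `sanitize_text`, the local `DataLossWarning` stand-in, and the warning message are all assumptions made for illustration.

```python
import re
import warnings


class DataLossWarning(UserWarning):
    """Local stand-in for illustration; not imported from html5lib."""


# The C0 controls listed in this issue (tab, LF and CR are left alone).
_C0_RE = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_text(data, replacement=""):
    """Return data with the listed control characters replaced.

    replacement="" drops them; replacement=" " substitutes U+0020.
    A DataLossWarning is emitted if anything was changed.
    """
    cleaned, count = _C0_RE.subn(replacement, data)
    if count:
        warnings.warn(
            "Text cannot contain C0 control characters (%d replaced)" % count,
            DataLossWarning,
        )
    return cleaned


if __name__ == "__main__":
    print(repr(sanitize_text("<p>\x01text")))        # characters dropped
    print(repr(sanitize_text("a\x0cb", " ")))        # replaced by a space
```

Applying the same filter to attribute values as well as text would cover the U+000C case in attributes mentioned above.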
