DESCRIPTION
Consider the following interaction with html5lib 0.90:
    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""")
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'
This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.
ANALYSIS
The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplication of code, both sanitizers inherit from the class HTMLSanitizerMixin and both call that class's sanitize_token method.
Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:
    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}
But during filtering, tokens look like this:
    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}
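When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens: sanitize_token expects 'type' to be an integer code (3 for a start tag, as in the tokenizer output above), so the string 'StartTag' never matches and the token passes through unchanged.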
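The failure mode can be reproduced without html5lib. The following is only a minimal sketch, not the library's code: the simplified sanitize_token function, the START_TAG constant and the allowed_attributes whitelist are hypothetical stand-ins, and the two token dictionaries are trimmed copies of the ones shown above.

```python
# Illustrative stand-in only; this is NOT html5lib's implementation.
START_TAG = 3  # integer type code seen in the tokenizer output above


def sanitize_token(token, allowed_attributes=("href", "title")):
    """Hypothetical sanitizer that only recognises integer token types."""
    if token["type"] == START_TAG:
        token = dict(token)
        # Keep only whitelisted attributes.
        token["data"] = [(name, value) for name, value in token["data"]
                         if name in allowed_attributes]
    # Any other 'type' value, including the string 'StartTag',
    # falls through and is returned untouched.
    return token


# Token as produced during tokenization (integer type).
tokenizer_token = {"type": 3, "name": "body",
                   "data": [["onload", "sucker()"]]}
# Token as produced by a tree walker (string type).
filter_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}

print(sanitize_token(tokenizer_token)["data"])  # []
print(sanitize_token(filter_token)["data"])     # [('onload', 'sucker()')]
```

The second call silently returns the token with onload intact, which matches the unsanitized serializer output in the description.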
OBSERVATION
Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?
WORKAROUND
I am working around this problem as follows: when I need to apply a sanitizing filter to a DOM tree, instead I do the following:
1. Serialize the DOM to HTML without sanitization.
2. Re-parse the HTML from step 1, using the sanitizing tokenizer.

Sanitizing filter broken in 0.90 #72
