Sanitizing filter broken in 0.90 #72

Closed
@gsnedders

http://code.google.com/p/html5lib/issues/detail?id=162

Reported by gdr@garethrees.org, Oct 10, 2010

DESCRIPTION

Consider the following interaction with html5lib 0.90:

    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""")
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'

This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.

ANALYSIS

The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplication of code, both inherit from the class HTMLSanitizerMixin and both call that class's sanitize_token method.

Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:

    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}

But during filtering, tokens look like this:

    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}

When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens, because sanitize_token expects 'type' to be an integer (as the tokenizer produces), not a string such as 'StartTag'.
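
The failure mode can be shown in isolation. The following is only a sketch, not html5lib's actual code: sanitize_token here is a hypothetical stand-in for the mixin method, ALLOWED_ATTRIBUTES is an illustrative whitelist, and the only detail taken from the sessions above is that the tokenizer reports a start tag as the integer 3 while the tree walker reports it as the string 'StartTag'.

```python
# Hypothetical stand-in for the shared mixin method; NOT html5lib's real code.
START_TAG = 3  # integer type code seen in the tokenizer output above

ALLOWED_ATTRIBUTES = {"href", "title"}  # illustrative whitelist (assumption)


def sanitize_token(token):
    """Drop non-whitelisted attributes, but only for integer-typed start tags."""
    if token["type"] == START_TAG:
        token = dict(token)  # shallow copy so the caller's token is untouched
        token["data"] = [
            (name, value)
            for name, value in token["data"]
            if name in ALLOWED_ATTRIBUTES
        ]
    # Any token whose 'type' is not the integer 3 falls through unmodified.
    return token


# Shapes copied (in simplified form) from the two sessions above.
tokenizer_token = {"type": 3, "name": "body", "data": [["onload", "sucker()"]]}
treewalker_token = {"type": "StartTag", "name": "body", "data": [("onload", "sucker()")]}

print(sanitize_token(tokenizer_token)["data"])   # onload is stripped
print(sanitize_token(treewalker_token)["data"])  # onload survives
```

Because the comparison simply never matches the string-typed token, no error is raised and the unsafe attribute passes through silently.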

OBSERVATION

Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?

WORKAROUND

I am working around this problem as follows. When I need to apply a sanitizing filter to a DOM tree, I instead:

  1. Serialize the DOM to HTML without sanitization.
  2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
