Description
In `html5lib/inputstream.py`, `unicode_literals` is imported from `__future__`. This causes `html5lib.inputstream.BufferedStream` to misbehave, specifically the `_readFromBuffer` method, which ends with `return "".join(rv)`. Because `""` is a unicode literal there, every read after the first returns a chunk of unicode instead of a chunk of bytes.
An example of the problem caused:
    from urllib2 import Request, urlopen
    from html5lib.inputstream import HTMLBinaryInputStream

    req = Request(url='http://example.org/')
    source = urlopen(req)
    HTMLBinaryInputStream(source)
Causing:
    Traceback (most recent call last):
      File "<stdin>", line 6, in <module>
      File ".../html5lib/inputstream.py", line 411, in __init__
        self.charEncoding = self.detectEncoding(parseMeta, chardet)
      File ".../html5lib/inputstream.py", line 448, in detectEncoding
        encoding = self.detectEncodingMeta()
      File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
        assert isinstance(buffer, bytes)
    AssertionError
(That is, when `HTMLBinaryInputStream` is used with a file-like object, such as the result of `urllib2.urlopen`, it wraps it in a `BufferedStream`, which then fails at line 535 on `assert isinstance(buffer, bytes)`.)
This can be fixed by using a byte literal in `_readFromBuffer` instead, i.e. `return b"".join(rv)`. (There are at least three places in `inputstream.py` where string literals are used like this: at lines 117, 318 and 348.)
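To illustrate the idea, here is a minimal sketch of a buffer read that joins its pieces with a byte literal. This is not html5lib's actual code: `read_from_buffer` and its `chunks`, `start` and `size` parameters are hypothetical names, and the only point is that the joined result stays `bytes`.

```python
def read_from_buffer(chunks, start, size):
    """Return up to `size` bytes from a list of byte chunks, starting at offset `start`.

    Hypothetical helper for illustration only; not html5lib's implementation.
    """
    rv = []
    remaining = size
    pos = 0  # offset of the current chunk within the concatenated buffer
    for chunk in chunks:
        if remaining <= 0:
            break
        end = pos + len(chunk)
        if end > start:
            begin = max(start - pos, 0)
            piece = chunk[begin:begin + remaining]
            rv.append(piece)
            remaining -= len(piece)
        pos = end
    # b"" keeps the result as bytes. Under Python 2 with unicode_literals,
    # "".join(rv) would silently return unicode for ASCII input; under
    # Python 3, "".join(rv) on bytes chunks raises TypeError.
    return b"".join(rv)
```

With something like this, a read that spans chunk boundaries still satisfies `isinstance(result, bytes)`, which is the condition the assertion in `detectEncodingMeta` checks.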