Description
HTML 5 Proposed Recommendation §8.2.2 The input byte stream, HTML 5.1 Draft §8.2.2 The input byte stream:
Note: Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]
Test case:
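```python
import io
import unittest

import html5lib


class TestInvalidSequences(unittest.TestCase):
    def test_invalid_sequences(self):
        parser = html5lib.HTMLParser()
        # 0xA0 is not a valid ASCII byte, so the parser should report an error.
        # The bytes literal keeps the input valid for io.BytesIO on Python 3 too.
        doc = parser.parse(io.BytesIO(b'<!DOCTYPE html>\xA0'), encoding='ascii')
        self.assertTrue(parser.errors)
```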
Expected behavior: `parser.errors` is not empty.

Observed behavior: `parser.errors` is empty; `doc` contains a tree which contains the `\uFFFD` replacement character in place of the invalid byte.
Cause: In `HTMLBinaryInputStream.reset`, the codec is constructed with the option `'replace'`; the `HTMLUnicodeInputStream` only reports errors for Unicode code points which were successfully decoded but are either non-characters or surrogates.
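To illustrate the cause outside of html5lib, here is a minimal sketch using only the standard `codecs` module. It shows that the built-in `'replace'` handler substitutes U+FFFD without leaving any trace, whereas a custom error handler can make the same substitution and also record where it happened. The names `record_and_replace` and `decode_errors` are hypothetical and are not part of html5lib; this is one possible direction, not a proposed patch.

```python
import codecs

# Hypothetical log of (start, end) byte offsets that failed to decode.
decode_errors = []


def record_and_replace(exc):
    """Behave like 'replace' for decoding, but remember where it happened."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    decode_errors.append((exc.start, exc.end))
    # Return the replacement text and the position at which to resume decoding.
    return (u'\ufffd', exc.end)


# Register the handler under a hypothetical name so codecs can look it up.
codecs.register_error('record_and_replace', record_and_replace)

data = b'<!DOCTYPE html>\xA0'

# The built-in handler substitutes U+FFFD and reports nothing.
silent = data.decode('ascii', 'replace')

# The custom handler yields the same text but records the invalid byte.
reported = data.decode('ascii', 'record_and_replace')
```

After this runs, `silent` and `reported` hold the same string, but only the second decode leaves an entry in `decode_errors` that a parser could surface as a parse error.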