I'm the lead developer of Beautiful Soup, which has html5lib as an optional dependency. Over the past couple of years I've gotten a number of notifications from Google's oss-fuzz project about unhandled exceptions that actually turned out to be problems in html5lib. There wasn't much I could do with these errors, but now that it looks like html5lib maintenance is picking up, I can pass them on to you. (Sorry. 😿)
I've incorporated the fuzz reports into the Beautiful Soup test suite, and the test cases themselves are here, but here's a general picture of what problems I see. In each case, I believe just parsing the bad markup is enough to trigger the error.
clusterfuzz-testcase-minimized-bs4_fuzzer-4999465949331456
Markup: b')<a><math><TR><a><mI><a><p><a>'
Error:

    self = <html>, node = <p>, refNode = None

        def insertBefore(self, node, refNode):
    >       index = self.element.index(refNode.element)
    E       AttributeError: 'NoneType' object has no attribute 'element'

clusterfuzz-testcase-minimized-bs4_fuzzer-5843991618256896
Markup: b'-<math><sElect><mi><sElect><sElect>'
Error:

        def resetInsertionMode(self):
            ...
                # Check for conditions that should only happen in the innerHTML
                # case
                if nodeName in ("select", "colgroup", "head", "html"):
    >               assert self.innerHTML
    E               AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6241471367348224
Markup: b'ñ<table><svg><html>'
Error:

    self = <html5lib.html5parser.getPhases.<locals>.InTablePhase object at 0x7f8f405ad440>

        def processEOF(self):
            if self.tree.openElements[-1].name != "html":
                self.parser.parseError("eof-in-table")
            else:
    >           assert self.parser.innerHTML
    E           AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6600557255327744
Markup: b'\t<TABLE><<!>;<!><<!>.<lec><th>i><a><mat\x00\x01<mi\x00a><math>><th><mI>chardeta\xff\xff\xff\xff<><th><mI><||||||||A<select><>qu?\xbemath><th><mie>qu'
Error:

    self = <html5lib.html5parser.getPhases.<locals>.InTableBodyPhase object at 0x7f8f4184ce00>

        def clearStackToTableBodyContext(self):
            while self.tree.openElements[-1].name not in ("tbody", "tfoot", "thead", "html"):
                # self.parser.parseError("unexpected-implied-end-tag-in-table",
                #  {"name": self.tree.openElements[-1].name})
                self.tree.openElements.pop()
            if self.tree.openElements[-1].name == "html":
>               assert self.parser.innerHTML
    E           AssertionError

I was also recently sent a report of the problem that has already been filed with you as issue #557.
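
For anyone who wants to check these locally, here is a rough reproduction sketch. It assumes the errors can be triggered by parsing the markup through Beautiful Soup with the `html5lib` tree builder, and it only includes the two markup strings above that are plain ASCII. The helper names `find_crashes` and `parse_with_bs4` are made up for this example and are not part of either library. Whether a given case still fails will depend on the installed versions.

```python
from typing import Callable, Dict

# Markup copied verbatim from two of the reports above, keyed by the
# numeric suffix of the clusterfuzz test case name.
FUZZ_CASES: Dict[str, bytes] = {
    "4999465949331456": b")<a><math><TR><a><mI><a><p><a>",
    "5843991618256896": b"-<math><sElect><mi><sElect><sElect>",
}


def find_crashes(
    parse: Callable[[bytes], object],
    cases: Dict[str, bytes] = FUZZ_CASES,
) -> Dict[str, str]:
    """Run `parse` on each markup; return {case_id: "ExcType: message"} for failures."""
    crashes: Dict[str, str] = {}
    for case_id, markup in cases.items():
        try:
            parse(markup)
        except Exception as exc:  # deliberately broad: any unhandled error counts
            crashes[case_id] = f"{type(exc).__name__}: {exc}"
    return crashes


if __name__ == "__main__":
    try:
        # Assumption: beautifulsoup4 and html5lib are both installed.
        import html5lib  # noqa: F401  (imported only to confirm availability)
        from bs4 import BeautifulSoup
    except ImportError:
        print("beautifulsoup4 and html5lib are required to run the reproduction")
    else:
        def parse_with_bs4(markup: bytes) -> object:
            # Hypothetical helper: parse via Beautiful Soup's html5lib tree builder.
            return BeautifulSoup(markup, "html5lib")

        for case_id, summary in find_crashes(parse_with_bs4).items():
            print(case_id, summary)
```

Passing the parser in as a callable makes it easy to swap in a direct `html5lib.parse` call, to see whether a given failure needs the Beautiful Soup tree builder at all.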