Description
http://code.google.com/p/html5lib/issues/detail?id=211
Reported by joseph@metaoptimize.com, Aug 24, 2012
So I know this is not well-formed HTML, but it occurred in the wild as the output from Markdown.
I am using the latest html5lib release from PyPI (version 0.95-dev).
If I try to parse the following HTML, my program goes into an infinite loop and memory usage grows without bound:
u"<p>So theres no shortage of info out there on rounded corners and I've been through much of it and I'm posting to get the communities opinons at this piont.</p>\n<p>My scenario is that we're developing a rounded corner dependant design, mainly used for interactions (<button> and <a>). We are going to use border radius for the good browsers on the block that play nice with it and then use the server to send down javscript to browsers that don't</p>\n<p>What I'm wondering is what to use to up scale the browsers that ignore border radius CSS? I need something that works on button aswell as a, div etc. I've been looking at the following and have found that some don't play nice with <button>. Also the site already uses jQuery.</p>\n<p>http://www.curvycorners.net/ - http://code.google.com/p/jquerycurvycorners/</p>\n<p>http://www.html.it/articoli/niftycube/index.html</p>\n<p>http://www.malsup.com/jquery/corner/</p>"
Aug 24, 2012 waylan
I can't comment on the infinite loop, but as the maintainer of the Markdown library, I was concerned regarding the original reporter's implication that Markdown may be producing invalid HTML. While only the output is provided, not the input, it appears to me that the invalid output is a result of invalid input. You should be wrapping those random angle-bracket tags in code tags. So "(`<button>` and `<a>`)" (note the backticks surrounding each tag) would be output by Markdown as "(<code>&lt;button&gt;</code> and <code>&lt;a&gt;</code>)", which is valid HTML and will not result in an infinite loop in html5lib.
If the Markdown input is coming from an untrusted third party, then you absolutely should be sanitizing it before passing it on to anything else.
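To illustrate the point about code spans, here is a minimal sketch using only Python's standard library. It is not Markdown's actual implementation; `as_code_span` is a hypothetical helper that mimics what a backtick-wrapped tag roughly turns into: the angle brackets are escaped and the text is wrapped in `<code>`, so an HTML parser never sees a stray `<button>` or `<a>` start tag.

```python
import html


def as_code_span(tag_text):
    """Hypothetical helper: escape literal markup and wrap it in <code>.

    Approximates the effect of writing `<button>` (with backticks) in
    Markdown input; it is an illustration, not the library's code.
    """
    # quote=False escapes only &, < and >, which is enough for text content.
    return "<code>" + html.escape(tag_text, quote=False) + "</code>"


# The fragment from the report, with both tags treated as code spans.
rendered = "<p>(%s and %s)</p>" % (as_code_span("<button>"), as_code_span("<a>"))
print(rendered)
```

With the tags escaped this way, the paragraph contains only text and balanced `<code>` elements, so there are no unclosed `<button>` or `<a>` elements left for the parser to recover from.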
That said, one such way to sanitize (my recommendation) is to use the Bleach library, which uses html5lib internally. So I guess we're back to that infinite loop.
Aug 24, 2012 joseph@metaoptimize.com
The Markdown comes from the wild and is probably invalid.
My idea was to pass the HTML through tidy before running an HTML parser, thus avoiding an infinite loop. There are several tidy wrappers in Python. I used pytidylib.
I didn't play with the options to make tidy more strict, and even after tidy, html5lib still goes into an infinite loop. So my current workaround is to use tidy followed by lxml :\