Description
http://code.google.com/p/html5lib/issues/detail?id=189
Reported by devin.bayer, Jul 25, 2011
Hi. I realize this is by design, but it's not intuitive that HTMLParser is not threadsafe, since similar standard classes like YamlDecoder and JSONDecoder are.
It would be clearer if the input stream were supplied to the constructor, as with ElementTree.
But at least, please document this in the class.
Sep 16, 2011 geoffers
Is there any reason to document it? This is the case with all Python code in CPython (other implementations may differ), so the cases where things are threadsafe are the notable exceptions.
Sep 16, 2011 devin.bayer
(Most?) Everything in the Python standard library is threadsafe, and most extensions are too. I think you are referring to the GIL, which is different. That prevents parallel execution, but if one thread is blocking, the others can run safely.
The problem with the design of HTMLParser is that two threads can interfere with each other, even if they are not running at the same time.
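One common workaround for this kind of interference is to give each thread its own parser instance. The following is only a minimal sketch of that idea: `StatefulParser` is a hypothetical stand-in for any parser that keeps per-parse state on the instance (it is not html5lib's class), and `get_parser` and `parse_in_threads` are illustrative names.

```python
import threading


class StatefulParser:
    """Hypothetical stand-in for a parser that stores parse state on itself."""

    def __init__(self):
        self.tokens = []

    def parse(self, text):
        # State lives on the instance, so sharing one instance across
        # threads would let one parse overwrite another's tokens.
        self.tokens = []
        for word in text.split():
            self.tokens.append(word)
        return list(self.tokens)


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_parser():
    """Return a parser private to the calling thread, creating it on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = _local.parser = StatefulParser()
    return parser


def parse_in_threads(documents):
    """Parse each document in its own thread using that thread's own parser."""
    results = [None] * len(documents)

    def worker(index, text):
        results[index] = get_parser().parse(text)

    threads = [
        threading.Thread(target=worker, args=(i, doc))
        for i, doc in enumerate(documents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

This only isolates state held on the instance. It does not help if the library also keeps module-level globals, which is the concern raised later in this thread.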
Mar 11, 2012 nagle@animats.com
This is clearly a defect. This is an object-oriented library in an object-oriented language. Two parsers should be completely independent of each other, with no shared global variables, and thus thread-safe. If that's not the case, this is a defect.
Do I have to scrap my plans to convert a parallel web crawler from BeautifulSoup to html5lib?
This looks fixable. The trouble spots include at least these global variables:
dom.py: moduleCache
That could be easily fixed with a lock in getDomModule. That's a once-per-parse event, so there's no performance issue. All that's needed is
    import threading
    ...
    Lok = threading.Lock()
    with Lok:
        ... critical section ...
etree.py: moduleCache
Same issue.
etree.lxml: fullTree
This seems to be set only once, at load time. Is it changed elsewhere?
What have I missed? Some lower level library? Is Python's SAX parser unsafe?
This can and should be fixed.
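The lock-around-the-cache idea proposed above for moduleCache could be sketched roughly as follows. This is an illustration under assumptions, not html5lib's actual code: `_module_cache`, `_module_cache_lock`, `get_cached_module`, and the `factory` callback are hypothetical names, and the real getDomModule may be structured differently.

```python
import threading

# Hypothetical module-level cache, analogous in spirit to moduleCache.
_module_cache = {}
_module_cache_lock = threading.Lock()


def get_cached_module(key, factory):
    """Return the cached object for key, building it at most once.

    The lock makes the check-then-insert sequence atomic, so two threads
    asking for the same key cannot both run the factory.
    """
    with _module_cache_lock:
        if key not in _module_cache:
            _module_cache[key] = factory(key)
        return _module_cache[key]
```

Because the lookup happens once per parse, holding the lock for the whole check-and-build step should cost little. The trade-off is that a slow factory blocks other threads that want a different key.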