HTMLParser is not threadsafe #8

Open
@gsnedders

http://code.google.com/p/html5lib/issues/detail?id=189

Reported by devin.bayer, Jul 25, 2011

Hi. I realize this is by design, but it's not intuitive, since similar standard classes like YamlDecoder and JSONDecoder are threadsafe.

It would be clearer if the input stream were supplied to the constructor, as with ElementTree.
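A minimal sketch of the constructor-based design suggested here, to make the idea concrete. `OneShotParser` is a hypothetical class, not part of html5lib, and the whitespace "tokenizer" is only a placeholder for real parsing. Because the stream is bound at construction, each instance handles exactly one input and there is nothing to share between parses.

```python
import io


class OneShotParser:
    """Hypothetical parser bound to a single input stream at construction."""

    def __init__(self, stream):
        self._stream = stream
        self._tokens = []  # per-instance state, never reused for another input

    def parse(self):
        # Placeholder "parsing": split each line on whitespace.
        for line in self._stream:
            self._tokens.extend(line.split())
        return list(self._tokens)


tokens = OneShotParser(io.StringIO("<p>one</p>\n<p>two</p>")).parse()
```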

But at least, please document this in the class.

Sep 16, 2011 geoffers

Is there any reason to document it? This is the case with all Python code in CPython (other implementations may differ), so the cases where things are threadsafe are the notable exceptions.

Sep 16, 2011 devin.bayer

(Most?) Everything in the Python standard library is threadsafe, and most extensions are too. I think you are referring to the GIL, which is different. The GIL prevents parallel execution, but if one thread is blocking, the others can run safely.

The problem with the design of HTMLParser is that two threads can interfere with each other, even if they are not running at the same time.
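The kind of interference described here can be sketched as follows. `SharedStateParser` is a hypothetical toy, not html5lib code; it only assumes that a parser keeps per-parse state on the instance. Two `threading.Event` objects force a fixed schedule in which the first thread blocks mid-parse (as it might on a slow read) while the second thread uses the same parser, so the threads never run at the same time.

```python
import threading


class SharedStateParser:
    """Hypothetical parser that keeps per-parse state on the instance."""

    def parse(self, text, pause=None, resume=None):
        self.tokens = []  # reset at the start of every parse
        for i, word in enumerate(text.split()):
            self.tokens.append(word)
            if i == 0 and pause is not None:
                pause.set()    # signal that this parse is half done
                resume.wait()  # block, as a slow network read might
        return list(self.tokens)


def demo():
    parser = SharedStateParser()  # one instance shared by both threads
    first_paused = threading.Event()
    second_done = threading.Event()
    results = {}

    def first():
        results["first"] = parser.parse("a b", first_paused, second_done)

    def second():
        first_paused.wait()  # run only while the first thread is blocked
        results["second"] = parser.parse("x y")
        second_done.set()

    threads = [threading.Thread(target=first), threading.Thread(target=second)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With this schedule, `demo()` returns `{"first": ["x", "y", "b"], "second": ["x", "y"]}`: the first parse loses its own token and picks up the second parse's tokens, even though the two threads never executed concurrently.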

Mar 11, 2012 nagle@animats.com

This is clearly a defect. This is an object-oriented library in an object-oriented language. Two parsers should be completely independent of each other, with no shared global variables, and thus thread-safe. If that's not the case, this is a defect.

Do I have to scrap my plans to convert a parallel web crawler from BeautifulSoup to html5lib?
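One possible workaround for a crawler, sketched under assumptions: give every worker thread its own parser instance via `threading.local`, so no instance is ever shared. `TinyParser`, `parse_page`, and `crawl` are hypothetical names used for illustration; in real code the factory would presumably construct the library's parser instead. This only isolates per-instance state and does nothing about module-level globals such as the caches listed below.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TinyParser:
    """Hypothetical stand-in for a parser that keeps state on the instance."""

    def parse(self, text):
        self.tokens = []
        for word in text.split():
            self.tokens.append(word)
        return list(self.tokens)


_local = threading.local()


def parser_for_this_thread():
    # Lazily create one parser per thread; threads never share an instance.
    if not hasattr(_local, "parser"):
        _local.parser = TinyParser()
    return _local.parser


def parse_page(text):
    return parser_for_this_thread().parse(text)


def crawl(pages, workers=4):
    # pool.map returns results in the same order as the input pages.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_page, pages))
```

For example, `crawl(["a b", "c d e", "f"])` yields one token list per page, in input order.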

This looks fixable. The trouble spots include at least these global variables:

dom.py: moduleCache

That could be easily fixed with a lock in getDomModule. That's a once-per-parse event, so there's no performance issue. All that's needed is

    import threading
    ...
    Lok = threading.Lock()
    with Lok:
        ... critical section ...
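A slightly fuller sketch of the lock suggested above, applied to a module-level cache. The names `module_cache`, `build_module`, and `get_module` are hypothetical stand-ins; the actual bodies of moduleCache and getDomModule are not shown in this thread. Note that a `threading.Lock` is used directly as a context manager (`with lock:`), without calling it.

```python
import threading

# Hypothetical stand-ins for a module-level cache and its accessor.
module_cache = {}
module_cache_lock = threading.Lock()


def build_module(name):
    """Pretend to do the expensive, once-per-implementation setup."""
    return {"name": name}


def get_module(name):
    # Hold the lock across the check-and-insert so two threads cannot both
    # miss the cache and build separate entries for the same key.
    with module_cache_lock:
        if name not in module_cache:
            module_cache[name] = build_module(name)
        return module_cache[name]
```

Every caller then receives the same cached object for a given name, regardless of which thread populated the cache first.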

etree.py: moduleCache

Same issue.

etree.lxml: fullTree

This seems to be set only once, at load time. Is it changed elsewhere?

What have I missed? Some lower-level library? Is Python's SAX parser unsafe?

This can and should be fixed.
