html5lib not thread safe. Is the Python SAX library thread-safe?

John Nagle nagle at animats.com
Mon Mar 12 12:07:56 EDT 2012


On 3/12/2012 3:05 AM, Stefan Behnel wrote:
> John Nagle, 11.03.2012 21:30:
>>     "html5lib" is apparently not thread safe.
>> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
>> Looking at the code, I've only found about three problems.
>> They're all the usual "cached in a global without locking" bug.
>> A few locks would fix that.
>>
>>     But html5lib calls the XML SAX parser. Is that thread-safe?
>> Or is there more trouble down at the bottom?
>>
>> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
>> which is thread safe, although dated.  I'm looking at converting to
>> html5lib.)
>
> You may also consider moving to lxml. BeautifulSoup supports it as a parser
> backend these days, so you wouldn't even have to rewrite your code to use
> it. And performance-wise, well ...
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Stefan

    I want to move to html5lib because it handles HTML errors as
specified by the HTML5 spec, which is what all newer browsers do.
The HTML5 spec actually specifies, in great detail, how to parse
common errors in HTML.  It's amusing seeing that formalized.
Malformed comments ( <!- instead of <!-- ) are now handled in
a standard way, for example.  So I'm trying to get html5parser
fixed for thread safety.

                                    John Nagle
				


