html5lib not thread safe. Is the Python SAX library thread-safe?

Cameron Simpson cs at zip.com.au
Sun Mar 11 17:45:01 EDT 2012


On 11Mar2012 13:30, John Nagle <nagle at animats.com> wrote:
|     "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
| 
|     But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
| 
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated.  I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    "Beautiful Soup 4 uses html.parser by default, but you can plug in
    lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

  @locked_property
  def foo(self):
    compute foo here ...
    return foo value

and am rolling its use out amongst my classes. Code:

  def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    ''' A property whose access is controlled by a lock if unset.
    '''
    if prop_name is None:
      prop_name = '_' + func.__name__
    def getprop(self):
      ''' Attempt lockless fetch of property first.
          Use lock if property is unset.
      '''
      p = getattr(self, prop_name)
      if p is unset_object:
        with getattr(self, lock_name):
          p = getattr(self, prop_name)
          if p is unset_object:
            p = func(self)
            setattr(self, prop_name, p)
      return p
    return property(getprop)

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (eg Jython, I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)". Any remarks?
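
For comparison, the blunt alternative to the lockless first fetch is to
take the lock on every access. A minimal sketch, assuming the object
carries a threading.Lock in "_lock"; the names always_locked_property,
Page and demo are made up for illustration, and the stdlib has no shared
"read" lock, so this uses a plain mutex instead:

```python
import threading

def always_locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
  ''' Hypothetical variant of locked_property: every access takes the lock,
      so no fetch can overlap with the assignment.
  '''
  if prop_name is None:
    prop_name = '_' + func.__name__
  def getprop(self):
    with getattr(self, lock_name):
      # A missing attribute is treated the same as an unset one.
      p = getattr(self, prop_name, unset_object)
      if p is unset_object:
        p = func(self)
        setattr(self, prop_name, p)
    return p
  return property(getprop)

class Page(object):
  ''' Illustrative class; not part of any real library. '''
  def __init__(self):
    self._lock = threading.Lock()
    self._foo = None
    self.calls = 0   # counts how often foo is actually computed

  @always_locked_property
  def foo(self):
    self.calls += 1  # runs while the lock is held
    return 42

def demo(nthreads=8):
  ''' Fetch page.foo from several threads; return (compute count, results). '''
  page = Page()
  results = []
  def worker():
    results.append(page.foo)
  threads = [threading.Thread(target=worker) for _ in range(nthreads)]
  for t in threads:
    t.start()
  for t in threads:
    t.join()
  return page.calls, results
```

The cost is one lock acquisition per read, which is what the lockless
version is trying to avoid. Note the lock here is not reentrant, so the
decorated function must not read the same property again.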

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's <ed at Tekelex.Com> pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
   put at least half of it back.


