another thread on Python threading

Sun Jun 3 17:32:54 EDT 2007

Hi,

I've recently been working on an application[1] which does quite a bit
of searching through large data structures and string matching, and I
was thinking that it would help to put some of this CPU-intensive work
in another thread, but of course this won't help because Python's GIL
lets only one thread execute Python bytecode at a time.

There's a lot of past discussion on this, and I want to bring it up
again because with the work on Python 3000, I think it is worth trying
to take a look at what can be done to address portions of the problem
through language changes.

Also, the recent hardware trend towards multicore processors is
another reason I think it is worth taking a look at the problem again.

= dynamic objects, locking and __slots__ =

I remember reading (though I can't find it now) that one person's
attempt at true multithreaded programming involved adding a mutex to
all object access.  The obvious question, though, is: why don't other
truly multithreaded languages like Java need to lock an object when
making changes?  The answer is that they don't support adding random
attributes to objects; in other words, they default to the equivalent
of __slots__.

== Why hasn't __slots__ been successful? ==

I very rarely see Python code use __slots__.  I think there are
several reasons for this.  The first is that a lot of programs don't
need to optimize on this level.  The second is that it's annoying to
use, because it means you have to type your member variables *another*
time (in addition to __init__ for example), which feels very un-
Pythonic.
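
To make the duplication concrete, here is a small sketch of what using
__slots__ looks like today.  The class name Point and its attributes
are made up for illustration; the point is only that every attribute
name appears once in __slots__ and again in __init__.

```python
class Point(object):
    # Every attribute has to be listed here ...
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        # ... and then named again when it is assigned.
        self.x = x
        self.y = y


p = Point(1, 2)
try:
    p.z = 3  # not listed in __slots__, so there is nowhere to store it
except AttributeError as exc:
    slot_error = exc
```

Instances of such a class have no per-object __dict__, which is where
the fixed layout comes from.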

== Defining object attributes ==

In my Python code, one restriction I try to follow is to set all the
attributes I use for an object in __init__.   You could do this as
class member variables, but often I want to set them in __init__
anyways from constructor arguments, so "defining" them in __init__
means I only type them once, not twice.

One random idea for Python 3000 is to make the equivalent of
__slots__ the default, *but* instead of requiring a separate
declaration, gather the set of attributes from all member variables
set in __init__.  For example, if I write:

import time

class Foo(object):
  def __init__(self, bar=None):
    self.__baz = 20
    if bar:
      self.__bar = bar
    else:
      self.__bar = time.time()

f = Foo()
# Under this proposal the next line would be an error: you can't add
# random attributes not defined in __init__.
f.otherattr = 40

I would argue that the current Python default of supporting adding
random attributes is almost never what you really want.  If you *do*
want to set random attributes, you almost certainly want to be using a
dictionary or a subclass of one, not an object.  What's nice about the
current Python is that you don't need to redundantly type things, and
we should preserve that while still allowing more efficient
implementation strategies.
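
The behaviour described above can be roughly approximated in today's
Python, which may help in judging how it would feel in practice.  The
sketch below assumes a hypothetical class decorator, fixed_attributes,
that seals an instance once __init__ returns; the decorator name and
the _sealed flag are my own inventions, not existing Python features.
It is only a runtime check: the instance still has a __dict__, so it
gives none of the memory or locking benefits of real __slots__.

```python
import functools
import time


def fixed_attributes(cls):
    """Hypothetical decorator: forbid new attributes once __init__ returns."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def __init__(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Bypass our own __setattr__ to record that construction is finished.
        object.__setattr__(self, "_sealed", True)

    def __setattr__(self, name, value):
        # After sealing, only attributes that already exist may be rebound.
        if getattr(self, "_sealed", False) and name not in self.__dict__:
            raise AttributeError(
                "%s has no attribute %r defined in __init__"
                % (type(self).__name__, name)
            )
        object.__setattr__(self, name, value)

    cls.__init__ = __init__
    cls.__setattr__ = __setattr__
    return cls


@fixed_attributes
class Foo(object):
    def __init__(self, bar=None):
        self.baz = 20
        self.bar = bar if bar else time.time()


f = Foo()
f.baz = 21  # fine: baz was set in __init__
try:
    f.otherattr = 40  # rejected: never assigned in __init__
except AttributeError as exc:
    error = exc
```

One known limitation of this sketch: a subclass that calls the parent
__init__ and then sets more attributes would be sealed too early, so a
real language-level version would need to handle inheritance.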

= Limited threading =

Now, I realize there are a ton of other things the GIL protects other
than object dictionaries; with true threading you would have to touch
the importer, the garbage collector, verify all the C extension
modules, etc.  Obviously non-trivial.  What if, as an initial push
towards real threading, Python had support for "restricted threads"?
Essentially, restricted threads would be limited to a subset of the
standard library that had been verified for thread safety, would not
be able to import new modules, etc.

Something like this:

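import queue
import threading

def datasearcher(items, results):
  for item in items:
    if item.startswith('foo'):
      results.put(item)
  results.done()  # made-up API: signals that no more items are coming

def print_item(item):
  print(item)

vals = ['foo', 'bar']
results = queue.Queue()
# made-up API: run datasearcher in a true (GIL-free) restricted thread
threading.start_restricted_thread(datasearcher, vals, results)
# made-up API: call print_item for each item placed on the queue
results.set_callback(print_item)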

I know I'm making up some API above, but the point is that
"datasearcher" could pretty easily run in a true thread and touch very
little of the interpreter; only support for atomic reference counting
and a concurrent garbage collector would be needed.
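
For comparison, here is a sketch of the same producer/consumer shape
using only the threading and queue modules that exist today.  The
_DONE sentinel is my stand-in for the made-up done() call, and the
consumer loop stands in for the made-up callback.  This version still
runs under the GIL, so it shows only the communication pattern, not
the parallel speedup a restricted thread is meant to provide.

```python
import queue
import threading

_DONE = object()  # sentinel standing in for the hypothetical done() call


def datasearcher(items, results):
    # Worker: push every matching item, then signal completion.
    for item in items:
        if item.startswith("foo"):
            results.put(item)
    results.put(_DONE)


vals = ["foo", "bar", "foobar"]
results = queue.Queue()
worker = threading.Thread(target=datasearcher, args=(vals, results))
worker.start()

# Consumer: drain the queue until the worker says it is finished.
found = []
while True:
    item = results.get()
    if item is _DONE:
        break
    found.append(item)
worker.join()
```

The worker only iterates a list, calls a string method, and puts
results on a queue, which is the small interpreter surface the proposal
relies on.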

Thoughts?

[1] http://submind.verbum.org/hotwire/wiki