[Python-ideas] Threading hooks and disable gc per thread

Christian Heimes lists at cheimes.de
Thu May 12 01:58:05 CEST 2011


Hello,

today I've spent several hours debugging a segfault in JCC [1]. JCC is a
framework to wrap Java code for Python. It's most prominently used in
PyLucene [2]. You can read more about my debugging in [3].

With JCC every Python thread must be registered at the JVM through JCC.
An unattached thread that accesses a wrapped Java object leads to
errors and may even cause a segfault. Accessing also includes garbage
collection. A code line like

   a = {}

or

   "a b c".split()

can segfault since the allocation of a dict or a bound method runs
through _PyObject_GC_New(), which may trigger a cyclic garbage
collection run. If the current thread isn't attached to the JVM but
triggers a gc.collect() with some Java objects in a cycle, the
interpreter crashes. It's complicated and hard to "fix" third-party
tools so that they attach every thread the third-party library creates.
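To illustrate that a plain allocation can start a collection in whatever
thread happens to allocate, here is a small sketch using gc.callbacks
(available in newer Python 3 versions). The names record, allocate and
collected_in are made up for this example, and nothing here involves JCC
or the JVM.

```python
import gc
import threading

collected_in = set()  # names of threads in which a collection started


def record(phase, info):
    # gc calls this with phase "start" or "stop" around each collection.
    if phase == "start":
        collected_in.add(threading.current_thread().name)


def allocate():
    # Each dict is a GC-tracked container; the self-reference keeps it
    # alive as a cycle, so the allocation counter keeps growing.
    for _ in range(200000):
        a = {}
        a["self"] = a


gc.callbacks.append(record)
try:
    worker = threading.Thread(target=allocate, name="worker")
    worker.start()
    worker.join()
finally:
    gc.callbacks.remove(record)

print(sorted(collected_in))
```

With the default thresholds I would expect "worker" to show up in the
set, i.e. the collection runs inside the thread that did nothing but
create dicts.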

The issue could be solved with a simple on_thread_start hook in the
threading module. However, there is more to it. In order to free memory,
a thread must also be detached from the JVM when it ends. A second
on_thread_stop hook isn't enough, since the bound methods may also
lead to a gc.collect() run after the thread is detached.

I propose three changes to Python in order to fix the issue:

on thread start hook
--------------------

Similar to the atexit module, third party modules can register a
callable with *args and **kwargs. The functions are called inside the
newly created thread just before the target is called. The best place
for the hook list is threading.Thread._bootstrap_inner() right before
the try: self.run() except: block. Exceptions are ignored during the
call but reported to the user at the end (same as atexit's
atexit_callfunc()).
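A rough pure-Python sketch of the intended behaviour, emulated with a
Thread subclass instead of a change to _bootstrap_inner(). The names
register_thread_start, _start_hooks, _call_hooks and HookedThread are
hypothetical and only serve the illustration; the real hook would apply
to every thread, not just to instances of a subclass.

```python
import sys
import threading
import traceback

_start_hooks = []  # hypothetical registry, similar in spirit to atexit


def register_thread_start(func, *args, **kwargs):
    """Remember a callable to run inside each newly started thread."""
    _start_hooks.append((func, args, kwargs))
    return func


def _call_hooks(hooks):
    # Run every hook; exceptions do not stop the remaining hooks and
    # are reported once all hooks have been called.
    failures = []
    for func, args, kwargs in hooks:
        try:
            func(*args, **kwargs)
        except Exception:
            failures.append(traceback.format_exc())
    for report in failures:
        print(report, file=sys.stderr)


class HookedThread(threading.Thread):
    def run(self):
        # Runs in the new thread, just before the target is called.
        _call_hooks(list(_start_hooks))
        super().run()


seen = []
register_thread_start(lambda: seen.append(threading.current_thread().name))

t = HookedThread(target=lambda: seen.append("target"), name="worker")
t.start()
t.join()
print(seen)
```

A library like JCC could then register its attach call once at import
time instead of patching every place that creates threads.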


on thread end hook
------------------

Same as on thread start hook but the callables are called inside the
dying thread after self.run().
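The same kind of emulation for the end hook, again only as a sketch.
register_thread_stop, _stop_hooks and StopHookThread are hypothetical
names. The hooks run in a finally block so they are also called when the
target raises.

```python
import threading

_stop_hooks = []  # hypothetical registry for end-of-thread callables


def register_thread_stop(func, *args, **kwargs):
    """Remember a callable to run inside each thread after run()."""
    _stop_hooks.append((func, args, kwargs))
    return func


class StopHookThread(threading.Thread):
    def run(self):
        try:
            super().run()
        finally:
            # Still executing inside the dying thread at this point.
            for func, args, kwargs in list(_stop_hooks):
                try:
                    func(*args, **kwargs)
                except Exception as exc:
                    print("thread stop hook failed:", exc)


events = []
register_thread_stop(events.append, "detached")


def work():
    events.append("working")


t = StopHookThread(target=work)
t.start()
t.join()
print(events)
```

Note that Python code still runs in the thread after the hooks have
returned, which is why a stop hook alone does not close the window for
a late collection.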


gc.disable_thread(), gc.enable_thread(), gc.isenabled_thread()
--------------------------------------------------------------

Right now almost any code can trigger a gc.collect() run
non-deterministically. Some applications like JCC want to control
whether gc.collect() may run on a per-thread level. This could be
solved with a new flag in PyThreadState. PyThreadState->gc_enabled is
enabled by default. When the flag is false, _PyObject_GC_Malloc()
doesn't start a gc.collect() run for that thread. The collection is
delayed until another thread or the main thread triggers it.

The three functions should also have a C equivalent so C code can
prevent gc in a thread.
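The semantics I have in mind can be modelled in pure Python. This is
only a model: the three functions do not exist in the gc module, a
threading.local stands in for the proposed PyThreadState->gc_enabled
flag, and maybe_collect() is a made-up stand-in for the check that
_PyObject_GC_Malloc() would perform.

```python
import gc
import threading

_state = threading.local()  # stands in for PyThreadState->gc_enabled


def disable_thread():
    _state.gc_enabled = False


def enable_thread():
    _state.gc_enabled = True


def isenabled_thread():
    # Enabled by default for every thread that never touched the flag.
    return getattr(_state, "gc_enabled", True)


def maybe_collect():
    """Model of the allocator check: skip the run if this thread opted out."""
    if not isenabled_thread():
        return None  # collection is left to some other thread
    return gc.collect()


results = {}


def worker():
    disable_thread()
    results["worker"] = maybe_collect()


t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread never disabled collection, so it performs the run.
results["main"] = maybe_collect()
print(results)
```

The worker entry is None because the thread opted out, while the main
thread gets the usual integer result from gc.collect().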


Thoughts?

Christian

[1] http://lucene.apache.org/pylucene/jcc/index.html
[2] http://lucene.apache.org/pylucene/
[3] http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201105.mbox/browser


