Linux fork and threads

Wed Aug 9 10:32:55 EDT 2000

Seeing as I do a lot of multithreaded forking, I am interested in the
problems seen with 2.0.

Testing Python 2.0 on a Linux 2.2.14-3smp node shows the described 
behaviour for me too:
1. Most commonly Python hangs; ps -ef shows:
jon       6547  6515  0 23:25 pts/3    00:00:00 ./py2k/bin/python ./py2k/lib/pyt
jon       6548  6547  0 23:25 pts/3    00:00:00 ./py2k/bin/python ./py2k/lib/pyt
jon       6549  6548  0 23:25 pts/3    00:00:00 ./py2k/bin/python ./py2k/lib/pyt
jon       6550  6548  0 23:25 pts/3    00:00:00 ./py2k/bin/python ./py2k/lib/pyt
jon       6551  6548 99 23:25 pts/3    00:09:22 ./py2k/bin/python ./py2k/lib/pyt
jon       6552  6548 99 23:25 pts/3    00:09:20 ./py2k/bin/python ./py2k/lib/pyt

Note: the last two threads (PIDs 6551 and 6552) are running flat out at
99% CPU.  Sometimes just one of the 6 threads runs flat out.

2. Sometimes (~20% of runs) I get a segmentation fault; gdb shows:

GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Core was generated by `./py2k/bin/python ./py2k/lib/python2.0/test/test_fork1.py'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpthread.so.0...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libutil.so.1...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
#0  0x2ab91e2e in __select () from /lib/libc.so.6
(gdb) bt
#0  0x2ab91e2e in __select () from /lib/libc.so.6
#1  0x7f1ffabc in ?? ()
#2  0x80b5d83 in time_sleep (self=0x0, args=0x8240a54) at ./timemodule.c:209
#3  0x8058d1e in call_builtin (func=0x823cfb0, arg=0x8246104, kw=0x0)
    at ceval.c:2369
#4  0x8058c2b in PyEval_CallObjectWithKeywords (func=0x823cfb0, arg=0x8246104, 
    kw=0x0) at ceval.c:2337
#5  0x8057c4c in eval_code2 (co=0x8248e50, globals=0x8207ae4, locals=0x0, 
    args=0x823cf38, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    owner=0x0) at ceval.c:1675
#6  0x805906e in call_function (func=0x8243f54, arg=0x823cf2c, kw=0x0)
    at ceval.c:2491
#7  0x8058c1d in PyEval_CallObjectWithKeywords (func=0x8243f54, arg=0x823cf2c, 
    kw=0x0) at ceval.c:2335
#8  0x809d3d9 in t_bootstrap (boot_raw=0x82105c8) at ./threadmodule.c:199
#9  0x2aac7032 in pthread_start_thread (arg=0x7f1ffe60) at manager.c:213
(gdb) 

3. Very rarely, the test appears to succeed.

Removing the sleep from f() caused the segfault to occur in a different 
location:

GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Core was generated by `./py2k/bin/python ./py2k/lib/python2.0/test/test_fork1.py'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpthread.so.0...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libutil.so.1...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
#0  0x2ab158be in __sigsuspend (set=0x7f7ffac0)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
(gdb) bt
#0  0x2ab158be in __sigsuspend (set=0x7f7ffac0)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1  0x2aac624c in pthread_cond_wait (cond=0x821061c, mutex=0x8210628)
    at restart.h:49
#2  0x8069eca in PyThread_acquire_lock (lock=0x8210618, waitflag=1)
    at thread_pthread.h:311
#3  0x8056061 in eval_code2 (co=0x8248d48, globals=0x8207ae4, locals=0x0, 
    args=0x8210c08, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    owner=0x0) at ceval.c:598
#4  0x805906e in call_function (func=0x8243f54, arg=0x8210bfc, kw=0x0)
    at ceval.c:2491
#5  0x8058c1d in PyEval_CallObjectWithKeywords (func=0x8243f54, arg=0x8210bfc, 
    kw=0x0) at ceval.c:2335
#6  0x809d3d9 in t_bootstrap (boot_raw=0x8248c28) at ./threadmodule.c:199
#7  0x2aac7032 in pthread_start_thread (arg=0x7f7ffe60) at manager.c:213
(gdb) 

This is very close to, but actually after, the point where Tim's theory
predicts a problem.  Maybe the cause lies in the implementation details of
the thread library.

Either way, Tim appears to have identified yet another lapse in the threading
code.  Has anybody thought of a good solution?

The best I've come up with so far is to have only one pthread_mutex shared
by all the locks (each lock would then consist only of a flag and a
pthread_cond).  This would serialise all the lock operations, but would
enable the fork code to grab the single pthread_mutex (through a suitably
generic interface) and hold it during the actual fork() call.
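
To make the idea concrete, here is a rough model of it, written in Python
rather than C for brevity.  It is only a sketch under assumptions: the names
_global_mutex, FlagLock and fork_with_locks_quiesced are invented for
illustration, threading.Lock stands in for the single pthread_mutex, and
threading.Condition stands in for each lock's pthread_cond.  It is not the
code in thread_pthread.h.

```python
import os
import threading

# Stand-in for the single pthread_mutex shared by every lock (hypothetical name).
_global_mutex = threading.Lock()


class FlagLock:
    """Hypothetical lock made of a flag plus a condition variable.

    All state changes happen while holding the one shared mutex.
    """

    def __init__(self):
        self._locked = False
        # Every condition variable is bound to the same shared mutex.
        self._cond = threading.Condition(_global_mutex)

    def acquire(self, waitflag=True):
        with self._cond:  # takes the shared mutex
            if self._locked and not waitflag:
                return False
            while self._locked:
                self._cond.wait()  # drops the shared mutex while waiting
            self._locked = True
            return True

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify()


def fork_with_locks_quiesced():
    """Fork while holding the shared mutex, so that no other thread can be
    halfway through a lock operation at the moment of the fork."""
    with _global_mutex:
        pid = os.fork()
    # Parent and child each release their own copy of the mutex here.
    return pid
```

A caller would use fork_with_locks_quiesced() in place of a bare os.fork().
Note what this does and does not buy: the child starts with the shared mutex
in a known state, but if some other thread had a FlagLock's flag set at the
time of the fork, the flag is still set in the child and that thread is not
there to clear it.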

Jon.