[Python-Dev] test_fork1 on SMP? (was Re: [Python Dev] test_fork1 failing --with-threads (for some people)...)

Tim Peters tim_one@email.msn.com
Mon, 31 Jul 2000 04:42:50 -0400


[Jeremy Hylton, with an excellent writeup of what he's found out
 about his test_fork2.py test]

It's a baffler!  AFAIK, nobody yet has thought of a way that a fork can
screw up the state of the locks in the *parent* process (it must be easy to
see how they can get screwed up in a child, because two of us already did
<wink>).
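
For anyone who hasn't stubbed a toe on the child-side version yet, it's the
classic picture:  fork() copies the whole address space, including the memory
of any lock some *other* thread happens to hold at that instant, but only the
forking thread exists in the child, so nothing is left to release it.  A
bare-pthreads sketch of that (names made up for illustration, nothing from
test_fork2):

    /* sketch only -- compile with: cc -pthread forkdemo.c */
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *holder(void *arg)
    {
        pthread_mutex_lock(&lock);      /* this thread owns the lock ... */
        sleep(2);
        pthread_mutex_unlock(&lock);    /* ... in the parent, anyway */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pid_t pid;

        pthread_create(&t, NULL, holder, NULL);
        sleep(1);                       /* let holder grab the lock first */

        pid = fork();
        if (pid == 0) {
            /* Child:  the copied mutex still says "locked", but the thread
             * that locked it doesn't exist here -- this blocks forever. */
            pthread_mutex_lock(&lock);
            printf("child got the lock (never printed)\n");
            _exit(0);
        }
        pthread_join(t, NULL);          /* the parent's copy gets released */
        sleep(2);                       /* the child is still stuck */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return 0;
    }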

But your hangs appear to be in parents, as the gdb stacktrace shows that a
parent process is (presumably) hanging while trying to reacquire the global
lock in the expansion of the Py_END_ALLOW_THREADS macro right after the

    pid = wait(&status);

in posix_wait returns (#2 is from the macro expansion):

#0  pthread_cond_wait (cond=0x80f297c, mutex=0x80f2988) at queue.h:25
#1  0x806fee3 in PyThread_acquire_lock (lock=0x80f2978, waitflag=1)
    at ../../Python/thread_pthread.h:311
#2  0x8054ec1 in PyEval_RestoreThread (tstate=0x80fb2f0)
    at ../../Python/ceval.c:171
#3  0x80c1242 in posix_wait (self=0x0, args=0x80f1cac)
    at ../../Modules/posixmodule.c:2849
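
For reference, posix_wait is about as simple as these things get -- roughly
this (quoting from memory, so the details may be off):

    static PyObject *
    posix_wait(PyObject *self, PyObject *args)
    {
        int pid, status;
        if (!PyArg_ParseTuple(args, ":wait"))
            return NULL;
        Py_BEGIN_ALLOW_THREADS          /* release the global lock ... */
        pid = wait(&status);            /* ... block in the kernel ... */
        Py_END_ALLOW_THREADS            /* ... and reacquire it here */
        if (pid == -1)
            return posix_error();
        return Py_BuildValue("ii", pid, status);
    }

Py_END_ALLOW_THREADS expands to PyEval_RestoreThread(_save), and
PyEval_RestoreThread calls PyThread_acquire_lock on the global interpreter
lock -- which is frame #2 in the trace.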

However, the stacktrace shows that the parent in question is executing
pthread_cond_wait, and this is the worst implementation of pthread_cond_wait
in existence if it's using a busy loop to wait for the condition variable to
get signaled!  If this process keeps accumulating cycles, then either
pthread_cond_wait really *is* the worst implementation in existence, or the
process isn't actually hung and you just happened to capture the stack at a
misleading time.
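
For context, the pthread flavor of PyThread_acquire_lock is (again roughly,
from memory) just a flag guarded by a <mutex, condition> pair, so a blocking
acquire ought to be sound asleep inside pthread_cond_wait rather than
spinning:

    /* rough shape of the lock in Python/thread_pthread.h; error checking
     * and the debug prints stripped, details possibly off.
     * PyThread_type_lock is just a void * from pythread.h. */
    #include <pthread.h>

    typedef struct {
        char            locked;         /* 0 = unlocked, 1 = locked */
        pthread_cond_t  lock_released;  /* signaled by a release */
        pthread_mutex_t mut;            /* protects "locked" */
    } pthread_lock;

    int PyThread_acquire_lock(PyThread_type_lock lock, int waitflag)
    {
        pthread_lock *thelock = (pthread_lock *)lock;
        int success;

        pthread_mutex_lock(&thelock->mut);
        success = (thelock->locked == 0);
        if (!success && waitflag) {
            /* the wait in the trace:  sleep until a release clears
             * "locked" and signals lock_released */
            while (thelock->locked)
                pthread_cond_wait(&thelock->lock_released, &thelock->mut);
            success = 1;
        }
        if (success)
            thelock->locked = 1;
        pthread_mutex_unlock(&thelock->mut);
        return success;
    }

A thread parked in that pthread_cond_wait shouldn't accumulate any cycles at
all while it waits -- if the library's cond_wait is sane.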

The debug output may be misleading too:

> ...
> [lots of stuff omitted]

Ouch -- wish you hadn't.  Debug output is likely to come out-of-order, so
without *all* of it we can't see exactly how far out of balance the acquire
and release calls got.

> 2051 2127
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_release_lock(0x80f6498) called
> PyThread_acquire_lock(0x80f6498, 1) -> 1
> PyThread_release_lock(0x80f6498) called
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_acquire_lock(0x80f6498, 1) -> 1
> PyThread_release_lock(0x80f6498) called
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_acquire_lock(0x80f6498, 1) -> 1
> 1024 2130
> PyThread_release_lock(0x80f6498) called
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_acquire_lock(0x80f6498, 1) -> 1
> PyThread_acquire_lock(0x80f6498, 1) called
> PyThread_acquire_lock(0x80f6498, 1) -> 1
> PyThread_release_lock(0x80f6498) called
> [blocks]

It's probably *not* blocking on the release:  In this bunch of output, there
are 7 calls to acquire, 5 calls to release, and 5 returns from acquire.
That the # of release calls == the # of acquire returns suggests this output
is "good enough", and if so that leaves two acquire calls waiting forever.


Don't know what to do now.  Obvious suggestions include putting many prints
into test_fork2 so we get a better idea of what's happening, or going back
to Trent's print-filled variant of test_fork1 (which I still think is an
easier one to study, as it forks only once instead of unboundedly often).

BTW, in *my* day, the answer to the question "what happens to a threaded
program at a fork?" was to walk away laughing -- and I'm not sure the reality of
that has actually changed, regardless of how many more words they may have
added to the pthreads docs in the meantime <0.7 wink>.

silly-humans-ly y'rs  - tim