[Python-Dev] test_fork1 on SMP? (was Re: [Python Dev] test_fork1 failing --with-threads (for some people)...)

Thu, 27 Jul 2000 20:32:00 -0400

[Jeremy Hylton]
> ...
> For both bugs, though, a mutex and a condition variable are being used:

Oh ya -- now that you mention it, I wrote that code <wink> -- but more than
7 years ago!  How could a failure have gone undetected for so long?

> The interpreter lock is being acquired and released in both cases.
>
> My current theory is that Python isn't dealing with the interpreter
> lock correctly across a fork.  If some thread other than the one
> calling fork holds the interpreter lock mutex,

Let's flesh out the most likely bad case:

    the main thread gets into posix_fork
    one of the spawned threads (say, thread 1) tries to acquire the
        global lock
    thread 1 gets into PyThread_acquire_lock
    thread 1 grabs the pthread mutex guarding "the global lock"
    the main thread executes fork() while thread 1 holds the mutex
    in the original process, everything's still cool:  thread 1 still
       exists there, and it releases the mutex it acquired (after seeing
       that the "is it locked?" flag is set), yadda yadda yadda.
    but in the forked process, things are not cool:  the (cloned) mutex
       guarding the global lock is still held
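
For concreteness, here is a minimal Python sketch of that sequence. It is an
illustration under assumptions, not the interpreter's actual code: a plain
threading.Lock stands in for the pthread mutex guarding the global lock, the
names child_sees_lock_held, helper, holding and done are made up for the
example, and it is POSIX only. The child uses a timeout so that it reports
the stuck lock instead of hanging.

```python
import os
import threading


def child_sees_lock_held(timeout=0.5):
    """Fork while a helper thread holds a lock (POSIX only).

    Returns (child_stuck, parent_ok): whether the child found the copied
    lock still held, and whether the parent could acquire it afterwards.
    """
    lock = threading.Lock()       # stand-in for the mutex guarding the global lock
    holding = threading.Event()   # set once the helper owns the lock
    done = threading.Event()      # tells the helper to release and exit

    def helper():
        with lock:
            holding.set()
            done.wait()

    t = threading.Thread(target=helper)
    t.start()
    holding.wait()                # the helper definitely holds the lock now

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was cloned, so no thread exists
        # that could release the copied lock. Time out rather than hang.
        got = lock.acquire(timeout=timeout)
        os._exit(0 if got else 1)

    # Parent: the helper still exists here and releases as usual.
    done.set()
    t.join()
    _, status = os.waitpid(pid, 0)
    parent_ok = lock.acquire(timeout=timeout)
    if parent_ok:
        lock.release()
    return os.WEXITSTATUS(status) == 1, parent_ok
```

In the parent the lock is released normally and can be taken again; in the
child the acquire times out, because the only thread that could release the
copied lock was not carried across the fork.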

What happens next in the child process is interesting <wink>:  there is only
one thread in the child process, and it's still in posix_fork.  There it
sets the main_thread and main_pid globals, and returns to the interpreter
loop.  That the forked pthread_mutex is still locked is irrelevant at this
point:  the child process won't care about that until enough bytecodes pass
that its sole thread offers to yield.  It doesn't bash into the
already-locked cloned pthread mutex until it executes PyThread_release_lock
as part of offering to yield.  Then the child hangs.  Don't know about this
specific implementation, but pthread mutex acquires were usually implemented
as busy-loops in my day (which is one reason Python locks were *not* modeled
directly as pthread mutexes).
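
The lock design referred to here (an "is it locked?" flag guarded by a mutex,
plus a condition variable to sleep on) can be sketched in Python as follows.
This is a rough model under assumptions, not a transcription of the C code;
the class name FlagLock and its attributes are hypothetical. The point to
notice is that release has to take the guard mutex too, which is why a
release can block if that mutex is stuck in the locked state.

```python
import threading


class FlagLock:
    """Lock built from a flag, a guard mutex, and a condition variable."""

    def __init__(self):
        # threading.Condition bundles the guard mutex with the condition variable.
        self._cond = threading.Condition(threading.Lock())
        self._locked = False      # the "is it locked?" flag

    def acquire(self, blocking=True):
        with self._cond:          # guard mutex is held only briefly
            if not blocking and self._locked:
                return False
            while self._locked:
                self._cond.wait() # sleep on the condition instead of spinning
            self._locked = True
            return True

    def release(self):
        with self._cond:          # release must take the guard mutex as well
            self._locked = False
            self._cond.notify()   # wake one waiter, if any
```

A waiter blocks on the condition variable rather than on the mutex itself, so
the mutex is only ever held for the few instructions needed to inspect or
change the flag.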

So, in this scenario, the child hangs in a busy loop some accidental
amount of time after the fork.

Matches your symptoms?  It doesn't match Trent's segfault, but one nightmare
at a time ...