Deadlock detection

Fri Dec 10 22:50:56 EST 2004

On Mon, 2004-12-06 at 06:21, Duncan Grisby wrote: 
> Hi,
> 
> Does anyone know of a deadlock detector for Python?  I don't think it
> would be too hard to hook into the threading module and instrument
> mutexes so they can be tested for deadlocks. I've googled around but I
> haven't found anything.

In general, accurate deadlock detection is intractable.  Like many
problems of this class you have two choices - either reduce the scope of
the problem to a non-general one or try a heuristic to guess.  

As I recall, deadlock prevention is similar to the halting problem; the
only question you can answer is "which category am I in:"

A) I know for sure there are no deadlocks
B) I don't know, maybe there are, maybe there arn't.  

In the halting problem, the answer to your question is B until you
actually halt, in which case the answer to your problem is obvious.  

Here is a quick and dirty heuristics to filter some programs as being in
bin A or bin B.

First, for bin B.  Instrument your mutex so that every time you lock, it
creates a directed edge in a global system wide graph from your current
mutex (mutex_N) to the next to most recently locked mutex you are
currently holding for the locking thread.   If your graph becomes
cyclic, you might be in bin B (well, you were never in bin A to being
with.)  Throw a "I've lost faith in my inability to deadlock" exception.

If you can *prove* that there is a strict topological order between
nested mutex invocations, then you are in bin A.  The degree of rigor in
your proof is up to you, your level of comfort and how many people die
if your code deadlocks (your personal website and medical instruments
have different standards.)

Not everybody trusts themselves.  There are a number of alternative
approaches, including having a single global critical section lock
(ancient linux smp code) or designing your mutex operation to imply the
release of all held locks.  Of course, if you must hold more than one
lock at a time, your mutex function can take a list of locks that it
will atomically apply.  The proper design of this is left as an exercise
for the reader.

Unless you thought of this from the beginning, retrofitting safe locks
into your existing large project will be expensive.  The next
possibility won't work for python, but it is useful to keep in mind.

The halting problem has a small caveat, it is applicable to "general"
Turing machines.  Place restrictions on your Turing machine that makes
it not a Turing machine and the problem goes away.  In real time systems
(oh my, cat < aborted.phd.thesis.tex > mail python-list at ... ) where you
have to compute the longest time any piece of code will take to execute
this sort of analysis is common place. Get rid of function pointers
(including weakly typed OOPLs like Python.)  You don't have to worry
about limiting loop counts like in a RTL, because we arn't interested in
timing information.  Oh, and ditch recursion.  Maybe you don't have to,
but I'm not sure.

Now walk through your code taking each possible path.  You can collapse
loops to each meaningful combination (depends on the nature of your
languages loop construct), collapse paths that don't have any mutex
operations.  You get the idea.  

Unless mutex calls are rare, or your code simple, you might spend a
while.  Largely this problem is intractable, even with simplifications,
but it is done which is why safety critical programs are (well, should
be) small and their languages not very expressive (as in finite state
machine, and not in the "but my computer is a FSM sense.") 

Adam DePrince