The reliability of Python threads

Thu Jan 25 04:26:40 EST 2007

In article <1169675599.502726.5070 at a34g2000cwb.googlegroups.com>,
"Paddy" <paddy3118 at netscape.net> writes:
|> 
|> Three to four months before `strange errors`? I'd spend some time
|> correlating logs; not just for your program, but for everything running
|> on the server. Then I'd expect to cut my losses and arrange to safely
|> re-start the program every TWO months.
|> (I'd arrange the re-start after collecting logs but before their
|> analysis. Life is too short).

Forget it.  That strategy is fine in general, but is a waste of time
where threading issues are involved (or signal handling, or some types
of communication problem, for that matter).  There are three unrelated
killer facts that interact:

    Such failures are usually probabilistic (a "Poisson process"), and
so have no "history": the chance of one in the next hour does not
depend on how long the program has been running.

    The expected number of failures is usually proportional to the
square of the activity, sometimes a higher power.

    Virtually nothing involved does any routine logging, or even has
options to log relevant events.

The first means that the strategy of restarting doesn't help.  All
three mean that current logs are almost never any use.
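
A minimal sketch of the first two facts, under stated assumptions: it
treats failures as a Poisson process with a constant rate, and it uses
pair counting as one possible reason for a square law.  The rate, the
uptimes and the function names are hypothetical, chosen only for
illustration; nothing here is measured from a real system.

```python
import math

def survival(t, rate):
    """P(no failure in an interval of length t) for a Poisson process."""
    return math.exp(-rate * t)

def conditional_survival(uptime, t, rate):
    """P(no failure in the next t, given `uptime` already survived)."""
    return survival(uptime + t, rate) / survival(uptime, rate)

def interacting_pairs(n):
    """Number of distinct pairs among n concurrent activities."""
    return n * (n - 1) // 2

# Hypothetical rate: one failure per 100 days on average.
rate = 1.0 / 100.0

# Chance of surviving the next 60 days, freshly restarted versus
# after 90 days of uptime.  Under this model the two are equal.
fresh = survival(60, rate)
aged = conditional_survival(90, 60, rate)

# Doubling the activity roughly quadruples the number of pairs that
# could interfere with each other.
pairs_10 = interacting_pairs(10)
pairs_20 = interacting_pairs(20)
```

Under this model `fresh` and `aged` agree to within rounding, which is
why a scheduled restart buys nothing, while `pairs_20` is roughly four
times `pairs_10`.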

Regards,
Nick Maclaren.