The reliability of Python threads

Paul Rubin
Thu Jan 25 16:16:59 EST 2007


"Paddy" <paddy3118 at netscape.net> writes:
> > But you're proposing cargo cult programming.
> I don't know that term.

http://en.wikipedia.org/wiki/Cargo_cult_programming

> What I'm proposing is that if, for example, a process stops running
> three times in a year at roughly three to four month intervals,
> and it should have stayed up, then restart the server sooner, at a
> time of your choosing,

What makes you think that restarting the server will make it less
likely to fail?  It sounds to me like there's zero evidence of that,
since you say "roughly three or four month intervals" and talk about
threading and race conditions.  If it's failing every 3 months, 15
days and 2.43 hours like clockwork, that's different: sure, restart it
every three months.  But the description I see so far sounds like a
random failure caused by some events occurring with low enough
probability that they only happen on average every few months of
operation.  That kind of thing is very common and is often best
diagnosed by instrumenting the hell out of the code.
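
As a minimal sketch of what that instrumentation could look like
(Python 3, standard library only; the names "instrumented" and "log"
are made up for illustration and are not from the server under
discussion), wrap each thread's target so that an uncaught exception
is logged with the thread name and a traceback instead of vanishing:

```python
import logging
import threading

log = logging.getLogger("worker")  # hypothetical logger name

def instrumented(target):
    """Wrap a thread target so start, exit and any uncaught exception are logged."""
    def wrapper(*args, **kwargs):
        name = threading.current_thread().name
        log.info("thread %s starting", name)
        try:
            return target(*args, **kwargs)
        except Exception:
            # log.exception appends the full traceback to the message
            log.exception("thread %s died", name)
            raise
        finally:
            log.info("thread %s exiting", name)
    return wrapper

# Usage: threading.Thread(target=instrumented(handle_requests)).start()
# where handle_requests stands in for whatever the worker really runs.
```

Adding a timestamp to the log format and a count of requests handled
so far would give the raw data needed to tell random failures from
periodic ones.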

> > There is no reason whatsoever to expect that restarting the server
> > now and then will help the problem in the slightest.
> That's where we most likely differ.

Do you think there is a reason to expect that restarting the server
will help the problem in the slightest?  I realize you seem to expect
that, but you have not given a REASON.  That's what I mean by cargo
cult programming.
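
To make the point concrete, here is a rough simulation under the
assumption that failures really are Poisson, i.e. the time to the next
failure is exponentially distributed.  The 3.5-month mean and the
2-month restart period are illustrative numbers, not measurements, and
"failures_per_year" is a made-up name:

```python
import random

def failures_per_year(mean_months, restart_every=None, years=20000, seed=1):
    """Simulate exponentially distributed failures, optionally with planned restarts."""
    rng = random.Random(seed)
    horizon = 12.0 * years          # total simulated time, in months
    t = 0.0
    failures = 0
    while t < horizon:
        # Time to the next failure, measured from a fresh start.
        ttf = rng.expovariate(1.0 / mean_months)
        if restart_every is not None and ttf > restart_every:
            t += restart_every      # planned restart came first, no failure
        else:
            t += ttf                # the failure came first
            failures += 1
    return failures / years

print(failures_per_year(3.5))                     # never restart preemptively
print(failures_per_year(3.5, restart_every=2.0))  # restart every 2 months
```

Both runs should come out near 12 / 3.5, about 3.4 failures a year.
The exponential distribution is memoryless, so a freshly restarted
server is exactly as likely to fail in the next hour as one that has
been up for months.  Restarts only help if the failure rate grows with
uptime, and that is the thing nobody has shown.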

> Whilst you sit agreeing on how many fairies can dance on the end of a
> pin or not, your company could be losing customers. You and Nick seem
> to be saying it *must* be Poisson, therefore we can't do...

I dunno about Nick; I'm saying it's best to assume that it's Poisson
and do whatever is necessary to diagnose and fix the bug, and that the
voodoo measure you're proposing is not all that likely to help and it
will take years to find out whether it helps or not (i.e. restarting
after 3 months and going another 3 months without a failure proves
nothing).
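
A back-of-the-envelope version of that, again assuming Poisson
failures with a mean interval of about 3.5 months (my reading of
"three to four months", not a measured figure; the function names are
made up):

```python
import math

def p_no_failure(window_months, mean_interval_months):
    """P(no failure during the window) for a Poisson process with the given mean interval."""
    return math.exp(-window_months / mean_interval_months)

def clean_windows_needed(window_months, mean_interval_months, alpha=0.05):
    """Consecutive failure-free windows needed before chance alone has probability < alpha."""
    p = p_no_failure(window_months, mean_interval_months)
    return math.ceil(math.log(alpha) / math.log(p))

print(round(p_no_failure(3.0, 3.5), 2))   # about 0.42
print(clean_windows_needed(3.0, 3.5))     # 4
```

So a server that was never restarted would get through a given 3 month
stretch without failing about 42% of the time anyway.  Under these
assumptions you'd need four clean stretches in a row, a full year,
before luck became an unlikely explanation even at the 5% level, and
longer for anything more convincing.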

> I'm sorry, but your argument reminds me of when Western statistical
> quality control first met with the Japanese Zero defects methodologies.
> We had argued ourselves into accepting a certain amount of defective
> cars getting out to customers as the result of our theories. The
> Japanese practices emphasized *no* defects were acceptable at the
> customer, and they seemed to deliver better made cars.

I don't see your point.  You're the one who wants to keep operating
defective software instead of fixing it.

> "at random" - "every few months"
> Me thinking it happens "every few months" allows me to search for a
> fix.  If thinking it happens "at random" leads you to a brick wall,
> then switch!

But you need evidence before you can say it happens every few months.
Do you have, say, a graph of the exact dates and times of failure, the
number of requests processed so far, etc.?  If it happened at an
exact or nearly exact fixed interval, or precisely once every
1.273 million requests or whatever, that tells you something.  But the
earlier description didn't sound like that.  Restarting the server is
not much better than carrying a lucky rabbit's foot.
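
One crude way to check, sketched here with made-up dates (the real
ones would come from the server logs), is the coefficient of variation
of the gaps between failures.  "interval_cv" is a hypothetical helper
name:

```python
from datetime import datetime
from statistics import mean, stdev

def interval_cv(timestamps):
    """Coefficient of variation (stdev / mean) of the gaps between failures."""
    times = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three failure times")
    return stdev(gaps) / mean(gaps)

# Hypothetical failure times, for illustration only.
failures = [
    datetime(2006, 1, 14, 3, 12),
    datetime(2006, 4, 29, 17, 40),
    datetime(2006, 8, 2, 9, 5),
    datetime(2006, 12, 20, 22, 31),
]
print(interval_cv(failures))
```

A value near 0 means the failures come like clockwork; a value near 1
is what exponentially distributed (Poisson) gaps would give.  With
only three failures you get two gaps, which is far too little data to
tell those apart, and that is rather the point.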
