Python reliability

Steven D'Aprano steve at REMOVEMEcyber.com.au
Sun Oct 9 22:18:42 EDT 2005


George Sakkis wrote:

> Steven D'Aprano wrote:
> 
> 
>>On Sun, 09 Oct 2005 23:00:04 +0300, Ville Voipio wrote:
>>
>>
>>>I would need to make some high-reliability software
>>>running on Linux in an embedded system. Performance
>>>(or lack of it) is not an issue, reliability is.
>>
>>[snip]
>>
>>
>>>The software should be running continously for
>>>practically forever (at least a year without a reboot).
>>>Is the Python interpreter (on Linux) stable and
>>>leak-free enough to achieve this?
>>
>>If performance is really not such an issue, would it really matter if you
>>periodically restarted Python? Starting Python takes a tiny amount of time:
> 
> 
> You must have missed or misinterpreted the "The software should be
> running continously for practically forever" part. The problem of
> restarting python is not the 200 msec lost but putting at stake
> reliability (e.g. for health monitoring devices, avionics, nuclear
> reactor controllers, etc.) and robustness (e.g. a computation that
> takes weeks of cpu time to complete is interrupted without the
> possibility to restart from the point it stopped).


Er, no, I didn't miss that at all. I did miss that it 
needed continual network connections. I don't know if 
there is a way around that issue, although mobile 
phones move in and out of network areas, swapping 
connections when and as needed.

But as for reliability, well, tell that to Buzz Aldrin 
and Neil Armstrong. The Apollo 11 moon lander rebooted 
multiple times on the way down to the surface. It was 
designed to recover gracefully when rebooting unexpectedly:

http://www.hq.nasa.gov/office/pao/History/alsj/a11/a11.1201-pa.html

I don't have an authoritative source for how many 
times the computer rebooted during the landing, but it 
was measured in the dozens. Calculations were performed 
in an iterative fashion, with an initial estimate that 
was improved over time. If a calculation was 
interrupted, the computer lost no more than one iteration.
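That restart-tolerant pattern can be sketched in modern 
Python. Everything here is illustrative -- the square-root 
example, the file name and the JSON format are my own 
inventions, not anything the Apollo guidance computer did:

```python
import json
import os

STATE_FILE = "estimate.json"  # illustrative checkpoint file

def load_estimate(default):
    """Resume from the last saved estimate, or start fresh."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["estimate"]
    except (OSError, ValueError, KeyError):
        return default

def save_estimate(x):
    """Persist the estimate so a reboot loses at most one iteration."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"estimate": x}, f)
    os.replace(tmp, STATE_FILE)  # atomic rename: no half-written state

def sqrt_newton(n, tolerance=1e-9):
    """Iteratively refine an estimate of sqrt(n)."""
    x = load_estimate(default=n / 2.0)
    while abs(x * x - n) > tolerance:
        x = (x + n / x) / 2.0    # one Newton-Raphson step
        save_estimate(x)         # checkpoint after every iteration
    return x
```

Because the checkpoint is written with an atomic rename, a 
reboot in the middle of a write still leaves the previous 
iteration's estimate intact on disk.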

I'm not saying that this strategy is practical or 
useful for the original poster, but it *might* be. In a 
noisy environment, it pays to design a system that can 
recover transparently from a lost connection.

If your heart monitor can reboot in 200 ms, you might 
miss one or two beats, but so long as you pick up the 
next one, that's just noise. If your calculation takes 
more than a day of CPU time to complete, you should 
design it in such a way that you can save state and 
pick it up again when you are ready. You never know 
when the cleaner will accidentally unplug the computer...
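A minimal sketch of the restart-and-carry-on idea, assuming 
a worker you can simply re-launch. The supervise function 
and its parameters are invented for illustration -- in 
practice you would more likely reach for systemd, runit, or 
a hardware watchdog:

```python
import subprocess
import sys
import time

def supervise(cmd, backoff=0.2, max_restarts=None):
    """Re-run cmd whenever it dies; return once it exits cleanly.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return restarts
        restarts += 1
        if max_restarts is not None and restarts > max_restarts:
            raise RuntimeError("worker kept crashing; giving up")
        time.sleep(backoff)  # brief pause before restarting
```

Combined with checkpointed state as above, the worker picks 
up roughly where it left off each time, e.g. 
supervise([sys.executable, "worker.py"]).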


-- 
Steven.