Python reliability
Steven D'Aprano
steve at REMOVEMEcyber.com.au
Sun Oct 9 22:18:42 EDT 2005
George Sakkis wrote:
> Steven D'Aprano wrote:
>
>
>>On Sun, 09 Oct 2005 23:00:04 +0300, Ville Voipio wrote:
>>
>>
>>>I would need to make some high-reliability software
>>>running on Linux in an embedded system. Performance
>>>(or lack of it) is not an issue, reliability is.
>>
>>[snip]
>>
>>
>>>The software should be running continously for
>>>practically forever (at least a year without a reboot).
>>>Is the Python interpreter (on Linux) stable and
>>>leak-free enough to achieve this?
>>
>>If performance is really not such an issue, would it really matter if you
>>periodically restarted Python? Starting Python takes a tiny amount of time:
>
>
> You must have missed or misinterpreted the "The software should be
> running continously for practically forever" part. The problem of
> restarting python is not the 200 msec lost but putting at stake
> reliability (e.g. for health monitoring devices, avionics, nuclear
> reactor controllers, etc.) and robustness (e.g. a computation that
> takes weeks of cpu time to complete is interrupted without the
> possibility to restart from the point it stopped).
Er, no, I didn't miss that at all. I did miss that it
needed continual network connections. I don't know if
there is a way around that issue, although mobile
phones move in and out of network areas, swapping
connections when and as needed.
But as for reliability, well, tell that to Buzz Aldrin
and Neil Armstrong. The Apollo 11 moon lander rebooted
multiple times on the way down to the surface. It was
designed to recover gracefully when rebooting unexpectedly:
http://www.hq.nasa.gov/office/pao/History/alsj/a11/a11.1201-pa.html
I don't have an authoritative source for how many times
the computer rebooted during the landing, but it was
measured in the dozens. Calculations were performed in
an iterative fashion, with an initial estimate that was
improved over time. If a calculation was interrupted,
the computer lost no more than one iteration.
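That iterative style translates directly to Python. Here is a minimal sketch (the function and names are mine, not anything from the Apollo guidance code): each step only improves the last completed estimate, so an interruption costs at most one iteration of work.

```python
# Interruption-tolerant iterative refinement: we never hold
# partial state, only the last completed estimate, so a restart
# loses at most one iteration. Example function is illustrative.

def refine(estimate):
    """One improvement step: Newton's method for sqrt(2)."""
    return (estimate + 2.0 / estimate) / 2.0

def iterate(estimate, steps):
    for _ in range(steps):
        # If we are killed mid-step, the previous good
        # estimate is still intact and usable on restart.
        estimate = refine(estimate)
    return estimate

print(iterate(1.0, 10))   # converges toward sqrt(2)
```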
I'm not saying that this strategy is practical or
useful for the original poster, but it *might* be. In a
noisy environment, it pays to design a system that can
recover transparently from a lost connection.
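As an illustration of that kind of transparent recovery, here is a hedged sketch of a reconnect loop using only the standard socket module (host, port and the retry delay are placeholders, and a real client would cap or back off the retries):

```python
# Illustrative reconnect loop, not a complete client: on any
# socket-level error, wait briefly and try to re-establish the
# connection instead of letting the whole program die.
import socket
import time

def connect_forever(host, port, retry_delay=1.0):
    """Keep retrying until a connection is established."""
    while True:
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            time.sleep(retry_delay)   # transient failure: try again
```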
If your heart monitor can reboot in 200 ms, you might
miss one or two beats, but so long as you pick up the
next one, that's just noise. If your calculation takes
more than a day of CPU time to complete, you should
design it in such a way that you can save state and
pick it up again when you are ready. You never know
when the cleaner will accidentally unplug the computer...
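One minimal way to save state in Python, sketched below with pickle (filenames and the state layout are my own invention): write the checkpoint to a temporary file and rename it into place, so a crash mid-write never corrupts the last good checkpoint.

```python
# Minimal checkpoint/restore sketch. os.replace() is atomic on
# POSIX and Windows, so the checkpoint file is always either the
# old complete state or the new complete state, never half-written.
import os
import pickle

CHECKPOINT = "state.pkl"

def save_state(state, path=CHECKPOINT):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)        # atomic rename over the old file

def load_state(path=CHECKPOINT, default=None):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default           # first run: no checkpoint yet

# Resume from wherever we left off, checkpointing every step.
state = load_state(default={"step": 0, "total": 0})
while state["step"] < 5:
    state["total"] += state["step"]
    state["step"] += 1
    save_state(state)
```

If the process dies at any point, rerunning the script picks up from the last completed step rather than from scratch.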
--
Steven.