Strange occasional marshal error

Graham Stratton grahamstratton at gmail.com
Wed Mar 2 10:01:11 EST 2011


Hi,

I'm using Python with ZeroMQ to distribute data around an HPC cluster.
The results have been good apart from one issue which I am completely
stuck with:

We are using marshal for serialising objects before distributing them
around the cluster, and extremely occasionally a corrupted marshal is
produced. The current workaround is to serialise everything twice and
check that the serialisations are the same. On the rare occasions that
they are not, I have dumped the files for comparison. It turns out
that there are a few positions within the serialisation where
corruption tends to occur (these positions seem to be independent of
the data of the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)

I note that the ratio between the later positions is almost exactly
1.125. Presumably this has something to do with memory allocation
somewhere?

Some datapoints:

- The phenomenon has been observed in a single-threaded process
without ZeroMQ
- I think the phenomenon has been observed in pickled as well as
marshalled data
- The phenomenon has been observed on different hardware

Unfortunately after quite a lot of work I still haven't managed to
reproduce this error on a single machine. Hopefully the above is
enough information for someone to speculate as to where the problem
is.

Many thanks in advance for any help.

Regards,

Graham



More information about the Python-list mailing list