Strange occasional marshal error

Wed Mar 2 10:01:11 EST 2011

Hi,

I'm using Python with ZeroMQ to distribute data around an HPC cluster.
The results have been good apart from one issue which I am completely
stuck with:

We are using marshal for serialising objects before distributing them
around the cluster, and extremely occasionally a corrupted serialisation
is produced. The current workaround is to serialise everything twice and
check that the serialisations are the same. On the rare occasions that
they are not, I have dumped the files for comparison. It turns out
that there are a few positions within the serialisation where
corruption tends to occur (these positions seem to be independent of
the data or the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)

I note that the ratio between the later positions is almost exactly
1.125. Presumably this has something to do with memory allocation
somewhere?
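
The ratio can be checked directly from the offsets listed above. The
following is only a quick sketch of that arithmetic. The integer form
a + (a >> 3) is my assumption about what a 12.5% growth step in a buffer
might look like; I have not confirmed where (or whether) such a step
occurs.

```python
# Offsets (in bytes) at which corruption was observed, copied from the list above.
offsets = [548867, 4398083, 17595395, 19794819, 22269171, 25052819, 28184419]

# Ratio of each offset to the one before it.
pairs = list(zip(offsets, offsets[1:]))
ratios = [b / a for a, b in pairs]

# Assumption: a 12.5% growth step computed with integer arithmetic.
# "gap" is how far the next observed offset is from that prediction.
gaps = [b - (a + (a >> 3)) for a, b in pairs]

for (a, b), ratio, gap in zip(pairs, ratios, gaps):
    print(f"{b:>9} / {a:>9} = {ratio:.6f}   gap from a + (a >> 3): {gap}")
```

Only the last four ratios (the "later positions") come out close to
1.125; the first two steps are much larger.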

Some datapoints:

- The phenomenon has been observed in a single-threaded process
without ZeroMQ
- I think the phenomenon has been observed in pickled as well as
marshalled data
- The phenomenon has been observed on different hardware

Unfortunately after quite a lot of work I still haven't managed to
reproduce this error on a single machine. Hopefully the above is
enough information for someone to speculate as to where the problem
is.
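
In case it helps, here is a minimal sketch of the double-serialisation
check described above. It is a simplified illustration rather than the
code we actually run: the function names dumps_checked and diff_runs are
made up for this example, and it only compares the bytes up to the
shorter of the two serialisations.

```python
import marshal


def diff_runs(a, b):
    """Return (start, length) for each run of consecutive differing bytes.

    Only the common prefix length of a and b is compared; a length
    mismatch should be reported separately by the caller.
    """
    runs = []
    start = None
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the compared region
        runs.append((start, limit - start))
    return runs


def dumps_checked(obj):
    """Serialise obj twice with marshal and report where the results differ."""
    first = marshal.dumps(obj)
    second = marshal.dumps(obj)
    return first, diff_runs(first, second)


if __name__ == "__main__":
    data, runs = dumps_checked({"values": list(range(1000))})
    for start, length in runs:
        print(f"{length} bytes starting at {start} ({start:#x})")
```

Any runs reported are printed in the same form as the list of positions
above, which makes it easy to compare dumps from different occasions.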

Many thanks in advance for any help.

Regards,

Graham