[Numpy-discussion] py2/py3 pickling

Mon Aug 24 14:30:14 EDT 2015

On Aug 24, 2015 9:29 AM, "Pauli Virtanen" <pav at iki.fi> wrote:
>
> 24.08.2015, 01:02, Chris Laumann wrote:
> [clip]
> > Is there documentation about the limits and workarounds for py2/py3
> > pickle/np.save/load compatibility? I haven't found anything except
> > developer bug tracking discussions (eg. #4879 in github numpy).
>
> Not sure if it's written down somewhere but:
>
> - You should consider pickles not portable between Py2/3.
>
> - Setting encoding='bytes' or encoding='latin1' should produce correct
> results for numerical data. However, neither is "safe" because the
> option also affects other data than numpy arrays that you may have
> possibly saved.

For those wondering what's going on here: if you pickled a str in python 2,
then python 3 wants to unpickle it as a str. But in python 2 str was a
vector of arbitrary bytes in some assumed encoding, and in python 3 str is
a vector of Unicode characters. So it needs to know what encoding to use,
which is fine and what you'd expect for the py2->py3 transition.

But: when pickling arrays, numpy on py2 used a str to store the raw memory
of your array. Trying to run this data through a character decoder then
obviously makes a mess of everything. So the fundamental problem is that on
py2, there's no way to distinguish between a string of text and a string of
bytes -- they're encoded in exactly the same way in the pickle file -- and
the python 3 unpickler just has to guess. You can tell it to guess in a way
that works for raw bytes -- that's what the encoding= options Pauli
mentions above do -- but obviously this will then be incorrect if you have
any actual non-latin1 textual strings in your pickle, and you can't get it
to handle both correctly at the same time.
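
To make the guessing concrete, here is a small sketch that runs on python 3
only. It builds by hand the bytes a py2 pickler would write for a str
(protocol 2, SHORT_BINSTRING opcode) and loads them under each encoding
option. The helper name py2_str_pickle is made up for this illustration, and
it only handles strings shorter than 256 bytes.

```python
import pickle


def py2_str_pickle(data):
    """Hypothetical helper: the pickle py2 would write for a short str.

    Layout: PROTO 2, SHORT_BINSTRING + 1-byte length + payload, STOP.
    """
    assert len(data) < 256
    return b"\x80\x02U" + bytes([len(data)]) + data + b"."


# Raw memory, such as an array buffer: not text in any encoding.
raw = b"\xff\x00\x01"
raw_pickle = py2_str_pickle(raw)

as_latin1 = pickle.loads(raw_pickle, encoding="latin1")  # str, one char per byte
as_bytes = pickle.loads(raw_pickle, encoding="bytes")    # bytes, untouched

# The default encoding is ASCII, which cannot decode byte 0xff.
try:
    pickle.loads(raw_pickle)
    default_worked = True
except UnicodeDecodeError:
    default_worked = False

# Real text that py2 stored as UTF-8 bytes comes back garbled under latin1.
text_pickle = py2_str_pickle("café".encode("utf-8"))
garbled = pickle.loads(text_pickle, encoding="latin1")
```

With latin1 every byte maps to exactly one code point, so the raw bytes can
be recovered with as_latin1.encode("latin1"), but the UTF-8 text comes back
as mojibake. That is the "can't handle both at once" problem.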

If you're desperate, it should be possible to get your data out of py2
pickles by loading them with one of the encoding options above, and then
going through the resulting object and converting all the actual textual
strings back to the correct encoding by hand. No data is actually lost. And
of course even this is unnecessary if your file contains only ASCII/latin1.
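
A rough sketch of that by-hand repair, assuming the object was loaded with
encoding='latin1', that the original text was UTF-8 (adjust if yours was
something else), and that every str left in the object is text rather than
raw bytes. The function name repair_text is made up for this example, and it
only walks plain dicts, lists and tuples.

```python
def repair_text(obj, original_encoding="utf-8"):
    """Undo a latin1 decode on every str found in nested containers.

    Assumption: each str was really text in `original_encoding` on py2.
    """
    if isinstance(obj, str):
        # latin1 is a 1:1 byte<->code point map, so this recovers the
        # original py2 bytes exactly before decoding them properly.
        return obj.encode("latin1").decode(original_encoding)
    if isinstance(obj, dict):
        return {
            repair_text(k, original_encoding): repair_text(v, original_encoding)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return type(obj)(repair_text(x, original_encoding) for x in obj)
    # Numbers, arrays and anything else are left alone.
    return obj


# What a latin1 load of UTF-8 text looks like, and the repaired result.
loaded = {"name": "caf\xc3\xa9", "tags": ["na\xc3\xafve", 3]}
fixed = repair_text(loaded)
```

Pure ASCII strings pass through unchanged, which matches the point that none
of this is needed if the file contains only ASCII.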

-n