[Numpy-discussion] py2/py3 pickling

Mon Aug 24 18:15:26 EDT 2015

Hi-

Would it then be possible (in relatively short order) to create a py2 -> py3 numpy pickle converter? It would run in py2, np.load or unpickle a pickle in the usual way, and then repickle and/or save it using a pickler that uses an explicit pickle type for the bytes associated with numpy dtypes. The numpy unpickler in py3 would then know what to do. I.e., is there a way to make the numpy py2 pickler explicit about byte strings? Presumably this would cover most use cases, even for complicated pickled objects, and could be used transparently within py2 or py3.
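
To illustrate the property such a converter would rely on, here is a minimal sketch using only the standard pickle module. Note that it runs on the py3 side, not in py2 as proposed above, and it is not an existing numpy tool. The function name repickle_with_explicit_bytes and the hand-assembled LEGACY payload are made up for the example. Once the data is written with a bytes opcode, the py3 unpickler no longer needs an encoding= hint.

```python
import pickle
import pickletools

# Hand-assembled protocol-2 pickle of a Python 2 str holding the raw bytes
# 0x00 0xFF. Opcodes: \x80\x02 = PROTO 2, U\x02 = SHORT_BINSTRING (length 2),
# "." = STOP.
LEGACY = b"\x80\x02U\x02\x00\xff."


def repickle_with_explicit_bytes(data):
    """Load a py2-style pickle as bytes and re-dump it with protocol 3.

    Hypothetical helper: encoding="bytes" turns *every* py2 str into bytes,
    so any real text in the pickle would still need separate handling.
    """
    obj = pickle.loads(data, encoding="bytes")
    return pickle.dumps(obj, protocol=3)


converted = repickle_with_explicit_bytes(LEGACY)

# The converted pickle loads without any encoding= argument.
restored = pickle.loads(converted)

# Inspect the opcodes to confirm the payload is now stored as bytes.
opcodes = [op.name for op, arg, pos in pickletools.genops(converted)]
```

Protocol 3 does not exist in py2, so a converter running there would need some other explicit marker for bytes; that part is left open here.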

Best, C

> On Aug 24, 2015, at 2:30 PM, Nathaniel Smith <njs at pobox.com> wrote:
> 
> On Aug 24, 2015 9:29 AM, "Pauli Virtanen" <pav at iki.fi> wrote:
> >
> > 24.08.2015, 01:02, Chris Laumann wrote:
> > [clip]
> > > Is there documentation about the limits and workarounds for py2/py3
> > > pickle/np.save/load compatibility? I haven't found anything except
> > > developer bug tracking discussions (eg. #4879 in github numpy).
> >
> > Not sure if it's written down somewhere but:
> >
> > - You should consider pickles not portable between Py2/3.
> >
> > - Setting encoding='bytes' or encoding='latin1' should produce correct
> > results for numerical data. However, neither is "safe" because the
> > option also affects any data other than numpy arrays that you may
> > have saved.
> 
> For those wondering what's going on here: if you pickled a str in python 2, then python 3 wants to unpickle it as a str. But in python 2 str was a vector of arbitrary bytes in some assumed encoding, and in python 3 str is a vector of Unicode characters. So it needs to know what encoding to use, which is fine and what you'd expect for the py2->py3 transition.
> 
> But: when pickling arrays, numpy on py2 used a str to store the raw memory of your array. Trying to run this data through a character decoder then obviously makes a mess of everything. So the fundamental problem is that on py2, there's no way to distinguish between a string of text and a string of bytes -- they're encoded in exactly the same way in the pickle file -- and the python 3 unpickler just has to guess. You can tell it to guess in a way that works for raw bytes -- that's what the encoding= options Pauli mentions above do -- but obviously this will then be incorrect if you have any actual non-latin1 textual strings in your pickle, and you can't get it to handle both correctly at the same time.
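
The guessing described above can be reproduced with the standard pickle module alone. This is a small sketch, not numpy code: PY2_PICKLE is a hand-assembled payload standing in for what py2 would write, and load_all_ways is a name invented for the example.

```python
import pickle

# Hand-assembled protocol-2 pickle of a Python 2 str holding the two bytes
# 0xC3 0xA9. They could be the UTF-8 text "é" or two bytes of raw array
# memory; the pickle stream does not say which.
# Opcodes: \x80\x02 = PROTO 2, U\x02 = SHORT_BINSTRING (length 2), "." = STOP.
PY2_PICKLE = b"\x80\x02U\x02\xc3\xa9."


def load_all_ways(data):
    """Unpickle the same py2 payload under several encoding= guesses."""
    results = {}
    for enc in ("ASCII", "latin1", "bytes", "utf-8"):
        try:
            results[enc] = pickle.loads(data, encoding=enc)
        except UnicodeDecodeError as exc:
            # The default (ASCII) cannot decode bytes above 0x7F.
            results[enc] = exc
    return results


results = load_all_ways(PY2_PICKLE)

# latin1 maps each byte to one code point, so the raw bytes are recoverable.
recovered = results["latin1"].encode("latin1")
```

With "latin1" or "bytes" the raw memory survives intact, while "utf-8" gives the right answer only if the payload really was UTF-8 text. No single setting is right for both kinds of string.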
> 
> If you're desperate, it should be possible to get your data out of py2 pickles by loading them with one of the encoding options above, and then going through the resulting object and converting all the actual textual strings back to the correct encoding by hand. No data is actually lost. And of course even this is unnecessary if your file contains only ASCII/latin1.
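
A rough sketch of that by-hand conversion, assuming the text in the pickle was originally UTF-8 (that is an assumption about your data, and so is the helper name fix_text). It handles only dicts, lists and tuples; anything else, including arrays, is passed through unchanged.

```python
import pickle

# Hand-assembled py2-style pickle of {'name': <UTF-8 bytes for "é">}.
# Opcodes: "}" = EMPTY_DICT, "U" = SHORT_BINSTRING, "s" = SETITEM, "." = STOP.
PY2_PICKLE = b"\x80\x02}U\x04nameU\x02\xc3\xa9s."


def fix_text(obj, text_encoding="utf-8"):
    """Re-decode str values that were loaded with encoding='latin1'.

    Heuristic: a str that cannot be round-tripped (for example one that was
    a py2 unicode object to begin with) is returned unchanged.
    """
    if isinstance(obj, str):
        try:
            # latin1 maps code points 0-255 back to the original bytes.
            return obj.encode("latin1").decode(text_encoding)
        except UnicodeError:
            return obj
    if isinstance(obj, dict):
        return {fix_text(k, text_encoding): fix_text(v, text_encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [fix_text(item, text_encoding) for item in obj]
    if isinstance(obj, tuple):
        return tuple(fix_text(item, text_encoding) for item in obj)
    return obj  # numbers, arrays and everything else are left untouched


loaded = pickle.loads(PY2_PICKLE, encoding="latin1")  # text is mangled here
fixed = fix_text(loaded)                              # text is repaired here
```

Objects with attributes (class instances) would need their own walking logic, which this sketch does not attempt.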
> 
> -n
