[Pandas-dev] pickle is evil

Wes McKinney wesmckinn at gmail.com
Tue Apr 23 20:52:46 CEST 2013

On Sun, Apr 21, 2013 at 6:19 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> I realized I didn't answer your question:
> this only catches errors from pickle.load
> try:
>     obj = pickle.load(fh)
> except TypeError:
>     # old-style pickle: retry with the compat unpickler
>     obj = pickle_compat.load(path)
> except Exception:
>     if not PY3:
>         raise
>     # on Python 3, try to unpickle with an encoding here
> On Apr 21, 2013, at 9:12 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>> avro (a better choice than msgpack, I think)
>> will be a very straightforward add-on
>> the format should probably be done independently of the internals anyhow, at the price of a bit more code; or it could store block managers and the code would be somewhat simpler
>> On Apr 21, 2013, at 9:01 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>> On Sun, Apr 21, 2013 at 3:01 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>>>> I thought I'd share a particularly evil pickle issue. In my refactor of
>>>> Series to not subclass ndarray, the new pickling tests were breaking. No
>>>> surprise, because I changed __getstate__ to pickle via the BlockManager.
>>>> In order to ensure compat I thought I could just fix __setstate__ and
>>>> figure out what to do based on the returned state (e.g. the len of the
>>>> state returned as a tuple or dict or whatever).
>>>> But no... apparently the reconstruction algorithm takes the class name that
>>>> it sees and tries to create it w/o using __new__ (or anything else that you
>>>> can intercept); it uses a builtin function called _reconstruct (which I
>>>> can't figure out how to override at all, so it must be C code only).
>>>> And then numpy gets ahold of it (as it's an extension type), and complains
>>>> because the class I am trying to instantiate actually isn't a subclass of
>>>> ndarray (which it presupposes).
>>>> So, a bit hacky, but using a custom unpickler, then matching on a
>>>> compatibility class (that subclasses ndarray), allows me to return the
>>>> correct class.
>>>> The good thing here is that this whole routine isn't even called unless
>>>> there is a TypeError on the original unpickle.
>>>> whoosh!
>>>> --------
>>>> # new module: compat/unpickle_compat.py
>>>> import numpy as np
>>>> import pandas
>>>> from pandas.core.series import Series
>>>> from pandas.sparse.series import SparseSeries
>>>> import pickle
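>>>> class Unpickler(pickle.Unpickler):
>>>>   # copy the dispatch table so the stock pickle.Unpickler is left untouched
>>>>   dispatch = pickle.Unpickler.dispatch.copy()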
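>>>> def load_reduce(self):
>>>>   # replacement handler for the REDUCE opcode
>>>>   stack = self.stack
>>>>   args = stack.pop()
>>>>   func = stack[-1]
>>>>   if args and type(args[0]) is type:
>>>>       n = args[0].__name__
>>>>       # the old pickle resolved to one of the stand-in classes patched in
>>>>       # by load(): hand back an empty instance of the new class instead
>>>>       # and let __setstate__ fill it in
>>>>       if n == 'DeprecatedSeries':
>>>>           stack[-1] = object.__new__(Series)
>>>>           return
>>>>       elif n == 'DeprecatedSparseSeries':
>>>>           stack[-1] = object.__new__(SparseSeries)
>>>>           return
>>>>   # anything else: default REDUCE behavior
>>>>   value = func(*args)
>>>>   stack[-1] = value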
>>>> Unpickler.dispatch['R'] = load_reduce
>>>> def load(file):
>>>>   # try to load a compatibility pickle
>>>>   # fake the old class hierarchy
>>>>   # if it works, then return the new type objects
>>>>   try:
>>>>       pandas.core.series.Series = DeprecatedSeries
>>>>       pandas.sparse.series.SparseSeries = DeprecatedSparseSeries
>>>>       with open(file,'rb') as fh:
>>>>           return Unpickler(fh).load()
>>>>   finally:
>>>>       # always restore the real classes, even if unpickling fails
>>>>       pandas.core.series.Series = Series
>>>>       pandas.sparse.series.SparseSeries = SparseSeries
>>>> class DeprecatedSeries(Series, np.ndarray):
>>>>   pass
>>>> class DeprecatedSparseSeries(DeprecatedSeries):
>>>>   pass
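
For comparison, the same class-remapping idea can be sketched on Python 3 without touching the REDUCE opcode, by overriding `pickle.Unpickler.find_class`. This is only an illustration under assumptions: `NewPoint`, `legacy_mod`, `OldPoint`, `RENAMES` and `CompatUnpickler` are hypothetical names, not pandas code, and it does not cover the ndarray `_reconstruct` case described above.

```python
import io
import pickle


class NewPoint:
    """Hypothetical stand-in for a class whose internals were refactored."""

    def __setstate__(self, state):
        # Accept both the legacy tuple state and a newer dict state.
        if isinstance(state, tuple):
            self.x, self.y = state
        else:
            self.__dict__.update(state)


# Hypothetical table: (module, name) as recorded in old pickles -> new class.
RENAMES = {("legacy_mod", "OldPoint"): NewPoint}


class CompatUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect known legacy names; defer to the default lookup otherwise.
        try:
            return RENAMES[(module, name)]
        except KeyError:
            return super().find_class(module, name)


def load_compat(data):
    """Unpickle bytes, mapping legacy class names onto their replacements."""
    return CompatUnpickler(io.BytesIO(data)).load()
```

Because the remapping lives in a subclass, nothing global gets patched while loading, which sidesteps the swap-and-restore step in `load` above.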
>>> Yes, pickle is evil. Will this fix affect pickle.loads/pickle.dumps? I
>>> would prefer to get a msgpack or Avro-based serialization format for
>>> Series or DataFrame sorted out before we start gutting the internals
>>> of the objects.
>>> - Wes
We have to be careful with the Deprecated hack, as there could be
threading issues. Oh boy. I'm not sure how much I want to support
legacy pickles anyway; it would be better to have a release of pandas
that enables pickle -> avro/msgpack serialized form so that people can
migrate all their pickle data to that format, then we can feel free to
break all the pickles, or at least versioning of serialized data
becomes easier (when pickling/unpickling, we just pack the serialized
bytes into the pickle, and that becomes something we can always

Sigh, it's 2013 and I've been talking about fixing the
pickle/serialization problem since 2011, actually even earlier I
think. Weekend project one of these days.

- Wes

