[Python-Dev] Unpickling memory usage problem, and a proposed solution

Fri Apr 23 23:18:13 CEST 2010

On Fri, Apr 23, 2010 at 1:53 PM, Alexandre Vassalotti
<alexandre at peadrop.com> wrote:
> On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin <dgindikin at gmail.com> wrote:
>> This wouldn't help our use case: your code needs the entire pickle
>> stream to be in memory, which in our case would be about 475 MB. This
>> is on top of the 300+ MB of data structures that generated the pickle
>> stream.
>>
>
> In that case, the best we could do is a two-pass algorithm to remove
> the unused PUTs. That won't be efficient, but it will satisfy the
> memory constraint. Another solution is to not generate the PUTs at all
> by setting the 'fast' attribute on Pickler. But that won't work if you
> have a recursive structure, or have code that requires the
> identity of objects to be preserved.
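
For illustration, here is a minimal sketch of the 'fast' trade-off described in the quoted text. It assumes Python 3's pickle module, where Pickler still exposes a (deprecated) 'fast' attribute; the helper names dump_with and opcode_names are made up for this example. It compares the opcodes emitted with and without 'fast', and checks whether a shared sub-object is still shared after unpickling.

```python
import io
import pickle
import pickletools

# One list referenced twice, so object identity matters on reload.
shared = ["x", "y"]
obj = [shared, shared]

def dump_with(value, use_fast):
    """Pickle `value`, optionally with the Pickler's 'fast' mode enabled."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol=2)
    pickler.fast = use_fast  # 'fast' disables the memo, so no PUT/GET opcodes
    pickler.dump(value)
    return buf.getvalue()

def opcode_names(pickle_bytes):
    """Return the opcode names of the first pickle found in `pickle_bytes`."""
    return [op.name for op, arg, pos in pickletools.genops(pickle_bytes)]

normal_bytes = dump_with(obj, False)
fast_bytes = dump_with(obj, True)

normal_ops = opcode_names(normal_bytes)
fast_ops = opcode_names(fast_bytes)

normal_loaded = pickle.loads(normal_bytes)
fast_loaded = pickle.loads(fast_bytes)

# With the memo, the second reference is a GET and identity survives.
print("normal has PUT/GET:", "BINPUT" in normal_ops, "BINGET" in normal_ops)
print("normal keeps identity:", normal_loaded[0] is normal_loaded[1])

# In fast mode the shared list is simply written out twice.
print("fast has PUT/GET:", "BINPUT" in fast_ops, "BINGET" in fast_ops)
print("fast keeps identity:", fast_loaded[0] is fast_loaded[1])
```

The data is deliberately non-recursive; a self-referencing structure is the case the quoted text warns will not work in 'fast' mode.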

I don't think it's possible in general to remove any PUTs if the
pickle is being written to a file-like object. It is possible to reuse
a single Pickler to pickle multiple objects: this causes the Pickler's
memo dict to be shared between the objects being pickled. If you
pickle foo, bar, and baz, foo may not have any GETs, but bar and baz
may have GETs that reference data added to the memo by foo's PUT
operations. Because you can't know what will be written to the
file-like object later, you can't remove any of the PUT instructions
in this scenario.
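
The scenario above can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python 3's pickle module rather than the cPickle under discussion, pickles two objects instead of three, and the variable names (foo, bar, shared) are invented for the example. It shows one Pickler writing two objects to the same stream, where the second pickle contains a GET that refers to a PUT emitted by the first.

```python
import io
import pickle
import pickletools

# `shared` is reachable from both foo and bar.
shared = ["some", "shared", "data"]
foo = {"payload": shared}
bar = {"again": shared}

buf = io.BytesIO()
pickler = pickle.Pickler(buf, protocol=2)  # one Pickler, one shared memo

pickler.dump(foo)
foo_end = buf.tell()   # remember where foo's pickle ends in the stream
pickler.dump(bar)

data = buf.getvalue()
foo_bytes = data[:foo_end]
bar_bytes = data[foo_end:]

def opcode_names(pickle_bytes):
    """Return the opcode names of the first pickle found in `pickle_bytes`."""
    return [op.name for op, arg, pos in pickletools.genops(pickle_bytes)]

foo_ops = opcode_names(foo_bytes)
bar_ops = opcode_names(bar_bytes)

# foo's pickle only stores memo entries; bar's pickle reads one of them back.
print("foo uses GET:", "BINGET" in foo_ops)
print("bar uses GET:", "BINGET" in bar_ops)

# A single Unpickler keeps its memo across load() calls, so bar's GET resolves.
unpickler = pickle.Unpickler(io.BytesIO(data))
foo_loaded = unpickler.load()
bar_loaded = unpickler.load()
print("shared object preserved:", foo_loaded["payload"] is bar_loaded["again"])

# Loaded on its own, bar's pickle refers to a memo entry that was never set.
try:
    pickle.loads(bar_bytes)
    bar_alone_ok = True
except (pickle.UnpicklingError, KeyError):
    bar_alone_ok = False
print("bar loads on its own:", bar_alone_ok)
```

Judged in isolation, every PUT in foo's pickle looks unused, yet stripping them would leave bar's pickle pointing at memo entries that no longer exist.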

This kind of thing is done in real-world code like cvs2svn (which I
broke when I was optimizing cPickle; don't break cvs2svn, it's not fun
to fix :). I added some basic tests for this behavior to CPython's
Lib/test/pickletester.py.

There might be room for app-specific optimizations that strip unused
PUTs, but I'm not sure that would work for a general-purpose cPickle
that needs to stay compatible with the current system.

Collin Winter