[Python-Dev] Unpickling memory usage problem, and a proposed solution

Dan Gindikin dgindikin at gmail.com
Fri Apr 23 23:44:50 CEST 2010


Collin Winter <collinwinter <at> google.com> writes:
> I don't think it's possible in general to remove any PUTs if the
> pickle is being written to a file-like object. It is possible to reuse
> a single Pickler to pickle multiple objects: this causes the Pickler's
> memo dict to be shared between the objects being pickled. If you
> pickle foo, bar, and baz, foo may not have any GETs, but bar and baz
> may have GETs that reference data added to the memo by foo's PUT
> operations. Because you can't know what will be written to the
> file-like object later, you can't remove any of the PUT instructions
> in this scenario.

Hmm, that is a good point. A possible solution would be for the
two-pass optimizer to scan through the entire file, going right
through '.' (STOP) opcodes. That would deal with the case you are
describing, but not if the user "maliciously" wrote some other
data into the file in between pickle dumps, all the while reusing
the same pickler.
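
To make the shared-memo case concrete, here is a rough sketch of what
scanning straight through '.' opcodes could look like. It leans on
pickletools.genops(), which stops after each STOP and leaves a file-like
object positioned just past it. The helper names scan_all and
dangling_gets are made up for illustration, and protocol 2 is pinned
only so that every memo entry shows up as an explicit PUT opcode.

```python
import io
import pickle
import pickletools


def scan_all(stream):
    """Yield (pickle_index, opcode_name, arg) for every pickle in stream."""
    index = 0
    while True:
        pos = stream.tell()
        if stream.read(1) == b"":
            return  # clean end of file
        stream.seek(pos)
        # genops() stops after the '.' opcode and leaves the stream there.
        for op, arg, _pos in pickletools.genops(stream):
            yield index, op.name, arg
        index += 1


def dangling_gets(stream):
    """Map pickle index -> memo ids it GETs without a PUT of its own."""
    puts, gets = {}, {}
    for index, name, arg in scan_all(stream):
        if name.endswith("PUT"):
            puts.setdefault(index, set()).add(arg)
        elif name.endswith("GET"):
            gets.setdefault(index, set()).add(arg)
    result = {}
    for index, ids in gets.items():
        missing = ids - puts.get(index, set())
        if missing:
            result[index] = missing
    return result


shared = ["payload", 1, 2]
buf = io.BytesIO()
pickler = pickle.Pickler(buf, protocol=2)  # one Pickler, so one shared memo
pickler.dump([shared])          # first pickle PUTs `shared` into the memo
pickler.dump([shared, shared])  # second pickle only GETs it
buf.seek(0)
result = dangling_gets(buf)
```

Here result has an entry only for the second pickle (index 1), which is
exactly the PUT in the first pickle that an optimizer looking at that
pickle in isolation would wrongly consider unused.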

I think a better solution would be to make sure that the '.' is
the last thing in the file and die otherwise. This would at least
ensure correctness, and it would detect the cases that the optimizer
could not handle.
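
A minimal sketch of that check, assuming the optimizer is handed a
seekable file-like object. The name require_trailing_stop is
hypothetical; the only library call it relies on is
pickletools.genops(), which reads up to and including the first '.'
opcode.

```python
import io
import pickle
import pickletools


def require_trailing_stop(stream):
    """Raise ValueError unless the first pickle's '.' is the final byte."""
    for _op, _arg, _pos in pickletools.genops(stream):
        pass  # consume opcodes; genops() stops right after the first '.'
    if stream.read(1) != b"":
        raise ValueError("data follows the '.' opcode; refusing to optimize")


# A lone pickle passes the check silently.
single = io.BytesIO(pickle.dumps([1, 2, 3], protocol=2))
require_trailing_stop(single)

# Two dumps from one Pickler leave bytes after the first '.', so we die.
double = io.BytesIO()
pickler = pickle.Pickler(double, protocol=2)
pickler.dump("foo")
pickler.dump("bar")
double.seek(0)
try:
    require_trailing_stop(double)
    rejected = False
except ValueError:
    rejected = True
```

This refuses anything it cannot prove safe rather than trying to guess
what else was written to the file.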

> don't break cvs2svn, it's not fun
> to fix :). I added some basic tests for this support in cPython's
> Lib/test/pickletester.py.

Thanks for the warning :)

> There might be room for app-specific optimizations that do this, but
> I'm not sure it would work for a general-usage cPickle that needs to
> stay compatible with the current system.

That may well be true. Still, when trying to deal with large data
you really need something like this. Our situation was made worse because
we had extension types. As they were allocated, they got interspersed
with temporaries generated by the spurious PUTs, and that is what
really fragmented the memory. However, it's probably not a stretch to
assume that if you are dealing with large stuff through Python you are
going to have extension types in the mix.





