Segmenting a pickle stream without unpickling

Fri May 19 15:51:28 EDT 2006

[Boris Borcic]
> Assuming that the items of my_stream share no content (they are
> dumps of db cursor fetches), is there a simple way to do the
> equivalent of
>
> def pickles(my_stream) :
>      from cPickle import load,dumps
>      while 1 :
>          yield dumps(load(my_stream))
>
> without the overhead associated with unpickling objects
> just to pickle them again ?

cPickle (but not pickle.py) Unpickler objects have a barely documented
noload() method.  It "acts like" load(), except that it doesn't import
modules or construct objects of user-defined classes.  The return
value of noload() is undocumented and usually useless.  ZODB uses it a
lot ;-)

Anyway, that can go much faster than load(), and works even if the
classes and modules referenced by pickles aren't available in the
unpickling environment.  It doesn't return the individual pickle
strings, but they're easy to get at by paying attention to the file
position between noload() calls.  For example,

import cPickle as pickle
import os

# Build a pickle file with 4 pickles.

PICKLEFILE = "temp.pck"

class C:
    pass

f = open(PICKLEFILE, "wb")
p = pickle.Pickler(f, 1)

p.dump(2)
p.dump([3, 4])
p.dump(C())
p.dump("all done")

f.close()

# Now use noload() to extract the 4 pickle
# strings in that file.

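f = open(PICKLEFILE, "rb")
limit = os.path.getsize(PICKLEFILE)
u = pickle.Unpickler(f)
pickles = []
pos = 0  # file offset where the current pickle starts
while pos < limit:
    u.noload()          # parse one pickle; the return value is ignored
    thispos = f.tell()  # offset just past that pickle's STOP opcode
    f.seek(pos)
    pickles.append(f.read(thispos - pos))  # raw bytes of that one pickle
    pos = thispos       # the read left f positioned at the next pickle
f.close()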

from pprint import pprint
pprint(pickles)

That prints a list containing the 4 pickle strings:

['K\x02.',
 ']q\x01(K\x03K\x04e.',
 '(c__main__\nC\nq\x02o}q\x03b.',
 'U\x08all doneq\x04.']
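
The q\x01 ... q\x04 bytes in that output are memo keys, and they keep
counting across pickles because a single Pickler wrote all four.  That
is harmless here, since the objects share nothing (the assumption in
the original question).  If they did share content, a later pickle
could refer back to a memo entry made by an earlier one, and its
string would not load on its own.  A rough way to check for that,
sketched against the pickle and pickletools modules so it also runs
where cPickle isn't available; is_self_contained() is a made-up helper
name, and the sketch assumes protocols that use explicit PUT/GET memo
opcodes (protocol 3 or lower):

```python
import io
import pickle
import pickletools

# Opcode names that write to / read from the memo with an explicit key.
PUTS = ("PUT", "BINPUT", "LONG_BINPUT")
GETS = ("GET", "BINGET", "LONG_BINGET")

def is_self_contained(chunk):
    """True if every memo key this pickle reads was also written by it."""
    written = set()
    for opcode, arg, pos in pickletools.genops(chunk):
        if opcode.name in PUTS:
            written.add(arg)
        elif opcode.name in GETS and arg not in written:
            return False  # refers to a memo entry from an earlier pickle
    return True

# Two dumps through ONE Pickler; the second refers back to the first.
shared = ["x", "y"]
buf = io.BytesIO()
p = pickle.Pickler(buf, 1)
p.dump(shared)
cut = buf.tell()          # boundary between the two pickles
p.dump([shared])
data = buf.getvalue()
first, second = data[:cut], data[cut:]

print(is_self_contained(first), is_self_contained(second))
```

Calling clear_memo() on the Pickler between dumps, or using a fresh
Pickler per object, avoids the cross-references at the cost of
pickling any shared objects again.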

You could do much the same by calling pickletools.dis() and ignoring
its output, but that's likely to be slower.
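
A sketch of that pickletools route, using pickletools.genops() rather
than dis() so there is no output to throw away.  genops() walks the
opcodes of one pickle without building any objects and stops after
the STOP opcode, so the file position can be sampled the same way as
with noload().  split_pickles() is a name invented for this example,
and the stream is assumed to be seekable and opened in binary mode:

```python
import io
import pickle
import pickletools

def split_pickles(stream):
    """Yield the raw bytes of each pickle in a seekable binary stream."""
    while True:
        start = stream.tell()
        if not stream.read(1):   # nothing left: end of stream
            return
        stream.seek(start)
        # Consume the opcodes of exactly one pickle; no objects are built.
        for opcode, arg, pos in pickletools.genops(stream):
            pass
        end = stream.tell()      # just past the STOP opcode
        stream.seek(start)
        yield stream.read(end - start)

# Build a small stream of independent pickles and split it again.
objs = [2, [3, 4], "all done"]
stream = io.BytesIO(b"".join(pickle.dumps(o, 1) for o in objs))
chunks = list(split_pickles(stream))
print(len(chunks))
```

pickletools is pure Python, so expect this to be slower than noload(),
but it doesn't need the pickled classes to be importable either.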