writing large dictionaries to file using cPickle

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Wed Jan 28 21:11:33 EST 2009


On Wed, 28 Jan 2009 14:13:10 -0200, <perfreem at gmail.com> wrote:

> I have a large dictionary which contains about 10 keys; each key has a
> value which is a list containing about 1 to 5 million (small)
> dictionaries. For example,
>
> mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> 'world'}, ...],
>           key2: [...]}
>
> [pickle] creates extremely large files (~ 300 MB), and it does so
> *extremely* slowly. It writes about 1 megabyte every 5 or 10 seconds, and
> it gets slower and slower. It takes almost an hour, if not more, to
> write this pickle object to file.

There is an undocumented Pickler attribute, "fast". Usually, when the same  
object is referenced more than once, only the first appearance is stored  
in the pickled stream; later references just point to the original. This  
requires the Pickler instance to remember every object pickled so far --  
setting the "fast" attribute to a true value bypasses this check. Before  
using this, you must be positively sure that your objects don't contain  
circular references -- else pickling will never finish.

py> from cPickle import Pickler
py> from cStringIO import StringIO
py> s = StringIO()
py> p = Pickler(s, -1)
py> p.fast = 1
py> x = [1,2,3]
py> y = [x, x, x]
py> y
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y[0] is y[1]
True
py> p.dump(y)
<cPickle.Pickler object at 0x00BC0E48>
py> s.getvalue()
'\x80\x02](](K\x01K\x02K\x03e](K\x01K\x02K\x03e](K\x01K\x02K\x03ee.'

Note that shared references are lost -- after unpickling, the three inner  
lists are separate objects:

py> s.seek(0,0)
py> from cPickle import load
py> y2 = load(s)
py> y2
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y2[0] is y2[1]
False
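
Applied to your case -- a big dict written straight to a file -- something  
along these lines should do (a sketch only; "mydict" stands in for your  
data and "mydict.pkl" is just an example filename; fast = 1 is safe only  
if the records share no objects and contain no cycles):

from cPickle import Pickler

# stand-in for the real data described above
mydict = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}], 'key2': []}

out = open('mydict.pkl', 'wb')   # binary mode is required for protocol 2
p = Pickler(out, -1)             # -1 = highest protocol: binary, smaller and faster
p.fast = 1                       # skip the memo -- no shared/circular refs allowed
p.dump(mydict)
out.close()

Using protocol -1 instead of the default (protocol 0, ASCII) already cuts  
both the file size and the writing time considerably; the fast attribute  
removes the memo bookkeeping on top of that.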

-- 
Gabriel Genellina



