writing large dictionaries to file using cPickle
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Wed Jan 28 21:11:33 EST 2009
On Wed, 28 Jan 2009 14:13:10 -0200, <perfreem at gmail.com> wrote:
> i have a large dictionary which contains about 10 keys; each key has a
> value which is a list containing about 1 to 5 million (small)
> dictionaries. for example,
>
> mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> 'world'}, ...],
>           key2: [...]}
>
> [pickle] creates extremely large files (~300 MB) and does so
> *extremely* slowly. it writes about 1 megabyte every 5 to 10 seconds,
> and it gets slower and slower. it takes almost an hour, if not more,
> to write this pickle object to file.
There is an undocumented Pickler attribute, "fast". Usually, when the same
object is referenced more than once, only the first appearance is stored
in the pickled stream; later references just point to the original. This
requires the Pickler instance to remember every object pickled so far --
setting the "fast" attribute to a true value bypasses this check. Before
using this, you must be positively sure that your objects don't contain
circular references -- else pickling will never finish.
py> from cPickle import Pickler
py> from cStringIO import StringIO
py> s = StringIO()
py> p = Pickler(s, -1)
py> p.fast = 1
py> x = [1,2,3]
py> y = [x, x, x]
py> y
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y[0] is y[1]
True
py> p.dump(y)
<cPickle.Pickler object at 0x00BC0E48>
py> s.getvalue()
'\x80\x02](](K\x01K\x02K\x03e](K\x01K\x02K\x03e](K\x01K\x02K\x03ee.'
Note that, when unpickling, shared references are broken:
py> s.seek(0,0)
py> from cPickle import load
py> y2 = load(s)
py> y2
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y2[0] is y2[1]
False
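For anyone reading this on Python 3, where pickle has absorbed cPickle, the same attribute still exists; here is a sketch (my own, not from the thread) that contrasts normal pickling with fast mode on a small list-of-dicts shaped like the original poster's data:

```python
import io
import pickle

# Three references to the same small dict, mimicking the shared-object case.
x = {'a': 1, 'b': 2, 'c': 'hello'}
data = [x] * 3

# Normal mode: the second and third occurrences become memo references.
buf_normal = io.BytesIO()
pickle.Pickler(buf_normal, protocol=2).dump(data)

# Fast mode: the memo is bypassed, so the dict is serialized three times.
buf_fast = io.BytesIO()
p = pickle.Pickler(buf_fast, protocol=2)
p.fast = True
p.dump(data)

# The fast-mode stream is larger because shared objects are duplicated,
# but the pickler no longer has to remember every object it has seen.
print(len(buf_normal.getvalue()), len(buf_fast.getvalue()))
```

As in the session above, unpickling the fast-mode stream yields three independent dicts rather than three references to one, so this trade-off only makes sense when the data contains no shared or circular references.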
--
Gabriel Genellina