writing large dictionaries to file using cPickle

John Machin sjmachin at lexicon.net
Wed Jan 28 17:14:36 EST 2009


On Jan 29, 3:13 am, perfr... at gmail.com wrote:
> Hello all,
>
> I have a large dictionary with about 10 keys; each key's value is a
> list containing about 1 to 5 million (small) dictionaries. For
> example,
>
> mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> 'world'}, ...],
>           key2: [...]}
>
> In total there are about 10 to 15 million of these small dictionaries
> if we concatenate all the values of every key in 'mydict'. mydict is a
> structure that represents the data in a very large file (about 800
> megabytes).
>
> What is the fastest way to pickle 'mydict' into a file? Right now I am
> experiencing a lot of difficulty with cPickle when using it like this:
>
> import cPickle as pickle
> pfile = open(my_file, 'w')
> pickle.dump(mydict, pfile)
> pfile.close()
>
> This creates an extremely large file (~300 MB), and it does so
> *extremely* slowly: it writes about 1 megabyte every 5 to 10 seconds,
> and it gets slower and slower. It takes almost an hour, if not more,
> to write this pickle object to file.
>
> Is there any way to speed this up? I don't mind the large file; after
> all, the text file the dictionary was built from (~800 MB) is larger
> than the file it eventually creates (300 MB). But I do care about
> speed...
>
> I have tried to optimize this by using:
>
> s = pickle.dumps(mydict, 2)
> pfile.write(s)
>
> But this takes just as long... Any ideas? Is there a different module
> I could use that is more suitable for large dictionaries?
> Thank you very much.
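
One thing worth trying before anything else: the default pickle
protocol (0) is text-based and slow. Something like this (untested
sketch, reusing your variable names) should be noticeably faster and
produce a smaller file:

import cPickle as pickle

pfile = open(my_file, 'wb')        # binary mode, needed for protocol 2
pickle.dump(mydict, pfile, 2)      # protocol 2: binary, much faster than 0
pfile.close()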

Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).
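
Something along these lines (untested sketch; the file-naming scheme
is just an illustration, adjust to taste):

import cPickle as pickle

# One pickle per top-level key, so each dump handles a smaller object
# and each piece can be re-loaded on its own later.
for key, value in mydict.iteritems():
    part = open('mydict_%s.pkl' % key, 'wb')
    try:
        pickle.dump(value, part, 2)   # protocol 2, binary
    finally:
        part.close()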


