writing large dictionaries to file using cPickle

perfreem at gmail.com
Wed Jan 28 17:43:16 EST 2009


On Jan 28, 5:14 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 29, 3:13 am, perfr... at gmail.com wrote:
>
>
>
> > hello all,
>
> > i have a large dictionary which contains about 10 keys, each key has a
> > value which is a list containing about 1 to 5 million (small)
> > dictionaries. for example,
>
> > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > 'world'}, ...],
> >           key2: [...]}
>
> > in total there are about 10 to 15 million lists if we concatenate
> > together all the values of every key in 'mydict'. mydict is a
> > structure that represents data in a very large file (about 800
> > megabytes).
>
> > what is the fastest way to pickle 'mydict' into a file? right now i am
> > experiencing a lot of difficulties with cPickle when using it like
> > this:
>
> > import cPickle as pickle
> > pfile = open(my_file, 'wb')
> > pickle.dump(mydict, pfile)
> > pfile.close()
>
> > this creates extremely large files (~ 300 MB) though it does so
> > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> > it gets slower and slower. it takes almost an hour if not more to
> > write this pickle object to file.
>
> > is there any way to speed this up? i don't mind the large file... after
> > all, the text file with the data used to make the dictionary was larger
> > (~ 800 MB) than the file it eventually creates, which is 300 MB. but
> > i do care about speed...
>
> > i have tried optimizing this by using this:
>
> > s = pickle.dumps(mydict, 2)
> > pfile.write(s)
>
> > but this takes just as long... any ideas? is there a different module
> > i could use that's more suitable for large dictionaries?
> > thank you very much.
>
> Pardon me if I'm asking the "bleedin' obvious", but have you checked
> how much virtual memory this is taking up compared to how much real
> memory you have? If the slowness is due to pagefile I/O, consider
> doing "about 10" separate pickles (one for each key in your top-level
> dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram, and memory is not heavily taxed at
all. do you know how file I/O can be sped up?
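for what it's worth, a minimal sketch of the usual speed-up: open the file in binary mode and pass a binary pickle protocol (protocol 2 was the highest when cPickle was current). the helper names `dump_fast`/`load_fast` are just illustrative, not anything from the poster's code:

```python
# Sketch: a binary protocol plus binary-mode files usually cuts both
# dump time and file size versus the default text protocol 0.
# cPickle was folded into pickle in Python 3, hence the fallback.
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

def dump_fast(obj, path):
    # 'wb' (binary mode) is required for binary pickle protocols.
    with open(path, 'wb') as pfile:
        pickle.dump(obj, pfile, 2)

def load_fast(path):
    with open(path, 'rb') as pfile:
        return pickle.load(pfile)
```

note that `pickle.dump(mydict, pfile)` with no protocol argument, as in the original code, uses the slow ASCII protocol 0 even with cPickle, which would match the symptoms described above.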

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
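shelve does call pickle under the hood, but per key rather than on the whole dict at once: each top-level value is pickled and written as a separate record, so you can write and later reload one key at a time instead of holding one monolithic pickle in memory. a rough sketch (the helper names are hypothetical):

```python
# Sketch: shelve pickles each value independently under its string key,
# so a huge dict can be written, and later read back, key by key.
import shelve

def dump_by_key(mydict, path):
    db = shelve.open(path, protocol=2)  # binary protocol, as with pickle
    try:
        for key, value in mydict.items():
            db[key] = value  # each value pickled and stored separately
    finally:
        db.close()

def load_one_key(path, key):
    db = shelve.open(path, flag='r')  # read-only; loads only this value
    try:
        return db[key]
    finally:
        db.close()
```

whether this is faster overall depends on the dbm backend, but it at least matches the earlier suggestion of "about 10" separate pickles, one per top-level key.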



More information about the Python-list mailing list