writing large dictionaries to file using cPickle

perfreem at gmail.com
Fri Jan 30 15:44:07 EST 2009


On Jan 28, 6:08 pm, Aaron Brady <castiro... at gmail.com> wrote:
> On Jan 28, 4:43 pm, perfr... at gmail.com wrote:
>
> > On Jan 28, 5:14 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > > On Jan 29, 3:13 am, perfr... at gmail.com wrote:
>
> > > > hello all,
>
> > > > i have a large dictionary with about 10 keys; each key's value
> > > > is a list containing about 1 to 5 million (small) dictionaries.
> > > > for example,
>
> > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > > > 'world'}, ...],
> > > >                 key2: [...]}
>
> > > > in total there are about 10 to 15 million of these small
> > > > dictionaries if we concatenate together all the values of every
> > > > key in 'mydict'. mydict is a
> > > > structure that represents data in a very large file (about 800
> > > > megabytes).
>
> snip
>
> > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > if that's the case, it wouldn't be any faster, right?
>
> Yes, but it doesn't pickle everything at once -- shelve pickles each
> value separately.  It's a clear winner if you need to update
> any of them later, but if it's just write-once, read-many, it's about
> the same.
>
> You said you have a million dictionaries.  Even if each took only one
> byte, you would still have a million bytes.  Do you expect a faster
> I/O time than the time it takes to write a million bytes?
>
> I want to agree with John's worry about RAM, unless you have several+
> GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would perform about the same,
if i understand correctly?
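
(for concreteness, here's a minimal sketch of the two approaches being
compared. this assumes python 2's cPickle and shelve; the filenames and
sample data are invented:)

# sketch of the two approaches (python 2; filenames/data invented)
import shelve
import cPickle as pickle

mydict = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}], 'key2': []}

# shelve: pickles each top-level value separately into a dbm-backed
# file, so one key can be updated later without rewriting the rest
db = shelve.open('mydict.db')
for k, v in mydict.items():
    db[k] = v
db.close()

# cPickle: serializes the whole dict in one shot; for write-once,
# read-many the total pickling work is about the same as shelve's
f = open('mydict.pkl', 'wb')
pickle.dump(mydict, f, 2)  # 2 = binary protocol
f.close()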

the file i'm reading in is ~800 MB, and the pickle file is around
300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this, or about how
i/o can be sped up, for example?
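
one thing worth checking is the pickle protocol: the default protocol 0
writes ascii, and the binary protocol 2 is documented to be faster and
more compact. a sketch of the comparison (toy data, made-up filenames;
python 2):

# compare the default ascii protocol 0 against binary protocol 2
import time
import cPickle as pickle

# toy stand-in for the real data: many small dicts in a list
data = [{'a': i, 'b': 2 * i, 'c': 'hello'} for i in xrange(500000)]

for proto in (0, 2):
    start = time.time()
    f = open('test_p%d.pkl' % proto, 'wb')
    pickle.dump(data, f, proto)
    f.close()
    print 'protocol %d: %.1f s' % (proto, time.time() - start)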

the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?
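
(one way to separate the two costs: pickle into an in-memory buffer
first, then time writing the already-pickled bytes to disk on their
own. a sketch, with a toy dict standing in for the real data; python 2:)

# separate pure pickling cost from pure disk i/o cost
import time
import cPickle as pickle
from cStringIO import StringIO

mydict = dict(('key%d' % i, i) for i in xrange(1000000))  # toy stand-in

# step 1: pickle to memory only -- no disk involved
start = time.time()
buf = StringIO()
pickle.dump(mydict, buf, 2)
print 'pickling alone: %.1f s' % (time.time() - start)

# step 2: write the already-pickled bytes to disk -- no pickling
start = time.time()
f = open('test.pkl', 'wb')
f.write(buf.getvalue())
f.close()
print 'raw write alone: %.1f s' % (time.time() - start)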

thanks.


