writing large dictionaries to file using cPickle

Aaron Brady castironpi at gmail.com
Fri Jan 30 19:47:50 EST 2009


On Jan 30, 2:44 pm, perfr... at gmail.com wrote:
> On Jan 28, 6:08 pm, Aaron Brady <castiro... at gmail.com> wrote:
>
>
>
> > On Jan 28, 4:43 pm, perfr... at gmail.com wrote:
>
> > > On Jan 28, 5:14 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > > > On Jan 29, 3:13 am, perfr... at gmail.com wrote:
>
> > > > > hello all,
>
> > > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > > value which is a list containing about 1 to 5 million (small)
> > > > > dictionaries. for example,
>
> > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > > > > 'world'}, ...],
> > > > >                 key2: [...]}
>
> > > > > in total there are about 10 to 15 million lists if we concatenate
> > > > > together all the values of every key in 'mydict'. mydict is a
> > > > > structure that represents data in a very large file (about 800
> > > > > megabytes).
>
> > snip
>
> > > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > > if that's the case, it wouldn't be any faster, right?
>
> > Yes, but not all at once.  It's a clear winner if you need to update
> > any of them later, but if it's just write-once, read-many, it's about
> > the same.
>
> > You said you have a million dictionaries.  Even if each took only one
> > byte, you would still have a million bytes.  Do you expect a faster I/
> > O time than the time it takes to write a million bytes?
>
> > I want to agree with John's worry about RAM, unless you have several+
> > GB, as you say.  You are not dealing with small numbers.
>
> in my case, i just write the pickle file once and then read it in
> later. in that case, cPickle and shelve would be identical, if i
> understand correctly?

No, not identical.  'shelve' doesn't hand you a dictionary; it gives
you a database-backed object that implements the mapping protocol.
'isinstance( shelve.open( "data.db" ), dict )' is False, for example.
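
A rough sketch of the difference (the file name is just a
placeholder):

import shelve

# shelve.open returns a Shelf backed by a dbm file on disk; each
# assignment pickles just that one value and writes it through, so
# you can update individual entries later without re-writing the
# whole structure at once.
db = shelve.open( 'mydata.db' )
db['key1'] = [{'a': 1, 'b': 2, 'c': 'hello'}]
db.close()

# later, in a separate run:
db = shelve.open( 'mydata.db' )
print db['key1']
db.close()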

> the file i'm reading in is ~800 MB file, and the pickle file is around
> 300 MB. even if it were 800 MB, it doesn't make sense to me that
> python's i/o would be that slow... it takes roughly 5 seconds to write
> one megabyte of a binary file (the pickled object in this case), which
> just seems wrong. does anyone know anything about this? about how i/o
> can be sped up for example?

You can try copying a 1-MB file.  Or something like:

f= open( 'temp.temp', 'w' )
for x in range( 100000 ):
    f.write( '0'* 10 )    # 100,000 writes of 10 bytes = 1 MB total
f.close()

You know how long it takes OSes to boot, right?

> the dictionary might have a million keys, but each key's value is very
> small. i tried the same example where the keys are short strings (and
> there are about 10-15 million of them) and each value is an integer,
> and it is still very slow. does anyone know how to test whether i/o is
> the bottleneck, or whether it's something specific about pickle?
>
> thanks.
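
One way to separate the two is to time the pickling and the disk
write independently: pickle into memory first, then write the
resulting string out.  A rough sketch (it assumes 'mydict' is
already built, the output file name is a placeholder, and note that
dumps() holds the entire pickle string in RAM):

import time
import cPickle

start = time.time()
data = cPickle.dumps( mydict, cPickle.HIGHEST_PROTOCOL )   # pickling only, no disk
print 'pickle time: %.1f s' % ( time.time() - start )

start = time.time()
f = open( 'mydict.pkl', 'wb' )
f.write( data )                                            # raw disk write only
f.close()
print 'write time:  %.1f s' % ( time.time() - start )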

You could fall back to storing a parallel list by hand, if you're just
using string and numeric primitives.
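
For example, if every small dict has the same few fields, a flat
tab-separated file works.  A sketch only; it assumes the 'a'/'b'/'c'
fields from your earlier example, with 'a' and 'b' as ints:

# write: one line per small dict, tagged with its top-level key
f = open( 'mydata.txt', 'w' )
for key, records in mydict.iteritems():
    for rec in records:
        f.write( '%s\t%s\t%s\t%s\n' % ( key, rec['a'], rec['b'], rec['c'] ) )
f.close()

# read it back
mydict = {}
for line in open( 'mydata.txt' ):
    key, a, b, c = line.rstrip( '\n' ).split( '\t' )
    mydict.setdefault( key, [] ).append( {'a': int( a ), 'b': int( b ), 'c': c} )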


