splitting a large dictionary into smaller ones

Terry Reedy tjreedy at udel.edu
Sun Mar 22 23:10:21 EDT 2009


per wrote:
> hi all,
> 
> i have a very large dictionary object that is built from a text file
> that is about 800 MB -- it contains several million keys.  ideally i
> would like to pickle this object so that i wouldn't have to parse this
> large file to compute the dictionary every time i run my program.
> however currently the pickled file is over 300 MB and takes a very
> long time to write to disk - even longer than recomputing the
> dictionary from scratch.

But you only write it once.  How does the read and reconstruct time 
compare to the recompute time?
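Measure both.  Something like this would tell you (untested; 'big_dict.pkl',
'big_file.txt' and build_dict_from_text() are just stand-ins for whatever
your program actually uses):

import cPickle as pickle   # C implementation; much faster than the pure-Python pickle module
import time

t0 = time.time()
f = open('big_dict.pkl', 'rb')
d = pickle.load(f)
f.close()
print("load pickle: %.1f s" % (time.time() - t0))

t0 = time.time()
d = build_dict_from_text('big_file.txt')   # placeholder for your existing parsing code
print("recompute:   %.1f s" % (time.time() - t0))

If loading the pickle does not clearly beat reparsing the text file,
splitting it into pieces is unlikely to change that.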
> 
> i would like to split the dictionary into smaller ones, containing
> only hundreds of thousands of keys, and then try to pickle them.

Do you have any evidence that this would really be faster?
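Time it before rewriting anything.  Roughly (untested; the shard count is
arbitrary and big_dict stands for your dictionary):

import cPickle as pickle
import time

t0 = time.time()
out = open('whole.pkl', 'wb')
pickle.dump(big_dict, out, pickle.HIGHEST_PROTOCOL)  # binary protocol: smaller and faster than the default
out.close()
print("one file:  %.1f s" % (time.time() - t0))

t0 = time.time()
shards = [{} for i in range(10)]
for k, v in big_dict.iteritems():
    shards[hash(k) % 10][k] = v
for i, shard in enumerate(shards):
    out = open('shard_%d.pkl' % i, 'wb')
    pickle.dump(shard, out, pickle.HIGHEST_PROTOCOL)
    out.close()
print("ten files: %.1f s" % (time.time() - t0))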

> is there a way to easily do this? i.e. is there an easy way to make a
> wrapper for this such that i can access this dictionary as just one
> object, but underneath it's split into several? so that i can write
> my_dict[k] and get a value, or set my_dict[m] to some value without
> knowing which sub dictionary it's in.

Searching for a key in, say, 10 dicts will be slower than searching for 
it in just one.  The only reason I would do this would be if the dict 
had to be split, say over several machines.  But then, you could query 
them in parallel.
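That said, if you do want one object that hides several dicts, something
like this would do it (untested sketch; the shard count is arbitrary, and
routing by hash means each lookup touches exactly one sub-dict rather than
searching them all):

import collections

class ShardedDict(collections.MutableMapping):
    """Acts like a single mapping but stores keys in several plain dicts."""
    def __init__(self, nshards=10):
        self.shards = [dict() for _ in range(nshards)]
    def _shard(self, key):
        # a given key always hashes to the same sub-dict
        return self.shards[hash(key) % len(self.shards)]
    def __getitem__(self, key):
        return self._shard(key)[key]
    def __setitem__(self, key, value):
        self._shard(key)[key] = value
    def __delitem__(self, key):
        del self._shard(key)[key]
    def __iter__(self):
        for shard in self.shards:
            for key in shard:
                yield key
    def __len__(self):
        return sum(len(shard) for shard in self.shards)

Every access still goes through an extra Python-level method call, so this
will be slower than a plain dict; the only gain is that each entry of
self.shards can be pickled to its own file.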

> if there aren't known ways to do this, i would greatly appreciate any
> advice/examples on how to write this data structure from scratch,
> reusing as much of the dict() class as possible.

Terry Jan Reedy



