splitting a large dictionary into smaller ones

Tim Chase python.list at tim.thechases.com
Mon Mar 23 08:57:23 EDT 2009


> I have a very large dictionary object that is built from a text file
> that is about 800 MB -- it contains several million keys.  Ideally I
> would like to pickle this object so that I wouldn't have to parse this
> large file to compute the dictionary every time I run my program.
> However, currently the pickled file is over 300 MB and takes a very
> long time to write to disk -- even longer than recomputing the
> dictionary from scratch.
> 
> I would like to split the dictionary into smaller ones, containing
> only hundreds of thousands of keys, and then try to pickle them.  Is
> there a way to easily do this?
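
To answer the literal question first: splitting the dictionary is 
easy enough.  Here's a rough sketch (the function, chunk size, and 
file names are just made up for illustration) that carves it into 
pickled chunks -- though writing many smaller pickles isn't 
necessarily any faster than writing one big one:

   from itertools import islice
   import cPickle as pickle

   def split_and_pickle(big_dict, chunk_size=200000, prefix="chunk"):
       # walk the keys and peel off chunk_size of them at a time
       keys = iter(big_dict)
       chunk_num = 0
       while True:
           group = list(islice(keys, chunk_size))
           if not group:
               break
           # build and pickle a sub-dictionary for this batch of keys
           chunk = dict((k, big_dict[k]) for k in group)
           f = open("%s_%04d.pkl" % (prefix, chunk_num), "wb")
           pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
           f.close()
           chunk_num += 1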

While others have suggested databases, they may be a bit 
overkill, depending on your needs.  Python 2.5+ ships with the 
sqlite3 module, and even older versions (at least back to 2.0) 
offer the anydbm module (changed to "dbm" in 3.0), which lets 
you create an on-disk string-to-string dictionary:

   import anydbm
   db = anydbm.open("data.db", "c")  # "c" creates the file if needed

   # populate the on-disk "dictionary" from the tab-delimited file,
   # using "db" exactly as you would a normal dict
   import csv
   f = open("800megs.txt")
   data = csv.reader(f, delimiter='\t')
   data.next()  # discard a header row
   for key, value in data:
       db[key] = value  # anydbm stores string keys and values
   f.close()

   print db["some key"]

   db.close()

The resulting DB object is a little sparsely documented, but for 
the most part it can be treated like a dictionary.  The advantage 
is that, if the source data doesn't change, you can parse once 
and then just use your "data.db" file from there out.
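
On later runs you can skip the parsing step entirely and just 
reopen the existing file read-only -- a minimal sketch, reusing 
the placeholder key from above:

   import anydbm
   db = anydbm.open("data.db", "r")  # "r" opens the existing file read-only
   print db["some key"]
   db.close()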
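
And if you eventually want more than bare key/value lookups, the 
sqlite3 route mentioned above isn't much more code.  This is only 
a sketch under the same assumptions (a tab-delimited file with a 
header row and two columns); the database, table, and column 
names are made up for illustration:

   import csv
   import sqlite3

   conn = sqlite3.connect("data.sqlite")
   conn.execute("CREATE TABLE IF NOT EXISTS kv"
                " (key TEXT PRIMARY KEY, value TEXT)")

   f = open("800megs.txt")
   data = csv.reader(f, delimiter='\t')
   data.next()  # discard a header row
   # csv.reader yields (key, value) rows, which executemany consumes
   conn.executemany("INSERT OR REPLACE INTO kv (key, value)"
                    " VALUES (?, ?)", data)
   f.close()
   conn.commit()

   # look up a single value
   row = conn.execute("SELECT value FROM kv WHERE key = ?",
                      ("some key",)).fetchone()
   if row:
       print row[0]
   conn.close()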

-tkc
