huge dictionary -> bsddb/pickle question

Alex Martelli aleax at mac.com
Fri Jun 15 10:40:33 EDT 2007


lazy <arunmail at gmail.com> wrote:
   ...
> key1=>{key11=>[1,2] , key12=>[6,7] , ....  }
   ...
> I'm processing HUGE (~100M inserts into the dictionary) data.

What will you need from the "saved" version later?  If you need lookups
by single keys or key pairs, then a relational DB may be best; if you
need to recover the whole huge dictionary into memory (assuming it will
fit, e.g. you have a 64-bit machine with ample RAM), then it might be
more efficient to "dump" things into a custom-designed format.  If I
were uncertain about future uses of the saved data, I'd go for the
maximal flexibility afforded by a relational DB -- once the data is in
a relational DB you can later easily use it in every which way,
including ad-hoc ones (such flexibility is a key architectural feature
of relational databases).

E.g., a single table with 4 fields might be enough: key, subkey,
leftint, rightint -- each field of the appropriate type (you do tell us
that the two ints are smallish integers, but we don't know the types of
the keys -- the main keys are presumably strings of some kind, since
that's what bsddb wants as its own keys, but choosing the right kind of
string still matters for saving space and getting good performance).
The (key, subkey) pair would be the primary key on that table, and
unless and until you need peculiar "navigation" in the future you may
not need other indices, constraints, or whatever.
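
Just to make that concrete, here's a minimal sketch of the single-table
approach using the sqlite3 module (the table name pairs, the file name
pairs.db, and the flatten helper are all just illustrative assumptions,
not anything you specified; the same schema works on any relational
DB):

    import sqlite3

    # stand-in for the real nested dictionary from the original post
    huge_dict = {'key1': {'key11': [1, 2], 'key12': [6, 7]}}

    def flatten(d):
        # turn {key: {subkey: [left, right], ...}, ...} into flat rows
        for key, subd in d.items():
            for subkey, (left, right) in subd.items():
                yield key, subkey, left, right

    conn = sqlite3.connect('pairs.db')
    conn.execute("""
        CREATE TABLE pairs (
            key      TEXT    NOT NULL,
            subkey   TEXT    NOT NULL,
            leftint  INTEGER NOT NULL,
            rightint INTEGER NOT NULL,
            PRIMARY KEY (key, subkey)
        )
    """)
    conn.executemany('INSERT INTO pairs VALUES (?, ?, ?, ?)',
                     flatten(huge_dict))
    conn.commit()

Lookups by single key or by (key, subkey) pair then become plain
SELECTs, e.g. conn.execute('SELECT leftint, rightint FROM pairs WHERE
key=?', ('key1',)).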

At the scale you're talking about (multi-gigabyte, it appears) it's
probably worth using a powerful DB (such as PostgreSQL, interfacing it
to Python via psycopg2) rather than a lightweight one (such as SQLite,
whose sqlite3 module comes with Python 2.5); however, depending on how
critical performance is, such issues are often best resolved by
benchmarking.  Similarly, you might benchmark inserting into the
properly constrained table from the start versus creating the table
without constraints and using ALTER TABLE afterwards to add the primary
key constraint/index that later lookups will need; "bulk insertions"
into DBs can often benefit from the latter approach.


Alex


