huge dictionary -> bsddb/pickle question

lazy arunmail at gmail.com
Fri Jun 15 04:22:58 EDT 2007


Hi,

I have a dictionary that looks something like this:

key1 => {key11 => [1,2], key12 => [6,7], ....}

For lack of better wording, I'll call the outer dictionary dict1 and
its value (the inner dictionary) dict2. Both levels are keyed by
strings; each value in dict2 is a small fixed-size list of 2 integers.
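In Python terms, with placeholder key names, the structure is:

    dict1 = {
        'key1': {'key11': [1, 2], 'key12': [6, 7]},
        # ... many more keys at both levels
    }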

I'm processing a HUGE amount of data (~100M inserts into the
dictionary). I tried 2 options, and both seem slow, so I'm looking for
suggestions to improve the speed. The code is in bits and pieces, so
below I'm just giving the idea.

1) Use bsddb. When an insert is done, the db has key1 as its key and
db[key1] is the pickled value of dict2. After every 1000 inserts I
close and reopen the db, in order to flush the contents to disk. Also,
when I try to insert a key that is already present, I unpickle the
value, change something in dict2, and then pickle it back into the
bsddb.
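Roughly, the idea is this (a simplified sketch, not my actual code;
the filename and function names are made up):

    import bsddb
    import cPickle as pickle

    db = bsddb.hashopen('dict1.db', 'c')
    inserts = 0

    def insert(key1, key2, pair):            # pair is the 2-integer list
        global db, inserts
        try:
            dict2 = pickle.loads(db[key1])   # key present: unpickle dict2
        except KeyError:
            dict2 = {}
        dict2[key2] = pair                   # change something in dict2
        db[key1] = pickle.dumps(dict2)       # pickle it back
        inserts += 1
        if inserts % 1000 == 0:
            db.close()                       # flush to disk ...
            db = bsddb.hashopen('dict1.db', 'c')   # ... and reopen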

2) Instead of pickling the value (dict2) and storing it in bsddb
immediately, I keep dict1 (the outer dictionary) in memory, and when
it reaches 1000 inserts I store it to bsddb as before, pickling each
individual value. The advantage is that when an insert hits a key that
is already present in memory, I adjust the value directly and don't
need to unpickle and pickle it back. If it's not present in memory, I
still need to look it up in bsddb.
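Again, roughly (simplified; I'm assuming here that the in-memory
dictionary gets cleared after each flush to keep memory bounded):

    import bsddb
    import cPickle as pickle

    db = bsddb.hashopen('dict1.db', 'c')
    dict1 = {}        # outer dictionary, kept in memory
    pending = 0

    def insert(key1, key2, pair):
        global pending
        if key1 not in dict1:
            try:
                dict1[key1] = pickle.loads(db[key1])  # not in memory: look up in bsddb
            except KeyError:
                dict1[key1] = {}
        dict1[key1][key2] = pair    # adjust in place, no unpickle/pickle round trip
        pending += 1
        if pending >= 1000:
            flush()

    def flush():
        global pending
        for key1, dict2 in dict1.iteritems():
            db[key1] = pickle.dumps(dict2)   # pickle each individual value
        db.sync()                            # or close/reopen as in option 1
        dict1.clear()                        # drop the cache to bound memory
        pending = 0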

Even with option 2 this is not getting up to speed. Before inserting,
I do some processing on each line, so it's not clear to me where the
bottleneck is (the line processing or the inserts into the db), but my
guess is that it's mainly the pickling and unpickling.
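I haven't profiled it properly yet; I suppose something like this
would tell me, where main() stands in for whatever drives the loop
over the input:

    import cProfile, pstats

    cProfile.run('main()', 'profile.out')
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)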

Any suggestions will be appreciated :)



