Large Dictionaries

Tue May 16 10:18:05 EDT 2006

Chris Foote wrote:
> Claudio Grondi wrote:
> 
>> Chris Foote wrote:
>>
>>> However, please note that the Python bsddb module doesn't support
>>> in-memory based databases - note the library documentation's[1] wording:
>>>
>>>     "Files never intended to be preserved on disk may be created 
>>> by      passing None as the filename."
>>>
>>> which closely mirrors the Sleepycat documentation[2]:
>>>
>>>     "In-memory databases never intended to be preserved on disk 
>>> may         be created by setting the file parameter to NULL."
>>>
>>> It does actually use a temporary file (in /var/tmp), for which 
>>> performance for my purposes is unsatisfactory:
>>>
>>> # keys   dictionary  metakit  bsddb  (all using psyco)
>>> ------   ----------  -------  -----
>>> 1M            8.8s     22.2s  20m25s[3]
>>> 2M           24.0s     43.7s  N/A
>>> 5M          115.3s    105.4s  N/A
>>>
>>> Cheers,
>>> Chris
>>>
>>> [1] bsddb docs:
>>>     http://www.python.org/doc/current/lib/module-bsddb.html
>>>
>>> [2] Sleepycat BerkeleyDB C API:
>>>     http://www.sleepycat.com/docs/api_c/db_open.html
>>>
>>> [3] Wall clock time.  Storing the (long_integer, integer) key in 
>>> string form "long_integer:integer" since bsddb doesn't support keys 
>>> that aren't integers or strings.
> 
>  >
> 
>> I have to admit, that I haven't wrote any own code to actually test 
>> this, but if 20m25s for storing of a single MByte of strings in a 
>> database table index column is really what you are getting, I can't 
>> get rid of the feeling, that there is something elementary wrong with 
>> your way doing it.
> 
> 
> Hi Claudio.
> 
> 1M is one million, referring to the number of insertions of keys;
> not a Megabyte.  I'm sorry that you took it that way :-(
> 
> Berkeley DB is great for accessing data by key for things already
> stored on disk (i.e. read access), but write performance for each
> key-value pair is slow due to it being careful about flushing
> writes to disk by default.
> 
>> Posting the code for your test cases appears to me to be the only 
>> option to see what is the reason for the mystery you are getting here 
>> (this will clarify also the other mysterious things considered by the 
>> posters to this thread up to now).
> 
> 
> I agree that posting some test code would have proved useful, but
> the code is big and has too many interdependencies on external
> things (e.g. databases, threads & pyro RPC calls) to allow me
> to separate out examples easily.  But if you go back to my
> original posting, I think my question was quite clear.
> 
> Best regards,
> Chris
> 
I have some code demonstrating the mystery from my point of view 
(including timings which differ from what you experience in the order of 
a magnitude on my 2.8 GHz P4):

dctA = {}
import time
strt = time.clock()
lst = range(123456789010000000L, 123456789010000000L + 5000000)
import random
endt = time.clock()
print "lst = range(12345678901L, 12345678901L + 5000000) [s] : ", endt - 
strt
strt = time.clock()
random.shuffle(lst)
endt = time.clock()
print "random.shuffle(lst)                               [s] : ", endt - 
strt
random.shuffle(lst)
counter = 0
for item in lst:
   if (counter == 0):
     print "Listing of some of lst items:"
   if (counter > 4):
     print ":END of listing of lst items"
     break
   else:
     print "item no. %i "%(counter,), repr(item)
   counter += 1
#:for
# raw_input('continue with ENTER) #>')
strt = time.clock()
dctA.fromkeys(lst, None)
endt = time.clock()
print "dctA.fromkeys(lst, None)                          [s] : ", endt - 
strt
# raw_input('continue with ENTER) #>')
strt = time.clock()
strLst = []
for item in lst: strLst.append(str(item))
dctA = {}
endt = time.clock()
print "for item in lst: strLst.append(str(item))         [s] : ", endt - 
strt
# raw_input('continue with ENTER) #>')
strt = time.clock()
dctA = dctA.fromkeys(strLst, None)
endt = time.clock()
print "dctA.fromkeys(strLst, None)                       [s] : ", endt - 
strt
# raw_input('continue with ENTER) #>')
print "len(dctA) : %i " % (len(dctA),)
counter = 0
for key in dctA.keys():
   if (counter == 0):
     print "Listing of some of dctA items:"
   if (counter > 4):
     print ":END of listing of dctA items"
     break
   else:
     print "key  no. %i "%(counter,), repr(key)
   counter += 1
#:for
raw_input('exit with ENTER) #>')

# Gives as output :
#
# lst = range(12345678901L, 12345678901L + 5000000) [s] :  0.81327347470
# random.shuffle(lst)                               [s] :  8.06178829991
# Listing of some of lst items:
# item no. 0  123456789011010733L
# item no. 1  123456789013585761L
# item no. 2  123456789013610266L
# item no. 3  123456789011311029L
# item no. 4  123456789010968853L
# :END of listing of lst items
# dctA.fromkeys(lst, None)                          [s] :  3.11773256098
# for item in lst: strLst.append(str(item))         [s] :  11.5454232312
# dctA.fromkeys(strLst, None)                       [s] :  3.98027849908
# len(dctA) : 5000000
# Listing of some of dctA items:
# key  no. 0  '123456789014515319'
# key  no. 1  '123456789014116699'
# key  no. 2  '123456789014116698'
# key  no. 3  '123456789010800915'
# key  no. 4  '123456789014116696'
# :END of listing of dctA items

So the insertion into the dictionary is according to you:
# keys   dictionary
5M          115.3s
according to me:
5M            3.9s

I need according to the task manager about 0.5 GByte memory for this 
(including the lists I keep alive).

I conclude from that, that your results from Berkeley approach are also 
a case of unfortunate way of doing things (in Python and with the DB) 
and using the Python dictionaries is here the simplest way to go (I 
suppose that your main problem is, that you are not using .fromkeys() on 
a list of your input data).

Claudio