Large Dictionaries

Roy Smith roy at panix.com
Mon May 15 09:11:21 EDT 2006


In article <1147699064.107490 at teuthos>, Chris Foote <chris at foote.com.au> 
wrote:

> Hi all.
> 
> I have the need to store a large (10M) number of keys in a hash table,
> based on a tuple of (long_integer, integer).  The standard python
> dictionary works well for small numbers of keys, but starts to
> perform badly for me inserting roughly 5M keys:
> 
> # keys   dictionary  metakit   (both using psyco)
> ------   ----------  -------
> 1M            8.8s     22.2s
> 2M           24.0s     43.7s
> 5M          115.3s    105.4s

Are those wall-clock times or CPU times?

How much memory is your process using and how much is available on your 
machine?

I'm guessing an integer takes 4 bytes and a long integer takes roughly one 
byte per two decimal digits.  Plus a few more bytes to bundle them up into 
a tuple.  You've probably got something like 20 bytes per key, so 5M of 
them is 100 meg just for the keys.
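
If you don't want to take my word for that 20-bytes-per-key figure, a crude 
way to check it is to build a pile of keys and see how much the process 
grows.  Something along these lines would do it (Linux-specific, since it 
scrapes /proc; the rss_kb() helper, the 1M count and the key values are 
just made up for illustration, and the list holding the keys adds a pointer 
per entry on top of the tuples themselves):

def rss_kb():
    # Resident set size in kB, scraped from /proc (Linux-specific).
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])
    return 0

before = rss_kb()
# A million (long_integer, integer) tuples, roughly like the keys described.
keys = [(123456789012345 + i, i % 1000) for i in xrange(1000000)]
after = rss_kb()
print "approx bytes per key:", (after - before) * 1024.0 / len(keys)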

To get reasonable hashing performance, the hash table shouldn't be much more 
than half full, which means something like 10M slots for your 5M keys.  Figure 
a hash value is 4 bytes per slot, so that's another 40 meg for the hash table 
itself.

Plus whatever the values stored in the dictionary take up.  Even if you're 
not storing any values, each slot still needs something like a 4-byte pointer 
(a reference to None, say), so that's another 40 meg.

These are all vague guesses, based on what I think are probably 
conservative estimates of what various objects must take up in memory, but 
you see where this is going.  We're already up to 180 meg.  I wonder if the 
whole problem is that you're just paging yourself to death.  A few minutes 
watching your system memory performance with ps or top while your program 
is running might give you some idea if this is the case.
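
If you'd rather watch it from inside the program, the same trick works 
there.  Reusing the rss_kb() helper from the snippet above, something like 
this (the 500k chunk size is arbitrary) would show the memory growing as 
the dictionary fills, and whether it's heading toward what your machine 
physically has:

d = {}
step = 500000
for chunk in xrange(10):                  # 10 chunks of 500k = 5M keys
    base = chunk * step
    for i in xrange(base, base + step):
        d[(123456789012345 + i, i % 1000)] = None
    print "%8d keys, RSS = %8d kB" % (len(d), rss_kb())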


