Initializing the number of slots in a dictionary

Tim Peters tim.peters at gmail.com
Mon Aug 7 00:33:33 EDT 2006


...

[Jon Smirl]
> I know in advance how many items will be added to the dictionary. Most
> dictionary implementations I have previously worked with are more
> efficient if they know ahead of time how big to make their tables.

Richard Jones spent considerable time investigating whether
"pre-sizing" lists and dicts in CPython could help, at the "Need For
Speed" sprint earlier this year.   He didn't find a win worth getting;
e.g., read the section "List pre-allocation" at:

    http://wiki.python.org/moin/NeedForSpeed/Failures

Trying it for dicts was also part of what he did, but I don't think
specifics about that were recorded on the Wiki.  I was at the sprint,
and hoots of triumph from Richard's direction were conspicuous by their
absence during his dict time ;-)

> In this case I only need to do a presence test of the key; there is no
> actual data associated with the key. The table is used for detecting
> duplicate entries. Is there a more efficient way to do this test than
> sticking an empty string into a dict? The keys are sha1.digest().

It's probably more common to set the value to None or 1, but it
doesn't really matter in reality.  In theory, using 1 or an empty
string relies on the implementation accident that CPython stores those
uniquely, while it's guaranteed that None is a singleton object.

BTW, is there a reason to use SHA instead of MD5?  I ask because the
latter is 4 bytes shorter, and you apparently have a /lot/ of keys.
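
Quick sanity check on the sizes (nothing here is specific to your data):

    import md5, sha

    print len(md5.new("example").digest())   # 16 bytes
    print len(sha.new("example").digest())   # 20 bytes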

If you're using a current version of Python, you can save some memory
by using a builtin set object instead.  The CPython implementation of
sets is very much like its implementation of dicts, but doesn't
consume any memory for the (non-existent) associated values.  You also
get a number of operations useful on sets (like intersection and
union).
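
Assuming the same toy setup as the dict sketch above, the set version
would read:

    import sha

    records = ["alpha", "beta", "alpha", "gamma", "beta"]  # stand-in data

    seen = set()                         # builtin set, Python 2.4+
    duplicates = 0
    for record in records:
        key = sha.new(record).digest()
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)                # no value stored at all
    print duplicates                     # -> 2, same answer, less memory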

> ...
> Since I am rerunning the conversion over and over, anything simple that
> speeds it up is helpful.
>
> I already have 4GB RAM, it would take around 30GB to get everything in memory.
>
> Dictionaries are not a big problem for me, but there are many in use and
> they have millions of items in them.

A peculiar suggestion:  if you don't need to release system resources
cleanly at the end of a run, try doing:

    import os
    os._exit(0)

at the end.  If you have dicts with millions of entries swapped to
disk, it /can/ consume major time just to page them all in again to
decrement the key & value refcounts if you let "clean shutdown" code
determine they've become trash.  Bailing ungracefully can skip all
that work (but also skips other work, like letting the platform C I/O
library close still-open files gracefully).
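
Concretely, I mean something along these lines at the very end of the
run; the script skeleton is made up, and the explicit flush calls are
just my own caution, since os._exit() bypasses the usual buffered-I/O
cleanup:

    import os, sys

    def main():
        # ... build and churn through the multi-million-entry dicts ...
        pass

    if __name__ == "__main__":
        main()
        sys.stdout.flush()   # os._exit() skips normal shutdown, so flush
        sys.stderr.flush()   # anything you still care about first
        os._exit(0)          # exit without tearing down the big dicts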


