Initializing the number of slots in a dictionary

Mon Aug 7 15:08:11 EDT 2006

On Mon, 07 Aug 2006 00:33:33 -0400, Tim Peters wrote:

> ...
> 
> [Jon Smirl]
>> I know in advance how many items will be added to the dictionary. Most
>> dictionary implementations I have previously worked with are more
>> efficient if they know ahead of time how big to make their tables.
> 
> Richard Jones spent considerable time investigating whether "pre-sizing"
> lists and dicts in CPython could help, at the "Need For Speed" sprint
> earlier this year.   He didn't find a win worth getting; e.g., read the
> section "List pre-allocation" at:
> 
>     http://wiki.python.org/moin/NeedForSpeed/Failures
> 
> Trying it for dicts was also part of what he did, but I don't think
> specifics about that were recorded on the Wiki.  I was at the sprint, and
> hoots of triumph from Richard's direction were conspicuous by absence
> during his dict time ;-)
> 
>> In this case I only need to do a presence test of the key, there is no
>> actual data associated with the key. The table is used for detecting
>> duplicate entries. Is there a more efficient to do this test that
>> sticking an empty string into a dict? The keys are sha1.digest().
> 
> It's probably more common to set the value to None or 1, but it doesn't
> really matter in reality.  In theory, using 1 or an empty string relies on
> the implementation accident that CPython stores those uniquely, while it's
> guaranteed that None is a singleton object.
> 
> BTW, is there a reason to use SHA instead of MD5?  I ask because the
> latter is 4 bytes shorter, and you apparently have a /lot/ of keys.

http://git.or.cz/index.html
git is the source code control tool used by the Linux kernel.  It is 
optimized for distributed development. git is what the kernel developers
wrote to replace Bitkeeper after the Bitkeeper license change.

git identifies it's change sets with sha1 hashes.

Distributed SCM is very important when working on large projects. With a 
distributed SCM nobody needs commit privs - everyone has their own
complete copy of the repository. To get a change into a distribution you
need to convince someone up the heirarchy to pull your changes into
their repository. In the Linux world Linus makes the distributions. There
are about 10 people in the next circle and about 100 in the circle after
that. You need to convince one of them to accept your changes, but that's
not very hard if your code makes sense.

Distributed also means that you can pull changes from anyone else into
your repository. So if you want to run Reiser4 and it isn't in the main
Linus tree yet, just pull a copy from the namesys git tree into your local
tree and everything will get merged.

Python is using Subversion which is way better than CVS, but Subversion is
still organized around a central repository with commit privs. If that
repository disappears in an earthquake Python may be in trouble.

If you give git a try you will also notice that it is way faster than
subversion.

> If you're using a current version of Python, you can save some memory by
> using a builtin set object instead.  The CPython implementation of sets is
> very much like its implementation of dicts, but doesn't consume any memory
> for the (non-existent) associated values.  You also get a number of
> operations useful on sets (like intersection and union).

Sets may be a good option, I'll adjust the code for the next run.

>> ...
>> Since I am rerunning the conversion over and over anything simple that
>> speeds it up is helpful.
>>
>> I already have 4GB RAM, it would take around 30GB to get everything in
>> memory.
>>
>> Dictionaries are not a big problem for me, but there are many in use and
>> they have millions of items in them.
> 
> A peculiar suggestion:  if you don't need to release system resources
> cleanly at the end of a run, try doing:
> 
>     import os
>     os._exit(0)
> 
> at the end.  If you have dicts with millions of entries swapped to disk,
> it /can/ consume major time just to page them all in again to decrement
> the key & value refcounts if you let "clean shutdown" code determine
> they've become trash.  Bailing ungracefully can skip all that work (but
> also skips other work, like letting the platform C I/O library close
> still-open files gracefully).

Jon Smirl
jonsmirl at gmail.com