Initializing the number of slots in a dictionary

Jon Smirl jonsmirl at gmail.com
Sun Aug 6 23:58:31 EDT 2006


On Sun, 06 Aug 2006 15:33:30 -0700, John Machin wrote:

> Jon Smirl wrote:
>> Is there some way to tell a dictionary object that I am going to load 1M
>> objects into it and have it pre-allocate enough slots to hold all of
>> the entries?
> 
> Not according to the manual.
> 
> Not according to the source [as at 2.4.3]. In any case, if there were a
> back-door undocumented arg for the dict constructor, somebody would have
> read the source and spread the news.
> 
>> Thus avoiding many thousand memory allocations.
> 
> What gives you the impression that "many thousand memory allocations" are
> involved? [My impression is that the dict would be resized about 11 times
> on its trip from size 8 to size 2M, and each resize would involve one
> allocation.]
> 
> Do you have an application with a performance problem? If so, what makes
> you think inserting 1M items into a Python dict is contributing to the
> problem?

I know in advance how many items will be added to the dictionary. Most
dictionary implementations I have previously worked with are more
efficient if they know ahead of time how big to make their tables.
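There is no public hook for pre-sizing a CPython dict, but when all the keys are known up front, dict.fromkeys builds the table in a single call instead of a Python-level insert loop, which sidesteps most of the per-item interpreter overhead (the table still resizes internally as it grows). A minimal sketch, using 1000 synthetic sha1 digests as stand-ins for the real keys:

```python
import hashlib

# Hypothetical stand-in for the real keys: 1000 sha1 digests.
keys = [hashlib.sha1(str(i).encode()).digest() for i in range(1000)]

# One C-level call builds the whole dict; no per-item bytecode loop.
seen = dict.fromkeys(keys, "")
```

This only helps when the keys can be collected (or streamed from disk) before the dict is needed; it does not reduce the number of internal resize allocations.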

In this case I only need to do a presence test of the key; there is no
actual data associated with the key. The table is used for detecting
duplicate entries. Is there a more efficient way to do this test than
sticking an empty string into a dict? The keys are sha1.digest().
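For a pure presence test, the built-in set type (available since Python 2.4) is usually a better fit than a dict with dummy values: it stores only the keys, so it saves the per-slot value pointer and reads more directly. A minimal sketch of sha1-based duplicate detection, assuming the records arrive as byte strings:

```python
import hashlib  # the older sha module provides the same digest pre-2.5

def count_duplicates(records):
    """Count records whose sha1 digest has been seen before."""
    seen = set()          # membership-only table; no dummy values needed
    duplicates = 0
    for record in records:
        digest = hashlib.sha1(record).digest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates
```

The `in` test on a set has the same average O(1) cost as a dict lookup, so this mainly trades memory rather than speed.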

> 
> Have you read the thread "Large Dictionaries" which started on/about
> 2006-05-15?

I'll look at this.

> In general, what is the background to your question?

I am in the process of modifying cvs2svn to do cvs2git. The purpose of
this is to convert Mozilla CVS (4GB) into git format. Currently a
conversion takes 5 hours but the code isn't finished yet. 

Other programs that attempt to process the Mozilla CVS repository take
several days to run. But none of the other programs can successfully
convert the Mozilla repository without encountering errors. cvs2svn can
successfully convert the repository to Subversion, but it takes about a
week. It is obvious that a lot of effort has been put into teaching
cvs2svn how to deal with broken CVS repositories.

I do practice on smaller repositories but some of the problems I am
dealing with only occur when processing the full dataset (it contains
weird CVS errors). In general the conversion process is completely IO
bound. Since I am rerunning the conversion over and over anything simple
that speeds it up is helpful. I already have 4GB RAM; it would take around
30GB to get everything in memory.

Dictionaries are not a big problem for me, but there are many in use and
they have millions of items in them. But first I have to get the code for
converting to git working; then I can worry about how long it runs. Two or
three days is ok, a month is not ok.

Jon Smirl
jonsmirl at gmail.com
