Large Dictionaries

Duncan Booth duncan.booth at invalid.invalid
Tue May 16 04:03:36 EDT 2006


Chris Foote wrote:

>> Don't forget that adding keys
>> requires resizing the dict, which is a moderately expensive operation.
> 
> Yep, that's why I probably need a dictionary where I can pre-specify
> an approximate size at the time of its creation.

Try splitting the creation of the keys from the creation of the dictionary 
and you'll see where the bottleneck really is.
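
For example, something along these lines (just a sketch using the standard 
time module; the key expression here is a stand-in, so substitute whatever 
actually generates your keys) separates the two costs:

    import time

    t0 = time.time()
    keys = [ i*2**32 for i in range(5000000) ]   # build the keys on their own
    t1 = time.time()
    d = dict.fromkeys(keys)                      # build the dict from ready-made keys
    t2 = time.time()

    print "creating keys:", t1 - t0
    print "creating dict:", t2 - t1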

On my system:

>>> r = range(5000000)
>>> d = dict.fromkeys(r)

takes less than 1 second to create the dictionary. So creating a dictionary 
with 5 million elements is not in itself a terribly expensive operation 
(and yes, even with fromkeys there is no attempt to work out in advance 
what size the dictionary should be).

Trying to simulate your data:

>>> scale = 2**32
>>> data = [ (i*scale,i) for i in range(5000000) ]
>>> d = dict.fromkeys(data)

The assignment to data took 42 seconds; creating the dictionary took 6 
seconds.

Trying the variation someone else suggested:

>>> scale = 2**32
>>> data = [ i*scale+i for i in range(5000000) ]
>>> d = dict.fromkeys(data)

creating data took under 6 seconds, and creating d took about 1 second.

So it looks like the issue is neither creating a large dictionary nor 
creating lots of integers: what is expensive is creating lots of tuples.
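
A quick way to check that (again just a sketch; the exact numbers will vary 
from machine to machine) is to time the two list comprehensions on their own, 
with no dictionary involved at all; the difference between them is roughly 
the cost of building 5 million tuples:

    import time

    scale = 2**32
    r = range(5000000)

    t0 = time.time()
    pairs = [ (i*scale, i) for i in r ]   # one tuple per element
    t1 = time.time()
    packed = [ i*scale+i for i in r ]     # one plain integer per element
    t2 = time.time()

    print "tuples:  ", t1 - t0
    print "integers:", t2 - t1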

Oh, and in case anyone tells you that the intermediate list I created in the 
tests above is what makes it slow, and that you should use iterators instead:

>>> r = range(5000000)
>>> scale = 2**32
>>> s = [ i*scale for i in r ]
>>> from itertools import izip
>>> d = dict.fromkeys(izip(s,r))

The first few lines took a couple of seconds; the last one took 1 minute 50 
seconds. The alternative version:

>>> scale = 2**32
>>> r = range(5000000)
>>> d = dict.fromkeys((i*scale,i) for i in r)

takes a similar time.
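
If the tuple keys are not essential in themselves, the packed-integer variant 
above can be unpacked again on the way out with divmod, assuming (as I do 
here) that the second component is always non-negative and smaller than 
scale:

>>> scale = 2**32
>>> key = 12345*scale + 67
>>> divmod(key, scale)
(12345, 67)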


