unintuitive dict timings

Tim Peters tim.one at comcast.net
Sun Oct 12 18:05:07 EDT 2003


[Bob van der Poel]
> There was a thread a few days ago about populating a dict/list. I've
> got a bit of a bottleneck in my program, so I grabbed the ideas and did
> a timing test....
>
> ----------
>
> import time
> import random
>
> # Test of various methods to add an item to a list in a dict.
>
> def ifelse(key, val):
>   if d.has_key(key):

That's faster as "if key in d". d.has_key needs an extra step to look up the
method named "has_key" in the d object.
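
If you want to see the difference directly, the timeit module (new in Python
2.3) makes a quick check easy.  A minimal sketch (the dict size and repeat
count are arbitrary, and absolute times will vary by machine):

import timeit

setup = "d = dict.fromkeys(range(1000))"
for stmt in ("500 in d", "d.has_key(500)"):
    # Same membership test either way; the has_key spelling also pays
    # for an attribute lookup on every call.
    print stmt, timeit.Timer(stmt, setup).timeit(1000000)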

>    d[key].append(val)
>   else:
>    d[key]=[val]
>
> def get1(key,val):
>   l=d.get(key) or []
>   l.append(val)
>   d[key]=l
>
> def get2(key,val):
>   l=d.get(key, [])
>   l.append(val)
>   d[key]=l
>
> for p in (ifelse, get1, get2):
>   tm=time.clock()

Ouch.  You're timing lots of things besides what you're trying to time here.

>   d={}

On the iterations after the first, you're timing how long it takes to
garbage-collect the value of d left over from the last iteration.

>   kc=1000

Set kc to a million, and you may find that get2 becomes faster than get1.
The ratio of kc to the hard-coded 100000 two lines below has a great
influence on how likely it is that the key is already in the dict when a
function is called.  Whether or not the key is already present changes the
sequence of Python statements executed in ifelse and get1, and get1 is
faster when the key is already in the dict than when it isn't.
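
If you want to see how strongly kc controls that, a rough sketch along these
lines (the names here are illustrative, not part of the original script)
counts how often a freshly drawn key is already present:

import random

def hit_rate(kc, n=100000):
    # Fraction of the n draws whose key was already seen earlier.
    seen = {}
    hits = 0
    for i in xrange(n):
        key = random.randrange(kc)
        if key in seen:
            hits += 1
        else:
            seen[key] = 1
    return float(hits) / n

for kc in (1000, 100000, 1000000):
    print kc, hit_rate(kc)

With kc=1000 nearly every call after the early ones finds the key already
there; with kc=1000000 most calls see a brand-new key.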

>   print "Function: %s Keycount %s" % (p, kc)

You're timing how long it takes to print, which can vary a lot depending on
the display device.

>   for t in range(100000):

You're also timing how long it takes to build a big list, and how long it
takes to build 100000 int objects, and how long it takes to garbage-collect
all that stuff again.

>    key=random.randrange(kc)
>    val=random.randrange(999)

Also timing 200000 lookups of random and of random.randrange, and 200000
calls to random.randrange.  There's no plausible reason to suspect that
dict-setting time is affected by the *values* put in the dict, just the key
distribution, so varying the values doesn't do much except burn time.
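
The rewrite further down sidesteps all of this by generating the key
sequence before the clock starts.  If you did want to keep randrange calls
inside a hot loop, the usual trick is to bind the bound method to a local
name once, so each call skips looking up the random module and its
randrange attribute again.  A small sketch (the counts are arbitrary):

import random

randrange = random.randrange     # one lookup, instead of one per call
keys = [randrange(1000) for i in xrange(100000)]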

Also, because timing does depend on the sequence of key values, for fairness
each method should be judged against the same sequence of key values.

>    p(key,val)

This is the piece you wanted to time <wink>.

>   print "Time", time.clock()-tm
>
> ------------
>
> Before running, my thoughts were that get2() would have been the
> fastest (since it has the fewest statements) and ifelse() would have
> been the slowest. Surprisingly (to me), ifelse() is the fastest and
> get2() is the slowest.

Try this instead for a clearer picture (your conclusions may or may not
change, but it runs a lot faster regardless):

import time
import random

# Test of various methods to add an item to a list in a dict.

def ifelse(key, val):
    if key in d:
        d[key].append(val)
    else:
        d[key] = [val]

def getorset(key, val):
    d.setdefault(key, []).append(val)

def get1(key, val):
    l = d.get(key) or []
    l.append(val)
    d[key] = l

def get2(key, val):
    l = d.get(key, [])
    l.append(val)
    d[key] = l

N = 100000
kc = 1000
keys = [random.randrange(kc) for i in xrange(N)]

for p in (ifelse, getorset, get1, get2):
    d = {}
    print "Function: %s Keycount %s" % (p, kc)
    tm = time.clock()
    for k in keys:
        p(k, 999)
    print "Time", time.clock()-tm





