gather information from various files efficiently

Tue Dec 14 13:14:29 EST 2004

Klaus Neuner wrote:
> The straightforward way to solve this problem is to create a
> dictionary. Like so:
> 
> 
> [...]
> 
> a, b = get_information(line)
> if a in dict.keys():
>     dict[a].append(b)
> else:
>     dict[a] = [b]
> 

So I timed the three suggestions with a few different datasets:

 > cat builddict.py
def askpermission(d, k, v):
     if k in d:
         d[k].append(v)
     else:
         d[k] = [v]

def askforgiveness(d, k, v):
     try:
         d[k].append(v)
     except KeyError:
         d[k] = [v]

def default(d, k, v):
     d.setdefault(k, []).append(v)

def test(items, func):
     d = {}
     for k, v in items:
         func(d, k, v)

Dataset where every value causes a collision:

 > python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i 
in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.62 msec per loop

 > python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i 
in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.58 msec per loop

 > python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i 
in range(1000)], bd.default)"
100 loops, best of 3: 2.03 msec per loop

Dataset where no value causes a collision:

 > python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i 
in range(1000)], bd.askpermission)"
100 loops, best of 3: 2.29 msec per loop

 > python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i 
in range(1000)], bd.askforgiveness)"
100 loops, best of 3: 9.96 msec per loop

 > python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i 
in range(1000)], bd.default)"
100 loops, best of 3: 2.98 msec per loop

Datset where one of every 5 values causes a collision:

 > python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for 
i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.82 msec per loop

 > python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for 
i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.79 msec per loop

 > python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for 
i in range(1000)], bd.default)"
100 loops, best of 3: 2.2 msec per loop

So, when there are lots of collisions, you may get a small benefit from 
the try/except solution.  If there are very few collisions, you probably 
would rather the if/else solution.  The setdefault solution patterns 
about like the if/else solution, but is somewhat slower.

I will probably continue to use setdefault, because I think it's 
prettier =) but if you're running into a speed bottleneck in this sort 
of situation, you might consider one of the other solutions.

Steve