[Python-Dev] Tuning Python dicts

Sat Apr 10 20:06:59 CEST 2010

Hey folks,

I was looking at tuning Python dicts for a data structures class final
project.  I've looked through Object/dictnotes.txt, and obviously
there's already a large body of work on this topic.  My idea was to
alter dict collision resolution as described in the hopscotch hashing
paper[1].  I think the PDF I have came from behind a pay-wall, so I
can't find a link to the original paper.

[1] http://en.wikipedia.org/wiki/Hopscotch_hashing

Just to be clear, this is an experiment I'm doing for a class.  If it
is successful, which I think is unlikely since Python dicts are
already well-tuned, I might consider trying to contribute it back to
CPython over the summer.

The basic idea of hopscotch hashing is to use linear probing with a
cutoff (H), but instead of rehashing when the probe fails, find the
next empty space in the table and move it into the neighborhood of the
original hash index.  This means you have to spend potentially a lot
of extra time during insertion, but it makes lookups very fast because
H is usually chosen such that the entire probe spans at most two cache
lines.  This is much better than the current random (what's the right
name for the current approach?) probing solution, which does
potentially a handful of random accesses into the table.

Looking at dictnotes.txt, I can see that people have experimented with
taking advantage of cache locality.  I was wondering what benchmarks
were used to glean these lessons before I write my own.  Python
obviously has very particular workloads that need to be modeled
appropriately, such as namespaces and **kwargs dicts.

Any other advice would also be helpful.

Thanks,
Reid

One caveat I need to work out:  If more than H items collide into a
single bucket, then you need to rehash.  However, if you have a
particularly evil hash function which always returns zero, no matter
how much you rehash, you will never be able to fit all the items into
the first H buckets.  This would cause an infinite loop, while I
believe the current solution will simply have terrible performance.
IMO the solution is just to increase H for the table if the rehash
fails, but realistically, this will never happen unless the programmer
is being evil.  I'd probably skip this detail for the class
implementation.