[Python-Dev] Dictionary tuning

Tim Peters tim.one@comcast.net
Wed, 30 Apr 2003 22:13:46 -0400


[Raymond Hettinger]
> ...
> I worked on similar approaches last month and found them wanting.
> The concept was that a 64-byte cache line held 5.3 dict entries and
> that probing those was much less expensive than making a random
> probe into memory outside of the cache.
>
> The first thing I learned was that the random probes were necessary
> to reduce collisions.  Checking the adjacent space is like a single
> step of linear probing; it increases the number of collisions.

Yes, I believe that any regularity will.

> That would be fine if the cost were offset by decreased memory
> access time; however, for small dicts, the whole dict is already
> in cache and having more collisions degrades performance
> with no compensating gain.
>
> The next bright idea was to have a separate lookup function for
> small dicts and for larger dictionaries.  I set the large dict lookup
> to search adjacent entries.  The good news is that an artificial
> test of big dicts showed a substantial improvement (around 25%).
> The bad news is that real programs were worse-off than before.

You should qualify that to "some real programs", or perhaps "all real
programs I've tried".  On the other side, I have real programs that access
large dicts in random order, so if you tossed those into your mix, a 25%
gain on those would more than wipe out the 1-2% losses you saw elsewhere.

> A day of investigation showed the cause.  The artificial test
> accessed keys randomly and showed the anticipated benefit. However,
> real programs access some keys more frequently than others
> (I believe Zipf's law applies.)

Some real programs do, and, for all I know, most real programs.  It's not
the case that all real programs do.  The counterexamples that sprang
instantly to my mind are those using dicts to accumulate stats for testing
random number generators.  Those have predictable access patterns only when
the RNG they're testing sucks <wink>.

> Those keys *and* their collision chains are likely already in the cache.
> So, big dicts had the same limitation as small dicts:  You always lose
> when you accept more collisions in return for exploiting cache locality.

Apart from that "always" ain't always so, I really like that as a summary!

> The conclusion was clear: the best way to gain performance
> was to have fewer collisions in the first place.  Hence, I
> resumed experiments on sparsification.

How many collisions are you seeing?  For int-keyed dicts, all experiments I
ran said Python's dicts collided less often than a perfectly random hash
table would (the reason is explained in dictobject.c; in short, int-keyed
dicts tend to use a contiguous range of ints as keys).
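
A quick Python sketch of why contiguous int keys are so friendly (the table
size and masking here are illustrative only, not dictobject.c's):

    # Small ints hash to themselves, so a contiguous range of int keys
    # maps to a contiguous range of slots -- no collisions at all until
    # the table wraps around.
    table_size = 32                     # a power of two, as in dict's tables
    keys = range(10, 20)                # a contiguous range of int keys
    slots = [hash(k) & (table_size - 1) for k in keys]
    print(slots)                        # [10, 11, ..., 19]
    assert len(set(slots)) == len(slots)   # every key gets its own slot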

For string-keyed dicts, extensive experiments said collision behavior was
indistinguishable from a perfectly random hash table.

I never cared enough about other kinds of keys to time 'em, at least not
since systematic flaws were fixed in the tuple and float hash functions
(e.g., the tuple hash function used to xor the tuple's elements' hash codes,
so that all permutations of a given tuple had the same hash code; that's
necessary for unordered sets, but tuples are ordered).
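
A toy sketch of that old flaw (just the gist, not the actual old tuple
hash):

    # If a tuple's hash were merely the xor of its elements' hashes,
    # every permutation of the same elements would hash alike.
    def xor_hash(t):
        h = 0
        for x in t:
            h ^= hash(x)
        return h

    print(xor_hash((1, 2, 3)) == xor_hash((3, 1, 2)))   # True -- guaranteed collision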

>> If someone wants to experiment with that in lookdict_string(),
>> stick a new
>>
>>     ++i;
>>
>> before the for loop, and move the existing
>>
>> i = (i << 2) + i + perturb + 1;
>>
>> to the bottom of that loop.  Likewise for lookdict().

> PyStone gains 1%.
> PyBench loses a 1%.
> timecell gains 2%     (spreadsheet benchmark)
> timemat loses 2%     (pure python matrix package benchmark)
> timepuzzle loses 1% (class based graph traverser)

You'll forgive me if I'm skeptical:  they're such small differences that, if
I saw them, I'd consider them to be a wash -- in the noise.  What kind of
platform are you running on that has timing behavior repeatable enough to
believe 1-2% blips?
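
For anyone else who wants to try the experiment, here's roughly the probe
order that change produces, sketched in Python; the recurrence and shift
mimic dictobject.c's, everything else is simplified:

    # Probe order with the "check the adjacent slot first" tweak:  the home
    # slot, then slot i+1, then the usual perturbed pseudo-random probing.
    def probes(hashcode, mask, max_probes=8):
        i = hashcode & mask
        perturb = hashcode
        yield i                       # the home slot
        i += 1                        # the new "++i":  peek at the neighbor
        for _ in range(max_probes):
            yield i & mask
            i = (i << 2) + i + perturb + 1   # the old recurrence, now at the
            perturb >>= 5                    # bottom of the loop

    print(list(probes(123456789, 31)))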

> P.S.  There is one other way to improve cache behavior
> but it involves touching code throughout dictobject.c.

Heh -- that wouldn't even be considered a minor nuisance to the truly
obsessed <wink>.

> Move the entry values into a separate array from the
> key/hash pairs.  That way, you get 8 entries per cache line.

What use case would that help?  You said before that small dicts are all in
cache anyway, so it wouldn't help those.  The jumps in large dicts are so
extreme that it doesn't seem to matter if the cache line size du jour holds
1 slot or 100.  To the contrary, at the end of the large dict lookup, it
sounds like it would incur an additional cache miss to fetch the value after
the key was found (since that value would no longer ever ride along with the
key and hashcode).

I can think of a different reason for considering this:  sets have no use
for the value slot, and wouldn't need to allocate space for 'em.
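
For concreteness, here's the arithmetic I assume is behind "5.3 entries"
versus "8 entries" (32-bit pointer sizes assumed; the slot sizes are my
guesses at what's being counted, not measurements):

    # Today a slot holds (hash, key pointer, value pointer); if the values
    # moved to a parallel array, the probed slots would shrink to
    # (hash, key pointer) and more of them would fit in one cache line.
    CACHE_LINE = 64
    combined_slot = 4 + 4 + 4           # hash + key ptr + value ptr
    split_slot = 4 + 4                  # hash + key ptr only
    print(float(CACHE_LINE) / combined_slot)   # ~5.3 slots per cache line
    print(CACHE_LINE // split_slot)            # 8 slots per cache line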

> P.P.S.  One other idea is to use a different search pattern
> for small dictionaries.  Store entries in a self-organizing list
> with no holes.  Dummy fields aren't needed which saves
> a test in the linear search loop.  When an entry is found,
> move it one closer to the head of the list so that the most
> common entries get found instantly.

I don't see how more than just one can be found instantly; if "instantly"
here means "in no more than a few tries", that's usually true of dicts
too -- but is still an odd meaning for "instantly" <wink>.

> Since there are no holes, all eight cells can be used instead of the
> current maximum of five.  Like the current arrangement, the
> whole small dict fits into just two cache lines.

Neil Schemenauer suggested that a few years ago, but I don't recall it going
anywhere.  I don't know how to predict for which apps it would be faster.
If people are keen to optimize very small search tables, think about schemes
that avoid the expense of hashing too.
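
If anyone does want to play with that, here's a toy Python sketch of the
no-holes, transpose-on-hit idea; the class and its names are invented for
illustration, and it skips hashing and resizing entirely:

    # Self-organizing association list:  no holes, no dummy entries, plain
    # linear search; a hit swaps the entry one step toward the front, so
    # frequently used keys drift to the head of the list.
    class SmallMap:
        def __init__(self):
            self.entries = []                    # list of [key, value] pairs

        def __setitem__(self, key, value):
            for pair in self.entries:
                if pair[0] == key:
                    pair[1] = value
                    return
            self.entries.append([key, value])

        def __getitem__(self, key):
            entries = self.entries
            for i, pair in enumerate(entries):
                if pair[0] == key:
                    if i:                        # move one closer to the head
                        entries[i - 1], entries[i] = entries[i], entries[i - 1]
                    return pair[1]
            raise KeyError(key)

    m = SmallMap()
    m['spam'] = 1
    m['eggs'] = 2
    print(m['eggs'])      # the hit nudges 'eggs' ahead of 'spam'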