[Python-Dev] String hash function multiplier

Tim Peters tim_one at email.msn.com
Wed Apr 14 23:13:45 EDT 2004


[Raymond]
> On the P4, the documented latency went up from 4 cycles to 14 cycles
> while shifts and adds went down to 0.5 cycles and 1 cycle respectively.
> Timings confirm the result.

What makes the current stepping of some flavor of P4 overwhelmingly
important <0.3 wink>?  Tradeoffs at this level change with every chip
generation, and within a generation vary across vendors.  In particular,
vendors can make int multiply as fast as they want just by throwing more
highly regular silicon at it (which is why int multiply's relative
performance varies so much across chip generations, and is usually worst
in the earliest iterations of a new generation).

My overwhelming concern isn't the micro-efficiency of the generated code
for the string-hash loop; it's that the resulting hash have high quality.
I don't know that the "little prime" is vulnerable to bad cases in
non-contrived real life, but we have years of evidence suggesting that the
"big prime" isn't.  If we can develop evidence that the little prime isn't
vulnerable either, great: then I lose all objections to using it.
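
For reference, the loop in question looks roughly like this (a
simplified sketch of string_hash in Objects/stringobject.c; the hash
caching and the -1 special case are left out).  The 1000003 below is
the "big prime"; the "little prime" proposal would drop in a much
smaller multiplier:

    static long
    string_hash_sketch(const unsigned char *p, long len)
    {
        long i;
        long x = *p << 7;   /* p is NUL-terminated in CPython, so this
                               read is safe even when len == 0 */

        for (i = 0; i < len; i++)
            x = (1000003 * x) ^ p[i];   /* overflow wraps, as the real
                                           code relies on */
        x ^= len;
        return x;
    }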

> It looks like the best bet is to try to speed up the code without
> changing the multiplier.  Intel's software optimization cookbook
> recommends a partial unrolling and elimination of data dependencies so
> that a second multiply can start 4 cycles after the previous one
> started.  If practice bears out the theory, the timings could show a
> three or fourfold speedup without changing the multiplier.

Or leave it alone.  Hand unrolling a loop all but guarantees that the next
generation of compiler optimizer won't be able to reverse-engineer the
original "natural" loop structure, and so won't be able to do the code
transformation that works best for the next generation of hardware tricks.
IOW, once you start down the path of hand-optimizing for a specific compiler
+ HW combo, there are two outcomes:  (1) someone makes this their job
forever after; or, (2) the performance of the code actually gets worse over
time (compared to what it would have been if the original code had been left
dirt simple).
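
For concreteness, here is one shape such a rewrite can take (a sketch
only; not Raymond's code, and not necessarily the transformation the
cookbook describes):  hash the even- and odd-indexed characters with
independent accumulators, so that two multiplies can be in flight at
once.  The three- or fourfold estimate presumably comes from paying
the full multiply latency once per character in the current dependent
chain, versus starting a fresh multiply every few cycles.  Because xor
doesn't distribute over the multiply, this computes a different hash
value than the current loop, even with the same multiplier:

    /* Sketch only:  two independent accumulators let consecutive
     * multiplies overlap instead of waiting on each other.  The seeds
     * and the final combining step are arbitrary choices made for
     * this sketch. */
    static long
    string_hash_two_stream_sketch(const unsigned char *p, long len)
    {
        unsigned long x1 = 0, x2 = 0;
        long i;

        for (i = 0; i + 1 < len; i += 2) {
            x1 = (1000003UL * x1) ^ p[i];      /* independent of x2, */
            x2 = (1000003UL * x2) ^ p[i + 1];  /* so these can overlap */
        }
        if (i < len)                           /* odd-length tail */
            x1 = (1000003UL * x1) ^ p[i];
        return (long)((1000003UL * x1) ^ x2 ^ (unsigned long)len);
    }

Whether the two-stream hash behaves as well as the current one on real
dict workloads is exactly the kind of question someone would then have
to own forever.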

I happen to have an app (spambayes) that makes very heavy use of
string-keyed dicts, and where interning is impractical (so the string hashes
can't be avoided).  Still, speed up string hashing by a factor of 10, and it
won't make enough difference in that app's throughput to mention.  I also
have a hyper-threaded box today, and the imul stalls in this P4 give great
opportunities for other processes to get work done <0.9 wink>.
