[Python-3000] Performance Notes - new hash algorithm

Tim Peters tim.peters at gmail.com
Mon Sep 10 03:32:16 CEST 2007


[Larry Hastings]
> I see--it's avoiding the Birthday Paradox.

It /tends/ to, yes.  This wasn't a design goal of the string hash,
it's just a property observed after it was adopted, and appreciated
much later ;-)

It's much clearer for Python's small-int hash, where hash(i) == i for
i != -1.  That is, nearly all "small enough" integers are their own
"hash codes".  That guarantees no collisions whatsoever in a dict
keyed by a contiguous range of small integers (excluding -1), no
matter how large the range.

Read the comments in dictobject.c for more on this.  The
predictability of such hash schemes has both good & bad implications
for dict performance, and Python's dict conflict-resolution strategies
are fancier than most to mitigate the possible bad implications.

> Collisions are actually more likely if the numbers are totally random than
> if the numbers are, because of a feeble hash algorithm, relatively
> consecutive.  ;)

Right.  In a "good" (cryptographically speaking) hash function, a
1-bit change in the input "should" change about half the output bits
(in the hash code), making collisions much more likely when the keys
differ little in the low bits.

The important point is that the cost of building and accessing
string-keyed dicts is more important (in Python) than the cost of just
hashing strings, and collision resolution is a real expense.


More information about the Python-3000 mailing list