[issue34751] Hash collisions for tuples

Tim Peters report at bugs.python.org
Fri Sep 21 00:46:25 EDT 2018


Tim Peters <tim at python.org> added the comment:

@jdemeyer, please define exactly what you mean by "Bernstein hash".  Bernstein has authored many hashes, and none on his current hash page could possibly be called "simple":

    https://cr.yp.to/hash.html

If you're talking about the teensy 

    hash = hash * 33 + c;

thing from now-ancient C folklore, Bernstein is claimed to have replaced addition with xor in that later:

    http://www.cse.yorku.ca/~oz/hash.html

In any case, Python _used_ to do plain

    x = (1000003 * x) ^ y;
    
but changed to the heart of the current scheme 14 years ago to address Source Forge bug #942952:
    https://github.com/python/cpython/commit/41bd02256f5a2348d2de3d6e5fdfcaeb2fcaaebc#diff-cde4fb3109c243b2c2badb10eeab7fcd

The Source Forge bug reports no longer appear to exist.  If you care enough, dig through the messages with subject:

    [ python-Bugs-942952 ] Weakness in tuple hash

here:

    https://mail.python.org/pipermail/python-bugs-list/2004-May/subject.html

The thrust was that the simpler approach caused, in real life, hashes of nested tuples of the form

    (X, (X, Y))

to return hash(Y) - the hash(X) parts across calls magically xor'ed out.

It was also systematically the case that

   (X, (Y, Z))

and

   (Y, (X, Z))

hashed to the same thing.

So the complications were added for a reason, to address the showed-up-in-real-life pathological cases discussed in the bug report already linked to.  Since I don't believe we've had other reports of real-life pathologies since, nobody is going to change this again without truly compelling reasons.  Which I haven't seen here, so I'm closing this report again.

BTW, as also explained in that report, the goal of Python's hash functions is to minimize collisions in dicts in real life.  We generally couldn't care less whether they "act random", or are hard to out-guess.  For example,

     assert all(hash(i) == i for i in range(1000000))

has always succeeded.

----------
resolution:  -> rejected
status: open -> closed

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________


More information about the Python-bugs-list mailing list