[issue34751] Hash collisions for tuples

Tim Peters report at bugs.python.org
Wed Oct 3 17:41:59 EDT 2018


Tim Peters <tim at python.org> added the comment:

Here's a complete xxHash-based implementation via throwing away C++-isms from

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

The 32-bit build fares worse:

- 3 collisions in the old tuple test.
- 51 collisions in the new tuple test.
- 173 collisions across product([-3, 3], repeat=20)
- 141 collisions across product([0.5, 0.25], repeat=20)
- no other collisions

The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3.  So 141 is fine, but 173 is highly suspect.  51 for the new tuple test is way out of bounds.

But I don't much care.  None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases.  All you have to do is keep trying.

Much as with SeaHash, adding

    t ^= t << 1;

at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build.  It also cuts the product([-3, 3], repeat=20) collisions to 109.  But boosts the old tuple test's collisions to 12.  I wouldn't do it:  it adds cycles for no realistic gain.  We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash.

BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit).

#define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
#if Py_NHASHBITS >= 64
#    define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
#    define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
#    define Py_HASH_LSHIFT 31
#elif Py_NHASHBITS >= 32
#    define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
#    define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
#    define Py_HASH_LSHIFT 13
#else
#    error "can't make sense of Py_uhash_t size"
#endif
#define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        t = (Py_uhash_t)y;
        x += t * Py_HASH_MULT2;
        x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
        x *= Py_HASH_MULT1;
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}
#undef Py_NHASHBITS
#undef Py_HASH_MULT1
#undef Py_HASH_MULT2
#undef Py_HASH_LSHIFT
#undef Py_HASH_RSHIFT

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________


More information about the Python-bugs-list mailing list