[issue13703] Hash collision security issue

Fri Jan 20 02:11:24 CET 2012

STINNER Victor <victor.stinner at haypocalc.com> added the comment:

> Since the hash function is known, it doesn't make things much
> harder. Without suffix you just need hash('') to find out what
> the prefix is. With suffix, two values are enough.

With my patch, hash('') always returns zero. I don't remember who asked
me to do that, but it avoids leaking the secret too easily :-) I wrote
some notes on how to compute the secret:
http://bugs.python.org/issue13703#msg150706

I don't see how to compute the secret, but that doesn't mean it is
impossible :-) I suppose that you have to brute-force some bits, at
least if you only have repr(dict), which (indirectly) gives only the
lower bits of the hash.

> (things obviously get tricky once overflow kicks in)

hash() doesn't really overflow: the arithmetic just wraps modulo 2^32
(or 2^64), so if you know the string, you can run the algorithm
backward. To divide by 1000003, multiply by its modular inverse:
1/1000003 mod 2^32 = 2021759595 and 1/1000003 mod 2^64 =
16109806864799210091. So x/1000003 mod 2^32 = x*2021759595 mod 2^32.
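
As a rough illustration of running one round backward, here is a minimal
sketch in Python. It assumes a simplified 32-bit round of the form
x = (x * 1000003) ^ c; the names step_forward and step_backward are
hypothetical and are not taken from the patch.

```python
MULT = 1000003            # multiplier used by the string hash
MASK32 = 2**32 - 1        # keep values in the 32-bit range

# Modular inverse of the multiplier.
# Three-argument pow() with exponent -1 needs Python 3.8 or newer.
INV32 = pow(MULT, -1, 2**32)

def step_forward(x, c):
    """One simplified round: multiply, then XOR in the character code."""
    return ((x * MULT) ^ c) & MASK32

def step_backward(x_next, c):
    """Undo one round: XOR the character back out, then 'divide' by MULT."""
    return ((x_next ^ c) * INV32) & MASK32

state = 0x12345678                        # arbitrary intermediate state
after = step_forward(state, ord("a"))
recovered = step_backward(after, ord("a"))
```

Here recovered equals state again, because multiplying by INV32 cancels
the multiplication by 1000003 modulo 2^32.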

See my invert_mod() function in:
https://bitbucket.org/haypo/misc/src/tip/python/mathfunc.py
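
For readers without access to that file, a modular inverse can be
computed with the extended Euclidean algorithm. The sketch below is an
independent illustration that only borrows the name invert_mod; it is
not the code from mathfunc.py, and its signature is an assumption.

```python
def invert_mod(a, m):
    """Return b such that (a * b) % m == 1.

    Raises ValueError if a has no inverse modulo m.
    """
    # Invariant: old_r == old_s * a (mod m) and r == s * a (mod m).
    old_r, r = a % m, m
    old_s, s = 1, 0
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        # gcd(a, m) != 1, so no inverse exists
        raise ValueError("a is not invertible modulo m")
    return old_s % m

inv32 = invert_mod(1000003, 2**32)
inv64 = invert_mod(1000003, 2**64)
```

Since 1000003 is odd, it is coprime to any power of two, so an inverse
always exists modulo 2^32 and 2^64.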

> With Victor's approach hash(0) would output the whole seed,
> but even if the seed is not known, creating an attack data
> set is trivial, since hash(x) = P ^ x ^ S.

I suppose that it would be too simple to compute the secret of a
randomized integer hash, so it may be better to leave integer hashes
unchanged. Using different secrets for strings and integers would not
protect Python against an attack that uses only integers, but integer
keys are less common than string keys (especially in web applications).
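
To see why the secret of such an integer hash is too easy to recover,
here is a toy model of the quoted formula hash(x) = P ^ x ^ S. The
function toy_int_hash and the 64-bit random values for P and S are
assumptions made for illustration only; this is not CPython code.

```python
import random

# Hypothetical secret prefix and suffix, unknown to the attacker.
P = random.getrandbits(64)
S = random.getrandbits(64)

def toy_int_hash(x):
    """Toy randomized integer hash following the quoted formula."""
    return P ^ x ^ S

# Hashing 0 leaks the combined secret P ^ S directly.
leaked = toy_int_hash(0)

# Even without it, the secret cancels out between any two hashes,
# so the attacker knows how the hashes of chosen keys relate.
a, b = 1234, 98765
diff = toy_int_hash(a) ^ toy_int_hash(b)

# With the leaked value, every hash is predictable.
predicted = leaked ^ a
```

In this model diff is always a ^ b, whatever P and S are, which is why
building a colliding data set does not require knowing the seed.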

Anyway, I changed my mind about randomized hashing: I now prefer
counting collisions :-)

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13703>
_______________________________________