[Python-Dev] Status of the fix for the hash collision vulnerability

Fri Jan 13 10:23:45 CET 2012

> Unfortunately it requires only a few seconds to compute enough 32bit
> collisions on one core with no precomputed data.

Are you running the hash function "backward" to generate strings with
the same value, or you are more trying something like brute forcing?

And how do you get the hash secret? You need it to run an attack.

> In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) ^
> suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) is
> possible.

My change adds also a prefix (a prefix and a suffix). I don't know if
it changes anything for generating collisions.

> So the question is: How difficult is it to guess the seed?

I wrote some remarks about that in the issue. For example:

(hash("\0")^1) ^ (hash("\0\0")^2) gives ((prefix * 1000003) &
HASH_MASK) ^ ((prefix * 1000003**2)  & HASH_MASK)

I suppose that you don't have directly the full output of hash(str) in
practical, but hash(str) & DICT_MASK where DICT_MASK depends is the
size of the internal dict array minus 1. For example, for a dictionary
of 65,536 items, the mask is 0x1ffff and so cannot gives you more than
17 bits of hash(str) output. I still don't know how difficult it is to
retreive hash(str) bits from repr(dict).

Victor