[Python-Dev] Hash collision security issue (now public)

Sun Jan 1 04:29:59 CET 2012

> I'm not too concerned about a 3rd party being able to guess the random seed
> -- this would require much more effort on their part, since they would have
> to generate a new set of colliding keys each time they think they have
> guessed the hash

This is incorrect. Once an attacker has guessed the random seed, any
operation which reveals the ordering of hashed objects can be used to
verify the answer. JSON responses would be ideal. In fact, an attacker
can do a brute-force attack of the random seed offline. Once they have
the seed, generating collisions is a fast process.

The goal isn't perfection, but we need to do better than a simple
salt. I propose we modify the string hash function like this:

https://gist.github.com/0a91e52efa74f61858b5

This code is based on PyPy's implementation, but the concept is
universal. Rather than choosing a single short random seed per
process, we generate a much larger random seed (r). As we hash, we
deterministically choose a portion of that seed and incorporate it
into the hash process. This modification is a minimally intrusive
change to the existing hash function, and so should not introduce
unexpected side effects which might come from switching to a different
class of hash functions.

I've worked through this code with Alex Gaynor, Antoine Pitrou, and
Victor Stinner, and have asked several mathematicians and security
experts to review the concept. The reviewers who have gotten back to
me thus far have agreed that if the initial random seed is not flawed,
this should not overly change the properties of the hash function, but
should make it quite difficult for an attacker to deduce the necessary
information to predictably cause hash collisions. This function is not
designed to protect against timing attacks, but should be nontrivial
to reverse even with access to timing data.

Empirical testing shows that this unoptimized python implementation
produces ~10% slowdown in the hashing of ~20 character strings. This
is probably an acceptable trade off, and actually provides better
performance in the case of short strings than a high-entropy
fixed-length seed prefix.

-Paul