[Python-Dev] Hash collision security issue (now public)

Sun Jan 1 16:09:54 CET 2012

On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan <paul at mcmillan.ws> wrote:

> > Still, it would represent an effort for the attacker of a much greater
> > magnitude than the current attack. It's all a trade-off -- at some point
> > it'll just be easier for the attacker to use some other vulnerability.
> Also
> > the class of vulnerable servers would be greatly reduced.
>
> I agree that doing anything is better than doing nothing. If we use
> the earlier suggestion and prepend everything with a fixed-length
> seed, we need quite a bit of entropy (and so a fairly long string) to
> make a lasting difference.
>

Ah, but the effect of that long string is summarized in a single (32- or
64-bit) integer.

>  > I'm not sure I understand this. What's the worry about "a different
> class of
> > hash functions"? (It may be clear that I do not have a deep mathematical
> > understanding of hash functions.)
>
> This was mostly in reference to earlier suggestions of switching to
> cityhash, or using btrees, or other more invasive changes. Python 2.X
> is pretty stable and making large changes like that to the codebase
> can have unpredictable effects.

Agreed.

> We know that the current hash function
> works well (except for this specific problem), so it seems like the
> best fix will be as minimal a modification as possible, to avoid
> introducing bugs.
>

Yup.

> > I forget -- what do we do on systems without urandom()? (E.g. Windows?)
> Windows has CryptGenRandom which is approximately equivalent.
>
> > Let's worry about timing attacks another time okay?
> Agreed. As long as there isn't a gaping hole, I'm fine with that.
>
> > Hm. I'm not sure I like the idea of extra arithmetic for every character
> > being hashed.
>
> From a performance standpoint, this may still be better than adding 8
> or 10 characters to every single hash operation, since most hashes are
> over short strings.

But how about precomputing the intermediate value (x)? The hash is (mostly)
doing x = f(x, c) for each c in the input.

It is important that this function touches every
> character - if it only interacts with a subset of them, an attacker
> can fix that subset and vary the rest.
>

I sort of see your point, but I still think that if we could add as little
per-character overhead as possible it would be best -- sometimes people
*do* hash very long strings.

> > But I like the idea of a bigger random seed from which we
> > deterministically pick some part.
>
> Yeah. This makes it much harder to attack, since it very solidly
> places the attacker outside the realm of "just brute force the key".
>
> > How about just initializing x to some
> > subsequence of the seed determined by e.g. the length of the hashed
> string
> > plus a few characters from it?
>
> We did consider this, and if performance is absolutely the prime
> directive, this (or a variant) may be the best option. Unfortunately,
> the collision generator doesn't necessarily vary the length of the
> string. Additionally, if we don't vary based on all the letters in the
> string, an attacker can fix the characters that we do use and generate
> colliding strings around them.
>

Still, much more work for the attacker.

> Another option to consider would be to apply this change to some but
> not all of the rounds. If we added the seed lookup xor operation for
> only the first and last 5 values of x, we would still retain much of
> the benefit without adding much computational overhead for very long
> strings.
>

I like that.

> We could also consider a less computationally expensive operation than
> the modulo for calculating the lookup index, like simply truncating to
> the correct number of bits.
>

Sure. Thanks for thinking about all the details here!!

-- 
--Guido van Rossum (python.org/~guido)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20120101/2efa5eef/attachment.html>