[Python-Dev] Hash collision security issue (now public)

Mon Jan 2 00:43:52 CET 2012

> But how about precomputing the intermediate value (x)? The hash is (mostly)
> doing x = f(x, c) for each c in the input.

That's a fair point. If we go down that avenue, I think simply
choosing a random fixed starting value for x is the correct choice,
rather than computing an intermediate value.

> I sort of see your point, but I still think that if we could add as little
> per-character overhead as possible it would be best -- sometimes people *do*
> hash very long strings.

Yep, agreed. My original proposal did not adequately address this.

>> Another option to consider would be to apply this change to some but
>> not all of the rounds. If we added the seed lookup xor operation for
>> only the first and last 5 values of x, we would still retain much of
>> the benefit without adding much computational overhead for very long
>> strings.
>
> I like that.

I believe this is a reasonable solution. An attacker could still
manipulate the internal state of long strings, but the additional
information at both ends should make that difficult to exploit. I'll
talk it over with the reviewers.

>> We could also consider a less computationally expensive operation than
>> the modulo for calculating the lookup index, like simply truncating to
>> the correct number of bits.
>
> Sure. Thanks for thinking about all the details here!!

Again, I'll talk to the reviewers (and run the randomness test
battery) to be double-check that this doesn't affect the distribution
in some unexpected way, but I think it should be fine.

> PS. Is the collision-generator used in the attack code open source?

Not in working form, and they've turned down requests for it from
other projects that want to check their work. If it's imperative that
we have one, I can write one, but I'd rather not spend the effort if
we don't need it.

-Paul