[Python-Dev] Counting collisions for the win

Fri Jan 20 06:24:55 CET 2012

On 1/19/2012 8:54 PM, Carl Meyer wrote:
> Hi Victor,
>
> On 01/19/2012 05:48 PM, Victor Stinner wrote:
> [snip]
>> Using a randomized hash may
>> also break (indirectly) real applications because the application
>> output is also somehow "randomized". For example, in the Django test
>> suite, the HTML output is different at each run. Web browsers may
>> render the web page differently, or crash, or ... I don't think that
>> Django would like to sort attributes of each HTML tag, just because we
>> wanted to fix a vulnerability.
> I'm a Django core developer, and if it is true that our test-suite has a
> dictionary-ordering dependency that is expressed via HTML attribute
> ordering, I consider that a bug and would like to fix it. I'd be
> grateful for, not resentful of, a change in CPython that revealed the
> bug and prompted us to fix it. (I presume that it is true, as it sounds
> like you experienced it directly; I don't have time to play around at
> the moment, but I'm surprised we haven't seen bug reports about it from
> users of 64-bit Pythons long ago). I can't speak for the core team, but
> I doubt there would be much disagreement on this point: ideally Django
> would run equally well on any implementation of Python, and as far as I
> know none of the alternative implementations guarantee hash or
> dict-ordering compatibility with CPython.
>
> I don't have the expertise to speak otherwise to the alternatives for
> fixing the collisions vulnerability, but I don't believe it's accurate
> to presume that Django would not want to fix a dict-ordering dependency,
> and use that as a justification for one approach over another.
>
> Carl

It might be a good idea to have a way to seed the hash with some value
to allow testing with different dict orderings. This would allow tests
developed using one Python implementation to be made immune to
the different orderings on different implementations. However,
randomizing the hash does not solve the problem for long-running
applications, and it causes non-deterministic performance from one run to
the next even with the exact same data. A different (random) seed could
sporadically cause collisions with data that usually gave good
performance, and there would be little explanation for it and
little way to reproduce the problem in order to report or understand it.
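
To illustrate the seeding idea, here is a rough sketch of what such a test
helper might look like. It assumes an interpreter that reads a hash seed
from the PYTHONHASHSEED environment variable, as later CPython releases do.
The names ATTRS and ordering_for_seed are made up for illustration, and a
set of strings stands in for any hash-ordered container.

```python
import os
import subprocess
import sys

# Hypothetical sample data: attribute names such as those on an HTML tag.
ATTRS = ["id", "class", "href", "title", "style", "name", "type", "value"]


def ordering_for_seed(seed):
    """Return the iteration order of a set of strings under one hash seed.

    Assumption: the child interpreter honors PYTHONHASHSEED. The seed must
    be fixed before the interpreter starts, hence the subprocess.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "import sys; print(' '.join(set(sys.argv[1:])))"
    result = subprocess.run(
        [sys.executable, "-c", code] + ATTRS,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    # The same seed reproduces the same ordering; different seeds will
    # usually (not always) give different orderings.
    for seed in (1, 2, 3):
        print(seed, ordering_for_seed(seed))
```

Running a test suite under a few fixed seeds like this would expose an
ordering dependency while keeping any failure reproducible, since the
seed that triggered it can be reported along with the bug.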