[Python-Dev] Counting collisions for the win

Fri Jan 20 19:52:44 CET 2012

Am 20.01.2012 19:15, schrieb Guido van Rossum:

>     OTOH, if you change dictionary order and *that* breaks the application, then
>     the bugs submitted to the application's tracker will be legitimate bugs that
>     have to be fixed even if nothing else changed.
> 
> 
> There are lots of things that are undefined according to the language spec (and
> quite possibly known to vary between versions or platforms or implementations
> like PyPy or Jython) but which we would never change in a bugfix release.
> 
>     So I still think we should ditch the paranoia about dictionary order changing,
>     and fix this without counting.  A little bit of paranoia could creep back in
>     by disabling the hash fix by default in stable releases, but I think it would
>     be fine to make that a compile-time option.
> 
> 
> I'm sorry, but I don't want to break a user's app with a bugfix release and say
> "haha your code was already broken you just didn't know it".
> 
> Sure, the dict order already varies across Python implementations, possibly
> across 32/64 bits or operating systems. But many organizations (I know a few :-)
> have a very large installed software base, created over many years by many
> people with varying skills, that is kept working in part by very carefully
> keeping the environment as constant as possible. This means that the target
> environment is much more predictable than it is for the typical piece of open
> source software.

I agree.  This applies to 3.2 and 2.7, but even more to 3.1 and 2.6, which are
in security-fix mode.

Even if relying on dict order is a bug right now, I believe it happens many
times more often in code bases out there than dicts that are filled with many
many colliding keys.  So even if we can honestly blame the programmer in the
former case, the users applying the security fix will have the same bad
experience and won't likely care if we claim "undefined behavior".  This means
that it seems preferable to go with the situation where you have less breakages
in total.

Not to mention that changing dict order is likely to lead to much more subtle
bugs than a straight MemoryError on a dict insert.

Also, another advantage of collision counting I haven't seen in the discussion
yet is that it's quite trivial to provide an API in sys to turn it on or off --
while turning on or off randomized hashes has to be done before Python starts
up, i.e. at build time or with an environment variable or flag.

Georg