[Python-Dev] Counting collisions for the win

Ivan Kozik ivan at ludios.org
Fri Jan 20 04:32:13 CET 2012


On Fri, Jan 20, 2012 at 00:48, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> I propose to solve the hash collision vulnerability by counting
> collisions because it does fix the vulnerability with a minor or no
> impact on applications or backward compatibility. I don't see why we
> should use a different fix for Python 3.3. If counting collisions
> solves the issue for stable versions, it is also enough for Python
> 3.3. We now know all issues of the randomized hash solution, and I
> think that there are more drawbacks than advantages. IMO the
> randomized hash is overkill to fix the hash collision issue.

I'd like to point out that an attacker is not limited to sending just
one dict full of colliding keys.  Given a 22ms stall for a dict full
of 1000 colliding keys, and 100 such objects inside a parent object
(perhaps JSON), you can stall a server for 2.2+ seconds.  Going with
the raise-at-1000 approach doesn't solve the problem for everyone.
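
To make that concrete, here is a rough sketch of how the stalls
multiply (the keys below are stand-ins; producing keys that really
collide requires knowledge of the hash function, and the 22ms figure
is the per-dict cost from above):

    import json

    # Stand-ins only: a real attack needs 1000 strings that actually
    # collide under hash(); these do not.
    colliding_keys = ["key%d" % i for i in range(1000)]

    # One object whose keys all collide costs ~22ms to turn into a dict.
    one_bad_object = {k: 0 for k in colliding_keys}

    # Nesting 100 such objects in a single request body turns one
    # request into a ~2.2 second stall on the server, which pays the
    # cost when it calls json.loads(payload).
    payload = json.dumps([one_bad_object] * 100)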

In addition, because the raise-at-N-collisions approach raises an
exception, everyone who wants to handle this error condition properly
has to change their code to catch a previously-unexpected exception.
(I know they're usually still better off with the fix, but why force
many people to change code when you can actually fix the hashing
problem?)
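
For example, anything that builds dicts from untrusted input would
grow handlers along these lines (the exception name below is made up;
the patch's exact exception type isn't the point, the new call-site
code is):

    import json

    class TooManyHashCollisions(Exception):
        """Stand-in for whatever the raise-at-N patch would raise."""

    class BadRequest(Exception):
        """Application-level rejection of a malicious request."""

    def load_request_body(text):
        try:
            return json.loads(text)
        except TooManyHashCollisions:
            # Before the patch this exception could never occur, so no
            # existing code catches it; every call site like this one
            # has to be updated by hand.
            raise BadRequest("too many colliding keys in request body")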

Another issue is that even with a configurable limit, different
modules can't have their own limits.  One module might want a
relatively safe raise-at-100, and another module creating massive
dicts might want raise-at-1000.  How does a developer know whether
they can raise or lower the limit, given that they use a bunch of
different modules?

I actually went with this stop-at-N-collisions approach by patching my
CPython a few years ago: I limited dictobject's and setobject's
critical `for` loop to 100 iterations (I realize this might handle
fewer than 100 collisions).  This worked fine until I tried to compile
PyPy, where the translator blew up due to a massive dict.  This,
combined with the second problem (needing to catch an exception), led
me to abandon this approach and write Securetypes, which has a
securedict that uses SHA-1.  Not that I like this either; I think I'm
happy with the randomize-hash() approach.
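
For reference, the core idea behind securedict is roughly the
following (a minimal sketch of the concept, not the actual Securetypes
code, and it assumes keys with a stable repr()):

    import hashlib

    class _SHA1Key(object):
        """Wrap a key so the dict hashes it via SHA-1, not hash()."""
        __slots__ = ("key",)

        def __init__(self, key):
            self.key = key

        def __hash__(self):
            digest = hashlib.sha1(repr(self.key).encode("utf-8")).digest()
            # Take the first 8 bytes of the digest as the hash value, so
            # an attacker can't engineer collisions through hash() alone.
            return int.from_bytes(digest[:8], "big", signed=True)

        def __eq__(self, other):
            return isinstance(other, _SHA1Key) and self.key == other.key

    class SecureDictSketch(dict):
        """Minimal sketch of the securedict idea; lookup/assignment only."""

        def __setitem__(self, key, value):
            dict.__setitem__(self, _SHA1Key(key), value)

        def __getitem__(self, key):
            return dict.__getitem__(self, _SHA1Key(key))

        def __contains__(self, key):
            return dict.__contains__(self, _SHA1Key(key))

Iteration and the rest of the dict API need the same wrapping and
unwrapping; the sketch covers only enough to show where SHA-1 slots in.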

Ivan

