[issue13703] Hash collision security issue

Mon Jan 23 14:07:26 CET 2012

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Dave Malcolm wrote:
> 
> Dave Malcolm <dmalcolm at redhat.com> added the comment:
> 
> On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
>> Marc-Andre Lemburg <mal at egenix.com> added the comment:
>>
>> Demo patch implementing the collision limit idea for Python 2.7.
>>
>> ----------
>> Added file: http://bugs.python.org/file24151/hash-attack.patch
>>
> 
> Marc: is this the latest version of your patch?

Yes. As mentioned in the above message, it's just a demo of how
the collision limit idea can be implemented.

> Whether or not we go with collision counting and/or adding a random salt
> to hashes and/or something else, I've had a go at updating your patch
> 
> Although debate on python-dev seems to have turned against the
> collision-counting idea, based on flaws reported by Frank Sievertsen
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html
> it seemed to me to be worth at least adding some test cases to flesh out
> the approach.  Note that the test cases deliberately avoid containing
> "hostile" data.

Martin's example is really just a red herring: it doesn't matter
where the hostile data originates or how it gets into the application.
There are many ways an attacker can get the O(n^2) worst case
timing triggered.

Frank's example is an attack on the second possible way to
trigger the O(n^2) behavior. See msg150724 further above where I
listed the two possibilities:

"""
An attack can be based on trying to find many objects with the same
hash value, or trying to find many objects that, as they get inserted
into a dictionary, very often cause collisions due to the collision
resolution algorithm not finding a free slot.
"""

My demo patch only addresses the first variant. In order to cover
the second variant as well, you'd have to count and limit the
number of iterations in the perturb for-loop of the lookdict()
functions where the hash value of the slot does not match the
key's hash value.

Note that the second variant is both a lot less likely to trigger
(due to the dict getting resized on a regular basis) and the
code involved a lot faster than the code for the first
variant (which requires a costly object comparison), so the
limit for the second variant would have to be somewhat higher
than for the first.

BTW: The collision counting patch chunk for the string dicts in my
demo patch is wrong. I've attached a corrected version. In the
original patch it was counting both collision variants with the
same counter and limit.

----------
Added file: http://bugs.python.org/file24295/hash-attack-2.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13703>
_______________________________________
-------------- next part --------------
Index: Objects/dictobject.c
===================================================================

--- Objects/dictobject.c	(revision 88933)
+++ Objects/dictobject.c	(working copy)
@@ -9,6 +9,8 @@
 
 #include "Python.h"
 
+/* Maximum number of allowed hash collisions. */
+#define Py_MAX_DICT_COLLISIONS 1000
 
 /* Set a key error with the specified argument, wrapping it in a
  * tuple automatically so that tuple keys are not unpacked as the
@@ -327,6 +329,7 @@
     register PyDictEntry *ep;
     register int cmp;
     PyObject *startkey;
+    size_t collisions;
 
     i = (size_t)hash & mask;
     ep = &ep0[i];
@@ -361,6 +364,7 @@
 
     /* In the loop, me_key == dummy is by far (factor of 100s) the
        least likely outcome, so test for that last. */
+    collisions = 1;
     for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
         i = (i << 2) + i + perturb + 1;
         ep = &ep0[i & mask];
@@ -387,6 +391,11 @@
                  */
                 return lookdict(mp, key, hash);
             }
+	    if (++collisions > Py_MAX_DICT_COLLISIONS) {
+		PyErr_SetString(PyExc_KeyError,
+				"too many hash collisions");
+		return NULL;
+	    }
         }
         else if (ep->me_key == dummy && freeslot == NULL)
             freeslot = ep;
@@ -413,6 +422,7 @@
     register size_t mask = (size_t)mp->ma_mask;
     PyDictEntry *ep0 = mp->ma_table;
     register PyDictEntry *ep;
+    size_t collisions;
 
     /* Make sure this function doesn't have to handle non-string keys,
        including subclasses of str; e.g., one reason to subclass
@@ -439,17 +449,24 @@
 
     /* In the loop, me_key == dummy is by far (factor of 100s) the
        least likely outcome, so test for that last. */
+    collisions = 1;
     for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
         i = (i << 2) + i + perturb + 1;
         ep = &ep0[i & mask];
         if (ep->me_key == NULL)
             return freeslot == NULL ? ep : freeslot;
-        if (ep->me_key == key
-            || (ep->me_hash == hash
-            && ep->me_key != dummy
-            && _PyString_Eq(ep->me_key, key)))
+        if (ep->me_key == key)
             return ep;
-        if (ep->me_key == dummy && freeslot == NULL)
+        if (ep->me_hash == hash && ep->me_key != dummy) {
+	    if (_PyString_Eq(ep->me_key, key))
+		return ep;
+	    if (++collisions > Py_MAX_DICT_COLLISIONS) {
+		PyErr_SetString(PyExc_KeyError,
+				"too many hash collisions");
+		return NULL;
+	    }
+	}
+        else if (ep->me_key == dummy && freeslot == NULL)
             freeslot = ep;
     }
     assert(0);          /* NOT REACHED */