[Python-Dev] Status of the fix for the hash collision vulnerability

Tue Jan 17 12:55:02 CET 2012

> I thought that the original problem was that with N insertions in the
> dictionary, by repeatedly inserting different keys generating the same
> hash value an attacker could arrange that the cost of finding an open
> slot is O(N), and thus the cost of N insertions is O(N^2).
>
> If so, frequent resizing could make the attacker's problem much more
> difficult, as the distribution of secondary probes should change with
> each resize.

The attack creates 60,000 strings (or more) with exactly the same hash
value. A dictionary uses hash(str) & DICT_MASK to compute the bucket
index, where DICT_HASH is the number of buckets minus one. If all
strings have the same hash value, we always start in the same bucket
and the key has to be compared to all previous strings to find the
next empty bucket. The attack works because a LOT of strings are
compared and comparing strings is slow.

If hash(str1)&DICT_MASK == hash(str2)&DICT_MASK but
hash(str1)!=hash(str2), strings are not compared (this is a common
optimization in Python), and the so the attack would not be successful
(it would be slow, but not as slow as comparing two strings).

Victor