[Python-Dev] Hash collision security issue (now public)

Sat Dec 31 03:22:24 CET 2011

Le 29/12/2011 02:28, Michael Foord a écrit :
> A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages Python included:
>
> 	 http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf

This PDF doesn't explain exactly the problem and how it can be solved. 
Let's try to summarize this "vulnerability".

The creation of a Python dictionary has a complexity of O(n) in most 
cases, but O(n^2) in the *worst* case. The attack tries to go into the 
worst case. It requires to compute a set of N keys having the same hash 
value (hash(key1) == hash(key2) == ... hash(keyN)). It only has to 
compute these keys once. It looks like it is now cheap enough in 
practice to compute this dataset for Python (and other languages).

A countermeasure would be to check that we don't have more than X keys 
with the same hash value. But in practice, we don't know in advance how 
data are processed, and there are too many input vectors in various formats.

If we want to fix something, it should be done in the implementation of 
the dict type or in the hash algorithm. We can implement dict 
differently to avoid this issue, using a binary tree for example. 
Because dict is a fundamental type in Python, I don't think that we can 
change its implementation (without breaking backward compatibility and 
so applications in production). A possibility would be to add a *new* 
type, but all libraries and applications would need to be changed to fix 
the vulnerability.

The last choice is to change the hash algorithm. The *idea* is the same 
than adding salt to hashed password (in practice it will be a little bit 
different): if a pseudo-random salt is added, the attacker cannot 
prepare a single dataset, he/she will have to regenerate a new dataset 
for each possible salt value. If the salt is big enough (size in bits), 
the attacker will need too much CPU to generate the dataset (compute N 
keys with the same hash value). Basically, it slows down the attack by 
2^(size of the salt).

--

Another possibility would be to replace our fast hash function by a 
better hash function like MD5 or SHA1 (so the creation of the dataset 
would be too slow in practice = too expensive), but cryptographic hash 
functions are much slower (and so would slow down Python too much).

Limiting the size of the POST data doesn't solve the problem because 
there are many other input vectors and data formats. It may block the 
most simple attacks because the attacker cannot inject enough keys to 
slow down your CPU.

Victor