[Python-Dev] Algoritmic Complexity Attack on Python

Scott A Crosby scrosby@cs.rice.edu
30 May 2003 11:18:14 -0500


On Fri, 30 May 2003 07:39:18 -0400, Guido van Rossum <guido@python.org> writes:

> [Tim Peters]
> > > I'm uninterested in trying to "do something" about these.  If
> > > resource-hogging is a serious potential problem in some context, then
> > > resource limitation is an operating system's job, and any use of Python (or
> > > Perl, etc) in such a context should be under the watchful eyes of OS
> > > subsystems that track actual resource usage.
> 
> [Scott Crosby]
> > I disagree. Changing the hash function eliminates these attacks on
> > hash tables.
> 
> At what cost for Python?  99.99% of all Python programs are not
> vulnerable to this kind of attack, because they don't take huge
> amounts of arbitrary input from an untrusted source.  If the hash
> function you propose is even a *teensy* bit slower than the one we've
> got now (and from your description I'm sure it has to be), everybody

We included several benchmarks in our paper:

On here, 
   http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/CacheEffects2.png

when we compare hash functions as the working set changes, we notice
that a single L2 cache miss far exceeds hashing time for all
algorithms.

On 

   http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/LengthEffects.png

UHASH exceeds the performance of perl's hash function, which is
simpler than your own.

Even for small strings, UHASH is only about half the speed of perl's
hash function, and your function already performs a multiplication per
byte:

#define HASH(hi,ho,c)    ho = (1000003*hi) ^ c
#define HASH0(ho,c)      ho = ((c << 7)*1000003) ^ c

The difference between this and CW12 is one 32 bit modulo
operation. (Please note that CW12 is currently broken. Fortunately it
didn't affect the benchmarking on x86.)

> would be paying for the solution to a problem they don't have.  You
> keep insisting that you don't know Python.  Hashing is used an awful
> lot in Python -- as an interpreted language, most variable lookups and
> all method and instance variable lookups use hashing.  So this would
> affect every Python program.

Have you done benchmarking to prove that string_hash is in fact an
important hotspot in the python interpreter? If so, and doing one
modulo operation per string is unacceptable, then you may wish to
consider Jenkin's hash. The linux kernel people are switching to using
a keyed veriant of Jenkin's hash. However, Jenkin's, AFAIK, has no
proofs that it is in fact universal. It, however, probably is safe.

It is not unlikely that if you went that route you'd be somewhat
safer, and faster, but if you want full safety, you'd need to go with
a universal hash.

Scott