[Python-Dev] Algorithmic Complexity Attack on Python

Tim Peters tim.one@comcast.net
Fri, 30 May 2003 19:07:38 -0400


[Scott Crosby]
> ...
> Have you done benchmarking to prove that string_hash is in fact an
> important hotspot in the python interpreter?

It depends on the specific application, of course.  The overall speed of
dicts is crucial, but in many "namespace" dict apps the strings are
interned, implying (among other things) that a hash code is computed only
once per string.  In those apps the speed of the string hash function isn't
so important.  Overall, though, I'd welcome a faster string hash, and I
agree that Python's isn't particularly zippy.
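
To make the "computed only once" point concrete: CPython caches a
string's hash in the string object after the first computation, so with
interned keys the hash function runs once per distinct string, period.
A quick sketch (modern Python shown; exact timings will vary):

    import timeit

    s = "k" * 10**6                                   # one big string object

    cold = timeit.timeit(lambda: hash(s), number=1)   # computes and caches
    warm = timeit.timeit(lambda: hash(s), number=1)   # just reads the cache
    print("first hash: %.6fs   second hash: %.6fs" % (cold, warm))

The second call should come back orders of magnitude faster than the
first, which is why a slow string hash barely registers in those apps.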

OTOH, it's just a couple lines of C that run fine on everything from Palm
Pilots to Crays, and have created no portability or maintenance headaches.
Browsing your site, I see you've got over 100KB of snaky C code to implement
hashing functions, some with known bugs, others with cautions about open
questions wrt platforms with different endianness and word sizes than the
code was initially tested on.  Compared to what Python is using now, that's
a maintenance nightmare.

Note that Python's hash API doesn't return 32 bits, it returns a hash code
of the same size as the native C long.  The multiplication gimmick extends
to that width without any extra pain.
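
For the curious, here's a Python sketch of the scheme (modeled on
string_hash in Objects/stringobject.c; the bits parameter stands in for
whatever width the native C long happens to have):

    def string_hash(s, bits=64):
        # Sketch of CPython's multiplicative string hash, masked to an
        # arbitrary word width; the real C code just lets a long wrap.
        mask = (1 << bits) - 1
        if not s:
            return 0
        x = ord(s[0]) << 7
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & mask
        x ^= len(s)
        if x == mask:      # the C code reserves -1 to mean "not computed yet"
            x -= 1
        return x

    print(hex(string_hash("Python", 32)), hex(string_hash("Python", 64)))

Same couple of lines whether a long is 32 bits or 64.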

Other points that arise in practical deployment:

+ Python dicts can be indexed by many kinds of immutable objects.
  Strings are just one kind of key, and Python has many hash functions.

+ If I understand what you're selling, the hash code of a given string
  will almost certainly change across program runs.  That's a very
  visible change in semantics, since hash() is a builtin Python
  function available to user code.  Some programs use hash codes to
  index into persistent (file- or database-based) data structures, and
  such code would plain break if the hash code of a string changed
  from one run to the next (see the sketch after this list).  I expect
  the user-visible hash() would have to continue using a predictable
  function.

+ Some Python apps run for months, and universal hashing doesn't remove
  the possibility of quadratic-time behavior.  If I can poke at a
  long-running app and observe its behavior, over time I can deduce a
  set of keys that collide badly for any hashing scheme fixed when
  the program starts.  In that sense I don't believe this gimmick
  wholly plugs the hole it's trying to plug.
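
To illustrate the persistence point, here's a hypothetical layout (the
names NBUCKETS and bucket_file are made up for illustration, not from
any real app) that bakes hash() into its on-disk addressing:

    NBUCKETS = 64

    def bucket_file(key):
        # Hypothetical persistent layout: the record for `key` lives in
        # the file numbered by the *current* hash of the key.
        return "bucket%02d.dat" % (hash(key) % NBUCKETS)

    # One run files "alice" under, say, bucket17.dat.  If string hashing
    # changed from run to run, the next run could compute a different
    # bucket number and never find the record again.
    print(bucket_file("alice"))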

> If so, and doing one modulo operation per string is unacceptable,

If it's mod by a prime, probably.  Some architectures Python runs on require
hundreds of cycles to do an integer mod, and we don't have the resources to
construct custom mod-by-an-int shortcut code for dozens of obscure
architectures.
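
Note too that CPython's dicts already dodge the mod entirely: the tables
are sized to powers of two, so reducing a hash code to a slot index is a
single AND.  Roughly (the real probing loop in Objects/dictobject.c is
fancier):

    def slot_for(h, table_size):
        # table_size must be a power of two; the mask replaces an
        # integer mod when reducing a hash code to a slot index.
        assert table_size and table_size & (table_size - 1) == 0
        return h & (table_size - 1)

    print(slot_for(hash("some key"), 8))   # an index in range(8)

Adding a mod-by-a-prime per lookup would reintroduce exactly the cost
that masking was chosen to avoid.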

> then you may wish to consider Jenkins's hash. The Linux kernel people
> are switching to using a keyed variant of Jenkins's hash. However,
> Jenkins's, AFAIK, has no proofs that it is in fact universal. It,
> however, probably is safe.

Nobody writing a Python program *has* to use a dict.  That dicts have
quadratic-time worst-case behavior isn't hidden, and there's no cure for
that short of switching to a data structure with better worst-case bounds.
I certainly agree it's something for programmers to be aware of.  I still
don't see any reason for the core language to care about this, though.
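
If you want to see that worst case without any hash-function cleverness
at all, force every key to collide; insertion time then grows
quadratically.  A sketch:

    import time

    class Key:
        # Every instance hashes alike, so every insert walks a full
        # collision chain before finding an empty slot.
        def __init__(self, v):
            self.v = v
        def __hash__(self):
            return 42
        def __eq__(self, other):
            return self.v == other.v

    for n in (500, 1000, 2000):
        keys = [Key(i) for i in range(n)]
        t0 = time.perf_counter()
        d = dict.fromkeys(keys)
        print("%5d keys: %.3fs" % (n, time.perf_counter() - t0))

    # Doubling n roughly quadruples the time -- the quadratic worst case.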