[Python-Dev] Priority queue (binary heap) python code

Tim Peters tim@zope.com
Thu, 27 Jun 2002 12:19:09 -0400


[David Abrahams]
> Noticing that also left me with a question: how
> come everybody in the world hasn't stolen as much as
> possible from the Python hashing implementation?
> Are there a billion such 10-years'-tweaked
> implementations lying around which all perform
> comparably well?

[Gordon McMillan]
> Jean-Claude Wippler and Christian Tismer did some
> benchmarks against other implementations. IIRC, the
> only one in the same ballpark was Lua's (which, IIRC,
> was faster at least under some conditions).

I'd like to see the benchmark.  Like Python, Lua uses a power-of-2 table
size, but unlike Python uses linked lists for collisions instead of open
addressing.  This appears to leave it very vulnerable to bad cases (like
using

     [i << 16 for i in range(20000)]

as a set of keys -- Python and Lua both grab the last 15 bits of the ints as
their hash codes, which means every key maps to the same hash bucket.  Looks
like Lua would chain them all together.  Python breaks the ties quickly via
its collision resolution scrambling.).

The Lua string hash appears systematically vulnerable:

static unsigned long hash_s (const char *s, size_t l) {
  unsigned long h = l;  /* seed */
  size_t step = (l>>5)|1;  /* if string is too long, don't hash all its
chars */
  for (; l>=step; l-=step)
    h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++));
  return h;
}

That hash function would be weak even if it didn't ignore up to 97% of the
input characters.  OTOH, if it happens not to collide, ignoring up to 97% of
the characters eliminates up to 97% of the expense of computing a hash.

Etc.

Lua's hashes do appear to get a major benefit from lacking a Python feature:
user-defined comparisons can (a) raise exceptions, and (b) mutate the hash
table *while* you're looking for a key in it.  Those cause the Python
implementation lots of expensive pain (indeed, the main reason Python has a
distinct lookup function for string-keyed dicts is that it doesn't have to
choke itself worrying about #a or #b for builtin strings).

There's a lovely irony here.  Python's dicts are fast because they've been
optimized to death.  When Lua's dicts are fast, it seems more the case it's
because they don't worry much about bad cases.  That's *supposed* to be
Python's trick <wink>.