Defensive programming

Lulu of the Lotus-Eaters mertz at gnosis.cx
Mon Jun 2 13:28:28 EDT 2003


Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote previously:
|For the same reason everyone who uses Python gets array bounds
|checking whether they ask for it or not.  What makes you think that a
|collision resistant function is necessarily slower?

Scott Crosby, author of the "hash overrun" paper, proposes the UHASH
function as an alternative.  His OWN argument in favor of its speed is
that it becomes faster than Perl's hash technique for strings LONGER
than about 44 bytes.  The exact crossover point seems to depend on
whether the hand-tuned assembly version was used for testing (which he
vacillated on).

Crosby then characterizes UHASH as being "half as fast" as the Perl hash
at smaller sizes.  But even that is somewhat biased averaging.  From his
own chart at:

  http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/LengthEffects.png

For strings around 8-16 bytes, UHASH is quite a bit worse than half as
fast.  It does better at either the very small length, or close to the
cross-over point.  To my mind--albeit without any actual real-world
modelling--strings of about 8-16 bytes are the "sweet spot"... those are
probably the most common lengths for strings in programs (especially
ones received as user input).

|As for "more complicated", why should the user care whether the function
|is complicated as long as the Python implementation takes care of it?

Well...  Uncle Timmy cares about using non-portable and complicated
hashing algorithms.  He's the one that would need to maintain it (or at
least someone at Python Labs).  "Might be nice in a rare border case"
isn't enough to convince the actual Python developers to take on new
headaches.

Moreover, another thing Guido observes on python-dev is that hash() is
now part of Python's API, and part of its documented behavior is
returning the same value across different runs of the same Python
version.  Using a universal hash breaks that API.

|If a collision resistant hash really has to be slower than the regular
|version, then having a runtime switch to select which version to use.

That's exactly what we have right now:

    class InputString(str):
        def __hash__(self):
           return UHASH(self)
    user_input = InputString(user_input)

Yours, Lulu...

--
 mertz@   _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis  _/_/                    Postmodern Enterprises         _/_/  s r
.cx    _/_/  MAKERS OF CHAOS....                              _/_/   i u
      _/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/    g s






More information about the Python-list mailing list