hashing strings to integers

Adam Funk a24061 at ducksburg.com
Tue May 27 11:13:46 EDT 2014


On 2014-05-23, Chris Angelico wrote:

> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>> I've also used hashes of strings for other things involving
>> deduplication or fast lookups (because integer equality is faster than
>> string equality).  I guess if it's just for deduplication, though, a
>> set of byte arrays is as good as a set of int?
>
> Want a better way to do that? Use a set or dict and let Python do the
> hashing. It'll be every bit as fast as explicit hashing, plus you get
> the bonus of simplicity.

Well, here's the way it works in my mind:

   I can store a set of a zillion strings (or a dict with a zillion
   string keys), but every time I test "if new_string in
   seen_strings", the computer hashes the new_string using some kind
   of "short hash", checks the set for matching buckets (I'm assuming
   this is how python tests set membership --- is that right?), then
   checks any hash-matches for string equality.  Testing string
   equality is slower than integer equality, and strings (unless they
   are really short) take up a lot more memory than long integers.

So for that kind of thing, I tend to store MD5 hashes or something
similar.  Is that wrong?


-- 
With the breakdown of the medieval system, the gods of chaos, lunacy,
and bad taste gained ascendancy.
                                                --- Ignatius J Reilly



More information about the Python-list mailing list