hashing strings to integers
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue May 27 13:02:50 EDT 2014
On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote:
> On 2014-05-23, Chris Angelico wrote:
>
>> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061 at ducksburg.com>
>> wrote:
>>> I've also used hashes of strings for other things involving
>>> deduplication or fast lookups (because integer equality is faster than
>>> string equality). I guess if it's just for deduplication, though, a
>>> set of byte arrays is as good as a set of int?
>>
>> Want a better way to do that? Use a set or dict and let Python do the
>> hashing. It'll be every bit as fast as explicit hashing, plus you get
>> the bonus of simplicity.
>
> Well, here's the way it works in my mind:
>
> I can store a set of a zillion strings (or a dict with a zillion
> string keys), but every time I test "if new_string in seen_strings",
> the computer hashes the new_string using some kind of "short hash",
> checks the set for matching buckets (I'm assuming this is how python
> tests set membership --- is that right?),
So far so good. That applies to all objects, not just strings.
> then checks any
> hash-matches for string equality. Testing string equality is slower
> than integer equality, and strings (unless they are really short)
> take up a lot more memory than long integers.
But presumably you have to keep the string around anyway. It's going to
be somewhere, you can't just throw the string away and garbage collect
it. The dict doesn't store a copy of the string, it stores a reference to
it, and extra references don't cost much.
As for string equality, in pseudo-code it looks something like this:
    def __eq__(self, other):
        # Consider only the case where other is a string as well.
        if self is other:  # object identity
            return True
        if len(self) != len(other):  # checking the length is fast
            return False
        # iterate over both strings in lock-step
        for a, b in zip(self, other):
            if a != b:
                return False
        return True
only written in fast C code. So the equality test bails out as quickly as
possible, and the only time you have to look at every character is if the
two strings are equal but not the same object.
MD5 hashing, on the other hand, *always* has to look at every character,
and perform quite a complicated set of computations. It's expensive.
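You can see the gap with a rough timing sketch (mine, not one of the measurements quoted in this thread; the sizes and iteration counts are arbitrary choices):

```python
import hashlib
from timeit import timeit

n = 100_000
a = "x" * n
b = "x" * n                    # equal to a, but a distinct object

# Identity comparison bails out immediately; comparing two equal but
# distinct strings is a fast C-level scan; MD5 has to do real
# arithmetic over every single byte.
t_identity = timeit("a == a", globals={"a": a}, number=1000)
t_full     = timeit("a == b", globals={"a": a, "b": b}, number=1000)
t_md5      = timeit("hashlib.md5(a.encode()).digest()",
                    globals={"hashlib": hashlib, "a": a}, number=1000)

print(t_identity, t_full, t_md5)
```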
> So for that kind of thing, I tend to store MD5 hashes or something
> similar. Is that wrong?
First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.
You're making the cardinal mistake of optimization: optimizing not what
*is* slow, but what you *assume* will be slow. (Or substitute memory
consumption for speed, if you prefer.) Python is not C, and your
intuitions about what's fast and slow (or memory-hungry and memory-light)
are likely to be way off. Consider
trying to store an MD5 hash instead of the string "Hello World!". If I
try this in Python 2.7, I get this:
py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30
I save a lousy 3 bytes. But in order to save that 3 bytes, I have to
generate an md5 object (32 bytes) and a hex string (53 bytes), both of
which take time to create and then time to garbage collect.
Actually, I don't even save those 3 bytes, since for nearly all
reasonable applications, I'll need to keep the string around somewhere.
So instead of saving three bytes, it's costing me thirty bytes for the
hash, plus who-knows-what needed to manage the book keeping to link that
hash to the actual string.
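Here's a sketch of that bookkeeping (my own illustration; on Python 3 the exact byte counts differ from the 2.7 figures above, but the shape of the result is the same):

```python
import hashlib
import sys

# To get from the MD5 int back to the real string you need some mapping,
# and the mapping itself costs memory on top of the string you keep anyway.
s = "Hello World!"
n = int(hashlib.md5(s.encode()).hexdigest(), 16)

lookup = {n: s}                # hash -> original string
print(sys.getsizeof(s))        # the string you were keeping anyway
print(sys.getsizeof(n))        # the 128-bit int
print(sys.getsizeof(lookup))   # plus the dict entry linking them
```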
Ah, but maybe we can save time hashing them?
py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191
Times are in seconds for one million iterations of the code being timed.
Or alternatively, it costs about 0.14 microseconds to hash the string,
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate
the MD5 hash in the first place, that takes *much* longer:
py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188
but merely to perform a lightweight hash on the long int so as to find
which bucket of the dict it belongs to.
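A quick sketch of that point (mine, not from the thread): the set or dict calls hash() on whatever key you hand it, so switching from a string key to a 128-bit int key doesn't skip that step.

```python
import hashlib

s = "Hello World!"
n = int(hashlib.md5(s.encode()).hexdigest(), 16)

# CPython hashes small ints as (almost) themselves, and long ints by
# folding their internal digits -- cheap either way, but not free.
print(hash(42))      # small ints hash to themselves
print(hash(n) == n)  # a 128-bit int doesn't: it's reduced mod 2**61 - 1
```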
So, yes, chances are **very** good that your supposed optimizations in
this area are actually pessimizations. If you have not measured where the
bottlenecks in your code are, you have no idea which parts of the code
need to be optimized.
Now, it is conceivable that there may be some circumstances where your
strategy makes sense. I'm guessing you would need something like this
before you even come close to saving memory and/or time:
- rather than short keys like "Hello World!", you are dealing with
keys that are typically many tens of thousands of characters long;
- most of the strings are the same length and have very similar
prefixes (e.g. the first 3000 chars are the same);
- rather than "zillions" of them, there are few enough of them that
  the chances of an MD5 collision are insignificant;
(Any MD5 collision is going to play havoc with your strategy of
using hashes as a proxy for the real string.)
- and you can arrange matters so that you never need to MD5 hash a
string twice.
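For comparison, here's a sketch of the two deduplication strategies under discussion (my own code, using hashlib rather than the old md5 module; the function names are invented for illustration):

```python
import hashlib

def dedup_plain(strings):
    # Plain-set version: let Python hash the strings for you.
    seen = set()
    out = []
    for s in strings:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def dedup_md5(strings):
    # MD5 version: store 16-byte digests as proxies for the strings.
    # Note the extra hashing work on *every* string, plus the tiny but
    # nonzero risk that an MD5 collision silently drops a unique string.
    seen = set()
    out = []
    for s in strings:
        d = hashlib.md5(s.encode()).digest()
        if d not in seen:
            seen.add(d)
            out.append(s)
    return out

data = ["spam", "ham", "spam", "eggs", "ham"]
assert dedup_plain(data) == dedup_md5(data) == ["spam", "ham", "eggs"]
```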
Is my guess closer to the actuality than yours? I don't know. I haven't
measured it either. But I know that Python is a high-level language with
lots of high-level data structures like dicts which trade-off time and
memory for programmer convenience, and that I'd want to see some real
benchmarks proving that my application was too slow before giving up that
convenience with a complicated strategy like this.
--
Steven D'Aprano
http://import-that.dreamwidth.org/