hashing strings to integers
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue May 27 13:02:50 EDT 2014
On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote:
> On 2014-05-23, Chris Angelico wrote:
>
>> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061 at ducksburg.com>
>> wrote:
>>> I've also used hashes of strings for other things involving
>>> deduplication or fast lookups (because integer equality is faster than
>>> string equality). I guess if it's just for deduplication, though, a
>>> set of byte arrays is as good as a set of int?
>>
>> Want a better way to do that? Use a set or dict and let Python do the
>> hashing. It'll be every bit as fast as explicit hashing, plus you get
>> the bonus of simplicity.
>
> Well, here's the way it works in my mind:
>
> I can store a set of a zillion strings (or a dict with a zillion
> string keys), but every time I test "if new_string in seen_strings",
> the computer hashes the new_string using some kind of "short hash",
> checks the set for matching buckets (I'm assuming this is how python
> tests set membership --- is that right?),
So far so good. That applies to all objects, not just strings.
> then checks any
> hash-matches for string equality. Testing string equality is slower
> than integer equality, and strings (unless they are really short)
> take up a lot more memory than long integers.
But presumably you have to keep the string around anyway. It's going to
be somewhere, you can't just throw the string away and garbage collect
it. The dict doesn't store a copy of the string, it stores a reference to
it, and extra references don't cost much.
As for string equality, in pseudo-code it looks something like this:
    def __eq__(self, other):
        # Consider only the case where other is a string as well.
        if self is other:  # object identity
            return True
        if len(self) != len(other):  # checking the length is fast
            return False
        # iterate over both strings in lock-step
        for a, b in zip(self, other):
            if a != b:
                return False
        return True
only written in fast C code. So the equality test bails out as quickly as
possible, and the only time you have to look at every character is if the
two strings are equal but not the same object.
MD5 hashing, on the other hand, *always* has to look at every character,
and perform quite a complicated set of computations. It's expensive.
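You can see the gap with a rough timing sketch (mine, not one of the measurements quoted in this thread; the sizes and iteration counts are arbitrary choices):

```python
import hashlib
from timeit import timeit

n = 100_000
a = "x" * n
b = "x" * n                    # equal to a, but a distinct object

# Identity comparison bails out immediately; comparing two equal but
# distinct strings is a fast C-level scan; MD5 has to do real
# arithmetic over every single byte.
t_identity = timeit("a == a", globals={"a": a}, number=1000)
t_full     = timeit("a == b", globals={"a": a, "b": b}, number=1000)
t_md5      = timeit("hashlib.md5(a.encode()).digest()",
                    globals={"hashlib": hashlib, "a": a}, number=1000)

print(t_identity, t_full, t_md5)
```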
> So for that kind of thing, I tend to store MD5 hashes or something
> similar. Is that wrong?
First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.
You're making the cardinal mistake of optimization: optimizing not what
*is* slow, but what you *assume* will be slow. (Or substitute memory
consumption for speed, if you prefer.) Python is not C, and your
intuitions about what's fast and slow (or memory-hungry and memory-light)
are likely to be way off. Consider
trying to store an MD5 hash instead of the string "Hello World!". If I
try this in Python 2.7, I get this:
py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30
I save a lousy 3 bytes. But in order to save that 3 bytes, I have to
generate an md5 object (32 bytes) and a hex string (53 bytes), both of
which take time to create and then time to garbage collect.
Actually, I don't even save those 3 bytes, since for nearly all
reasonable applications, I'll need to keep the string around somewhere.
So instead of saving three bytes, it's costing me thirty bytes for the
hash, plus who-knows-what needed to manage the book keeping to link that
hash to the actual string.
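Here's a sketch of that bookkeeping (my own illustration; on Python 3 the exact byte counts differ from the 2.7 figures above, but the shape of the result is the same):

```python
import hashlib
import sys

# To get from the MD5 int back to the real string you need some mapping,
# and the mapping itself costs memory on top of the string you keep anyway.
s = "Hello World!"
n = int(hashlib.md5(s.encode()).hexdigest(), 16)

lookup = {n: s}                # hash -> original string
print(sys.getsizeof(s))        # the string you were keeping anyway
print(sys.getsizeof(n))        # the 128-bit int
print(sys.getsizeof(lookup))   # plus the dict entry linking them
```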
Ah, but maybe we can save time hashing them?
py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191
Times are in seconds for one million iterations of the code being timed.
Or alternatively, it costs about 0.14 microseconds to hash the string,
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate
the MD5 hash in the first place, that takes *much* longer:
py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188
but merely to perform a lightweight hash on the long int so as to find
which bucket of the dict it belongs to.
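A quick sketch of that point (mine, not from the thread): the set or dict calls hash() on whatever key you hand it, so switching from a string key to a 128-bit int key doesn't skip that step.

```python
import hashlib

s = "Hello World!"
n = int(hashlib.md5(s.encode()).hexdigest(), 16)

# CPython hashes small ints as (almost) themselves, and long ints by
# folding their internal digits -- cheap either way, but not free.
print(hash(42))      # small ints hash to themselves
print(hash(n) == n)  # a 128-bit int doesn't: it's reduced mod 2**61 - 1
```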
So, yes, chances are **very** good that your supposed optimizations in
this area are actually pessimizations. If you have not measured where the
bottlenecks in your code are, you have no idea which parts of the code
need to be optimized.
Now, it is conceivable that there may be some circumstances where your
strategy makes sense. I'm guessing you would need something like this
before you even come close to saving memory and/or time:
- rather than short keys like "Hello World!", you are dealing with
keys that are typically many tens of thousands of characters long;
- most of the strings are the same length and have very similar
prefixes (e.g. the first 3000 chars are the same);
- rather than "zillions" of them, there are few enough of them that
  the chances of an MD5 collision are insignificant;
(Any MD5 collision is going to play havoc with your strategy of
using hashes as a proxy for the real string.)
- and you can arrange matters so that you never need to MD5 hash a
string twice.
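For comparison, here's a sketch of the two deduplication strategies under discussion (my own code, using hashlib rather than the old md5 module; the function names are invented for illustration):

```python
import hashlib

def dedup_plain(strings):
    # Plain-set version: let Python hash the strings for you.
    seen = set()
    out = []
    for s in strings:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def dedup_md5(strings):
    # MD5 version: store 16-byte digests as proxies for the strings.
    # Note the extra hashing work on *every* string, plus the tiny but
    # nonzero risk that an MD5 collision silently drops a unique string.
    seen = set()
    out = []
    for s in strings:
        d = hashlib.md5(s.encode()).digest()
        if d not in seen:
            seen.add(d)
            out.append(s)
    return out

data = ["spam", "ham", "spam", "eggs", "ham"]
assert dedup_plain(data) == dedup_md5(data) == ["spam", "ham", "eggs"]
```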
Is my guess closer to the actuality than yours? I don't know. I haven't
measured it either. But I know that Python is a high-level language with
lots of high-level data structures like dicts which trade-off time and
memory for programmer convenience, and that I'd want to see some real
benchmarks proving that my application was too slow before giving up that
convenience with a complicated strategy like this.
--
Steven D'Aprano
http://import-that.dreamwidth.org/