[Python-ideas] incremental hashing in __hash__

jab at math.brown.edu
Thu Dec 29 15:24:10 EST 2016


Thanks for the thoughtful discussion, it's been very interesting.

Hash algorithms seem particularly sensitive and tricky to get right, with a
great deal of research going into choices of constants, etc., and lots of
gotchas, so it seemed worth asking about. Before relying on repeatedly
accumulating pairwise hash() results to maintain the same desired properties
that hash(tuple(items)) guarantees, I'd want confirmation from someone with
expertise in this area, hence my starting this thread.

But as you showed, it's certainly possible to do some exploration in the
meantime. Prompted by your helpful comparison, I just put together
https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
compare hash_tuple vs. hash_incremental.

Based on that, hash_incremental seems to have a comparable distribution to
hash_tuple. I'm not sure if the methodology there is sound, as I'm new to
analysis like this. So I'd still welcome confirmation from someone with
actual expertise in Python's internal hash algorithms. But so far this sure
seems good enough for the use cases I described.
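
For anyone who'd rather not open the gist, here is a minimal sketch of the
kind of comparison it performs (the names hash_tuple and hash_incremental are
the ones used above; the gist's actual methodology is more involved):

    import statistics

    def hash_tuple(items):
        # Baseline: hash a tuple of all the items at once (O(n) extra space).
        return hash(tuple(items))

    def hash_incremental(items):
        # Fold each item into a running hash, one pair at a time (O(1) extra space).
        h = hash(())
        for item in items:
            h = hash((h, item))
        return h

    samples = [range(i, i + 100) for i in range(1000)]
    for name, fn in [("tuple", hash_tuple), ("incremental", hash_incremental)]:
        hashes = [fn(s) for s in samples]
        collisions = len(hashes) - len(set(hashes))
        print(name, "collisions:", collisions, "stdev:", statistics.stdev(hashes))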

Given sufficiently good distribution, I'd expect there to be unanimous
agreement that an immutable collection, which could contain arbitrarily
many items, should strongly prefer hash_incremental(self) over
hash(tuple(self)), for the same reason we use generator comprehensions
instead of list comprehensions when appropriate. Please correct me if I'm
wrong.
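
For concreteness, here is a rough sketch of the sort of class I have in mind
(FrozenItems is a made-up name, and the hash caching is incidental to the
point):

    class FrozenItems:
        """Toy immutable collection whose __hash__ never builds a tuple of all items."""

        def __init__(self, iterable):
            self._items = list(iterable)
            self._hash = None

        def __iter__(self):
            return iter(self._items)

        def __eq__(self, other):
            if not isinstance(other, FrozenItems):
                return NotImplemented
            return self._items == other._items

        def __hash__(self):
            if self._hash is None:
                # hash(tuple(self._items)) would allocate an O(n) temporary tuple;
                # folding the item hashes together keeps extra space at O(1).
                h = hash(())
                for item in self._items:
                    h = hash((h, item))
                self._hash = h
            return self._hash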

+1 for the "hash.from_iterable" API you proposed, if some additional
support for this is added to Python. Otherwise +1 for including Ned's
recipe in the docs. Again, happy to submit a patch for either of these if
it would be helpful.
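
To spell out what I'm +1-ing: as far as I can tell, both the hash.from_iterable
API and the itertools.iterhash alternative mentioned below could be
approximated today with small pure-Python helpers along these lines
(illustrative stand-ins only, not the actual proposed implementations):

    def hash_from_iterable(iterable):
        # Stand-in for the proposed hash.from_iterable(): one hash for all
        # items, computed without building a tuple of the whole iterable.
        h = hash(())
        for item in iterable:
            h = hash((h, item))
        return h

    def iterhash(iterable):
        # Stand-in for the proposed itertools.iterhash(): yields the running
        # hash after each item, for callers that want intermediate values.
        h = hash(())
        for item in iterable:
            h = hash((h, item))
            yield h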

And to be clear, I really appreciate the time that contributors have put
into this thread, and into Python in general. Thoughtful responses are
always appreciated, and never expected. I'm just interested in learning and
in helping improve Python when I might have an opportunity. My Python open
source work has been done on a voluntary basis too, and I haven't even
gotten to use Python for paid/closed source work in several years, alas.

Thanks again,
Josh

On Thu, Dec 29, 2016 at 3:20 AM, Steven D'Aprano <steve at pearwood.info>
wrote:

> On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab at math.brown.edu wrote:
> > On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan at stoneleaf.us>
> > wrote:
> >
> > > In other words, objects that do not compare equal can also have the
> > > same hash value (although too much of that will reduce the efficiency
> > > of Python's containers).
> > >
> >
> > Yes, I realize that unequal objects can return the same hash value with
> > only performance, and not correctness, suffering. It's the performance
> > I'm concerned about. That's what I meant by "...to keep from unnecessarily
> > causing hash collisions..." in my original message, but sorry this wasn't
> > clearer. We should be able to do this in a way that doesn't increase hash
> > collisions unnecessarily.
>
> With respect Josh, I feel that this thread is based on premature
> optimization. It seems to me that you're *assuming* that anything less
> than some theoretically ideal O(1) space O(N) time hash function is
> clearly and obviously unsuitable.
>
> Of course I might be completely wrong. Perhaps you have implemented your
> own __hash__ methods as suggested by the docs, as well as Ned's version,
> and profiled your code and discovered that __hash__ is a significant
> bottleneck. In which case, I'll apologise for doubting you, but in my
> defence I'll say that the language you have used in this thread so far
> gives no hint that you've actually profiled your code and found the
> bottleneck.
>
> As I see it, this thread includes a few questions:
>
> (1) What is a good way to generate a hash one item at a time?
>
> I think Ned's answer is "good enough", subject to evidence to the
> contrary. If somebody cares to spend the time to analyse it, that's
> excellent, but we're volunteers and it's the holiday period and most
> people have probably got better things to do. But we shouldn't let the
> perfect be the enemy of the good.
>
> But for what it's worth, I've done an *extremely* quick and dirty test
> to see whether the incremental hash function gives a good spread of
> values, by comparing it to the standard hash() function.
>
>
> py> import statistics
> py> def incrhash(iterable):
> ...     h = hash(())
> ...     for x in iterable:
> ...             h = hash((h, x))
> ...     return h
> ...
> py>
> py> data1 = []
> py> data2 = []
> py> for i in range(1000):
> ...     it = range(i, i+100)
> ...     data1.append(hash(tuple(it)))
> ...     data2.append(incrhash(it))
> ...
> py> # Are there any collisions?
> ... len(set(data1)), len(set(data2))
> (1000, 1000)
> py> # compare spread of values
> ... statistics.stdev(data1), statistics.stdev(data2)
> (1231914201.0980585, 1227850884.443638)
> py> max(data1)-min(data1), max(data2)-min(data2)
> (4287398438, 4287569008)
>
>
> Neither the built-in hash() nor the incremental hash gives any
> collisions over this (admittedly small) data set, and both have very
> similar spreads of values as measured by either the standard deviation
> or the statistical range. The means are quite different:
>
> py> statistics.mean(data1), statistics.mean(data2)
> (-8577110.944, 2854438.568)
>
> but I don't think that matters. So that's good enough for me.
>
>
> (2) Should Ned's incremental hash, or some alternative with better
> properties, go into the standard library?
>
> I'm not convinced that your examples are common enough that the stdlib
> should be burdened with supporting it. On the other hand, I don't think
> it is an especially *large* burden, so perhaps it could be justified.
> Count me as sitting on the fence on this one.
>
> Perhaps a reasonable compromise would be to include it as a recipe in
> the docs.
>
>
> (3) If it does go in the stdlib, where should it go?
>
> I'm suspicious of functions that change their behaviour depending on how
> they are called, so I'm not keen on your suggestion of adding a second
> API to the hash built-in:
>
>     hash(obj)  # return hash of obj
>
>     hash(iterable=obj)  # return incrementally calculated hash of obj
>
> That feels wrong to me. I'd rather add a generator to the itertools
> module:
>
>     itertools.iterhash(iterable)  # yield incremental hashes
>
> or, copying the API of itertools.chain, add a method to hash:
>
>     hash.from_iterable(iterable)  # return hash calculated incrementally
>
>
>
> --
> Steve