Best practice for caching hash

Wed Mar 16 01:58:22 EDT 2022

On Wed, 16 Mar 2022 at 14:00, Cameron Simpson <cs at cskk.id.au> wrote:
>
> On 16Mar2022 10:57, Chris Angelico <rosuav at gmail.com> wrote:
> >> Is it sensible to compute the hash only from the immutable parts?
> >> Bearing in mind that usually you need an equality function as well and
> >> it may have the same stability issues.
> >
> >My understanding - and I'm sure Marco will correct me if I'm wrong
> >here - is that this behaves like a tuple: if it contains nothing but
> >hashable objects, it is itself hashable, but if it contains anything
> >unhashable, the entire tuple isn't hashable.
>
> A significant difference is that tuples have no keys, unlike a dict.
>
> A hash does not have to hash all the internal state, ony the relevant
> state, and not even all of that. The objective of the hash is twofold to
> my mind:
> - that "equal" objects (in the `==` sense) have the same hash, so that
>   they hash to the same backet in dicts and can therefore be found
> - that hash values are widely distributed, to maximise the evenness of
>   the object distribution in buckets
>
> For dicts to work, the former has to hold.

Python only demands the first one. And Python's own integers hash to
themselves, which isn't what I'd call "widely distributed", but which
works well in practice.

> The latter has more flexibility. A dict has keys. If the dicts are quite
> varied (sets of tags for example), it may be effective to just hash the
> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
> likewise with database rows) this will go badly because the hashes will
> all (or maybe mostly) collide.

The general recommendation is to use all the same data for hashing as
you use for equality checking. What's the point of just hashing the
keys, if the values also contribute to equality? I'd just keep it
simple, and hash both.

ChrisA