Best practice for caching hash

Wed Mar 16 03:47:05 EDT 2022

On 16Mar2022 16:58, Chris Angelico <rosuav at gmail.com> wrote:
>On Wed, 16 Mar 2022 at 14:00, Cameron Simpson <cs at cskk.id.au> wrote:
>> On 16Mar2022 10:57, Chris Angelico <rosuav at gmail.com> wrote:
>> A significant difference is that tuples have no keys, unlike a dict.
>>
>> A hash does not have to hash all the internal state, ony the relevant
>> state, and not even all of that. The objective of the hash is twofold to
>> my mind:
>> - that "equal" objects (in the `==` sense) have the same hash, so that
>>   they hash to the same backet in dicts and can therefore be found
>> - that hash values are widely distributed, to maximise the evenness of
>>   the object distribution in buckets
>>
>> For dicts to work, the former has to hold.
>
>Python only demands the first one.

I think we're in violent agreement here.

>And Python's own integers hash to
>themselves, which isn't what I'd call "widely distributed", but which
>works well in practice.

This may be because the "raw" hash (in this case the int value) is 
itself further hashed (giving more evenness) and then moduloed into the 
finite number of slots in the dict.

>> The latter has more flexibility. A dict has keys. If the dicts are quite
>> varied (sets of tags for example), it may be effective to just hash the
>> keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or
>> likewise with database rows) this will go badly because the hashes will
>> all (or maybe mostly) collide.
>
>The general recommendation is to use all the same data for hashing as
>you use for equality checking. What's the point of just hashing the
>keys, if the values also contribute to equality? I'd just keep it
>simple, and hash both.

Well, hash functions want to be fast. The case where you look up a key 
in a dict (which needs both the hash and a comparison) is not the only 
situation. When a dict's bucket array is resized, the hash functions of 
all stored objects require recomputation, but the equality test is _not_ 
required. Is the hash is very cheap, that is a large benefit here.

I just took a look at the CPython hashtable.c, you can see the rehashing 
process here:

    https://github.com/python/cpython/blob/7c776521418676c074a483266339d31c950f516e/Python/hashtable.c#L280

Just an aside, that's got a bunch of interesting opimitsations in it, 
people has put serious work into making this fast and it remains very 
readable!

Suppose I have a nested tree of (hashable) frozen dicts of whatever 
flavour. An equality test requires a deep comparison all the way down 
including the values. You could make a much cheaper hash function which 
hashed just keys, yea, even only unto the upper layer or 2, and still 
meet the primary criterion: all equal items have the same hash value.

My recommendation is not inherently to hash everything involved in the 
equality test. If that's cheap, then sure. But it isn't necessary and 
(given a suitable use domain) it may even be very detrimental to hash 
everything involved in equality, as in the example above.

Cheers,
Cameron Simpson <cs at cskk.id.au>