Best practice for caching hash

Tue Mar 15 22:58:53 EDT 2022

On 16Mar2022 10:57, Chris Angelico <rosuav at gmail.com> wrote:
>> Is it sensible to compute the hash only from the immutable parts?
>> Bearing in mind that usually you need an equality function as well and
>> it may have the same stability issues.
>
>My understanding - and I'm sure Marco will correct me if I'm wrong
>here - is that this behaves like a tuple: if it contains nothing but
>hashable objects, it is itself hashable, but if it contains anything
>unhashable, the entire tuple isn't hashable.

A significant difference is that tuples have no keys, unlike a dict.

A hash does not have to hash all the internal state, ony the relevant 
state, and not even all of that. The objective of the hash is twofold to 
my mind:
- that "equal" objects (in the `==` sense) have the same hash, so that 
  they hash to the same backet in dicts and can therefore be found
- that hash values are widely distributed, to maximise the evenness of 
  the object distribution in buckets

For dicts to work, the former has to hold.

The latter has more flexibility. A dict has keys. If the dicts are quite 
varied (sets of tags for example), it may be effective to just hash the 
keys. But if the dict keys are similar (labelled CSV-rows-as-dicts, or 
likewise with database rows) this will go badly because the hashes will 
all (or maybe mostly) collide.

>As such, any valid hash value will be stable, and "asking for a hash
>will raise TypeError" is also stable.

I would seek to avoid a TypeError for a frozendict, but as you can see 
above I have not thought of a way to do that which would also have 
desireable hash characteristics in almost all circumstances. (I think we 
can accept that almost anything will have pathological cases, but the 
bad cases in my hash-the-keys notion are not, to my mind, rare.)

>> >The problem is the first time I get an error with details, for example:
>> >TypeError: unhashable type: 'list'
>> >The subsequent times I simply raise a generic error:
>> >TypeError
>>
>> You could, you know, cache the original exception. That does keep links
>> to the traceback and therefore all the associates stack frames, so that
>> isn't cheap (it is peerfectly fast - just the reference, t just prevents
>> those things from being reclaimed).
>
>I don't like that idea myself, for that exact reason - it'll keep
>arbitrary numbers of objects alive.

I don't like it either, for that exact reason. That reason is the same 
reason which has Python 3 exception variables get unset as you leave an 
`except` clause. I'm sure it irks everyone, but the memory penalty of 
not doing so is high.

>But caching the stringified form
>would be more reasonable here, and have similar effect.

Mmm, yes.

Cheers,
Cameron Simpson <cs at cskk.id.au>