Mutable objects which define __hash__ (was Re: Why are tuples immutable?)

Nick Coghlan ncoghlan at iinet.net.au
Wed Dec 29 07:47:55 EST 2004


<Sections cut where I don't feel I have anything to add to what Bengt already said>

Bengt Richter wrote:
> A second time a key may be hashed is when it is used as a lookup key. This can be a reference to
> the identical key object first used, or it can be a new object. A new object has to be hashed in order
> to have a hash value to use in finding candidate keys to compare to. If _this_ object
> is used as a key again, it must be hashed again -- unless it is an object that caches its own
> hash and invalidates it when mutated so it can rehash itself if its __hash__ method is subsequently
> called.
> 
> Thus re-use of a different lookup key than the original provides an opportunity for a performance
> gain in the lookup if the key object caches it hash value for return by __hash__.
> That's what I meant by "why not?" in response to your assertion that the constant hash restriction
> doesn't (ok, I took that as a "can't" ;-) provide any performance gain.

I'm not sure I'm following here. . . for the case of an object caching its own 
hash, that seems to be a performance gain that is independent of the dictionary 
implementation. From the dictionary's point of view, it just calls hash() and 
works with the result, exactly as it does now.

I agree hash caching gives good optimisation - my point was that hash caching 
can be just as effective in the presence of mutable keys, so long as there is 
some way to invalidate the cached hash values when a mutation occurs.

> Given that you assume you know when "rehashing" is going to be necessary -- meaning you
> know when keys mutate.

Well, yes. Antoon's original complaint was that mutable objects could not be 
used as keys in dictionary, even if the application guaranteed key stability for 
the lifetime of the relevant dictionary entry.

I gave dict.rehash as a method equivalent to list.sort when working with 
mutables keys or list entries, respectively.

>>Or resist the temptation to guess, and have rehash() raise an exception :)
> 
> Easy out ;-)

I certainly thought so }:>

>>The identity keyed dict was based on Antoon's posted use case: he wanted to 
>>classify mutable objects, and then check for membership in those groups.
> 
> I thought he just wanted to use mutables as if they were immutable, so that
> equal mutable keys would look up the same value in a dict.

That was in the original post, yes. Later, he gave an example that ran something 
like:

   for node in graph:
     if node in some_dict:
       # Do NOT mutate this node
       # Possibly do other things
       continue
     # Do something that mutates the (non-key) node

>>You can use lists for this, but the lookup is slow. Conventional dictionaries 
>>and sets give a fast lookup, but require that the items in use be hashable.
>>
> 
> A constant hash works. But since it clogs a single bucket and gives no clue
> of value mismatch, we'd have to time it to see how it compares with list.index.
> Did the Timbot do both?

Write them both? I don't know.

However, list.index is a simple linear search of the array of objects (I was 
reading that code recently for unrelated reasons).

The dict hash collision resolution was almost certainly written by Tim - he 
definitely wrote the comments at the top of the file trying to explain how the 
heck it works. I have no real clue as to the performance of the lookup algorithm 
in the degenerate case of a large number of distinct keys having the same hash 
value, but reading the code suggests it would perform worse than the simple 
integer increment that list.index uses (as the dict lookup requires multiple 
if-tests and so forth on each iteration).

>>An identity dictionary (or set) solves the 'object classification' problem 
>>perfectly.
> 
> I missed the definition of those terms I'm afraid ;-)

It was in some other part of this rambling thread. It's easier to repeat the 
idea than it is to go find where that was :)

An 'identity dictionary' is just a dict that looks up keys based on identity 
rather than value. It ignores the keys' actual hash and comparison operations, 
and forces use of the default hash for hashing, and identity comparison for 
equality testing. It maps from key _objects_ to values, rather than from key 
values to other values.

An 'identity set' is simply the same idea applied to a set instead of a 
dictionary. Conceptually, it stores sets of arbitrary objects, rather than sets 
of values.

The 'object classification' problem is simply the situation where you have an 
existing collection of objects (e.g. graphs of a node), and you want to classify 
them according to some criteria. Dictionaries or sets are a nice way to do that, 
but the standard types are a pain if your original objects are mutable (this 
inconvenience being the basis of Antoon's original complaint).

For a classification problem, though, we're likely only interested in what 
categories each of the original objects falls into - so identity-based storage 
will usually be entirely adequate (and, in some cases, actually better then 
value-based storage, since we can avoid irrelevant comparisons of potentially 
complex objects). And the performance should romp it in when compared to using 
lists for the categorisation.

Hence why I suggested Antoon should consider pursuing collections.identity_dict 
and collections.identity_set if identity-based lookup would actually address his 
requirements. Providing these two data types seemed like a nice way to do an end 
run around the bulk of the 'potentially variable hash' key problem.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at email.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://boredomandlaziness.skystorm.net



More information about the Python-list mailing list