Mutable objects which define __hash__ (was Re: Why are tuples immutable?)
Nick Coghlan
ncoghlan at iinet.net.au
Wed Dec 29 07:47:55 EST 2004
<Sections cut where I don't feel I have anything to add to what Bengt already said>
Bengt Richter wrote:
> A second time a key may be hashed is when it is used as a lookup key. This can be a reference to
> the identical key object first used, or it can be a new object. A new object has to be hashed in order
> to have a hash value to use in finding candidate keys to compare to. If _this_ object
> is used as a key again, it must be hashed again -- unless it is an object that caches its own
> hash and invalidates it when mutated so it can rehash itself if its __hash__ method is subsequently
> called.
>
> Thus re-use of a different lookup key than the original provides an opportunity for a performance
> gain in the lookup if the key object caches it hash value for return by __hash__.
> That's what I meant by "why not?" in response to your assertion that the constant hash restriction
> doesn't (ok, I took that as a "can't" ;-) provide any performance gain.
I'm not sure I'm following here. . . for the case of an object caching its own
hash, that seems to be a performance gain that is independent of the dictionary
implementation. From the dictionary's point of view, it just calls hash() and
works with the result, exactly as it does now.
I agree hash caching gives good optimisation - my point was that hash caching
can be just as effective in the presence of mutable keys, so long as there is
some way to invalidate the cached hash values when a mutation occurs.
> Given that you assume you know when "rehashing" is going to be necessary -- meaning you
> know when keys mutate.
Well, yes. Antoon's original complaint was that mutable objects could not be
used as keys in dictionary, even if the application guaranteed key stability for
the lifetime of the relevant dictionary entry.
I gave dict.rehash as a method equivalent to list.sort when working with
mutables keys or list entries, respectively.
>>Or resist the temptation to guess, and have rehash() raise an exception :)
>
> Easy out ;-)
I certainly thought so }:>
>>The identity keyed dict was based on Antoon's posted use case: he wanted to
>>classify mutable objects, and then check for membership in those groups.
>
> I thought he just wanted to use mutables as if they were immutable, so that
> equal mutable keys would look up the same value in a dict.
That was in the original post, yes. Later, he gave an example that ran something
like:
for node in graph:
if node in some_dict:
# Do NOT mutate this node
# Possibly do other things
continue
# Do something that mutates the (non-key) node
>>You can use lists for this, but the lookup is slow. Conventional dictionaries
>>and sets give a fast lookup, but require that the items in use be hashable.
>>
>
> A constant hash works. But since it clogs a single bucket and gives no clue
> of value mismatch, we'd have to time it to see how it compares with list.index.
> Did the Timbot do both?
Write them both? I don't know.
However, list.index is a simple linear search of the array of objects (I was
reading that code recently for unrelated reasons).
The dict hash collision resolution was almost certainly written by Tim - he
definitely wrote the comments at the top of the file trying to explain how the
heck it works. I have no real clue as to the performance of the lookup algorithm
in the degenerate case of a large number of distinct keys having the same hash
value, but reading the code suggests it would perform worse than the simple
integer increment that list.index uses (as the dict lookup requires multiple
if-tests and so forth on each iteration).
>>An identity dictionary (or set) solves the 'object classification' problem
>>perfectly.
>
> I missed the definition of those terms I'm afraid ;-)
It was in some other part of this rambling thread. It's easier to repeat the
idea than it is to go find where that was :)
An 'identity dictionary' is just a dict that looks up keys based on identity
rather than value. It ignores the keys' actual hash and comparison operations,
and forces use of the default hash for hashing, and identity comparison for
equality testing. It maps from key _objects_ to values, rather than from key
values to other values.
An 'identity set' is simply the same idea applied to a set instead of a
dictionary. Conceptually, it stores sets of arbitrary objects, rather than sets
of values.
The 'object classification' problem is simply the situation where you have an
existing collection of objects (e.g. graphs of a node), and you want to classify
them according to some criteria. Dictionaries or sets are a nice way to do that,
but the standard types are a pain if your original objects are mutable (this
inconvenience being the basis of Antoon's original complaint).
For a classification problem, though, we're likely only interested in what
categories each of the original objects falls into - so identity-based storage
will usually be entirely adequate (and, in some cases, actually better then
value-based storage, since we can avoid irrelevant comparisons of potentially
complex objects). And the performance should romp it in when compared to using
lists for the categorisation.
Hence why I suggested Antoon should consider pursuing collections.identity_dict
and collections.identity_set if identity-based lookup would actually address his
requirements. Providing these two data types seemed like a nice way to do an end
run around the bulk of the 'potentially variable hash' key problem.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net
More information about the Python-list
mailing list