Mutable objects which define __hash__ (was Re: Why are tuples immutable?)

Bengt Richter bokr at oz.net
Tue Dec 28 17:06:34 EST 2004


On Sun, 26 Dec 2004 12:17:42 +1000, Nick Coghlan <ncoghlan at iinet.net.au> wrote:

>Bengt Richter wrote:
>> I take it you are referring to regular python dictionaries,
>
>Correct. So I'll skip over the sections talking about alternate lookup 
>strategies in my reply.
Cool.
>
>>>Anyway, what are the consequences of a type which has a 'variable hash'?
>>>
>>>The only real bad consequence is that the internal state of a dictionary can 
>>>break if someone alters the hash of one of its keys. The restriction to 
>>>'constant hashes' is aimed squarely at eliminating this class of programming 
>>>errors. It doesn't actually provide any performance gain.
>> 
>> Why not? It means you only have to compute the hash for any given object once,
>> and you can cache the hash value. That could be a biggie if you had megabyte trees
>> of nested tuples as keys and reused them frequently for lookup.
>
>Allowing mutation doesn't mean abandoning your caching - it just means you need 
>a way to tell the dict to update its key cache (i.e. rehash())
A key gets hashed at more than one point. The first is when the key:value association
is entered into the dict's state. (It pays to save the hash value in the dict along with references
to the key and value objects, for several reasons, which I'll get to [1].)

A second time a key may be hashed is when it is used as a lookup key. This can be a reference to
the identical key object first used, or it can be a new object. A new object has to be hashed in order
to have a hash value to use in finding candidate keys to compare to. If _this_ object
is used as a key again, it must be hashed again -- unless it is an object that caches its own
hash and invalidates it when mutated so it can rehash itself if its __hash__ method is subsequently
called.

Thus, reusing a lookup key other than the original provides an opportunity for a performance
gain in the lookup, if the key object caches its hash value for return by __hash__.
That's what I meant by "why not?" in response to your assertion that the constant-hash restriction
doesn't (OK, I took that as a "can't" ;-) provide any performance gain.
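For illustration, here's a hypothetical sketch (in modern Python, not the thread's Python 2; the class and method names are my own) of a key object that caches its hash and invalidates the cache on mutation:

```python
class CachingKey:
    """A mutable key that caches its own hash until mutated."""
    def __init__(self, items):
        self._items = list(items)
        self._hash = None              # cached hash; None means "stale"

    def mutate(self, index, value):
        self._items[index] = value
        self._hash = None              # invalidate the cache on mutation

    def __hash__(self):
        if self._hash is None:         # recompute only when stale
            self._hash = hash(tuple(self._items))
        return self._hash

    def __eq__(self, other):
        return isinstance(other, CachingKey) and self._items == other._items

k = CachingKey([1, 2, 3])
d = {k: 'value'}
print(d[k])                            # the cached hash is reused here
```

Of course, mutating k while it sits in d is exactly the hazard under discussion; the cache only helps repeated lookups between mutations.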

[1] IME a hash is only necessary to optimize operation of a dict. The primary purpose is to limit
the number of comparisons necessary to determine whether a key exists in the dict, and to find the
matching one. First the hash value is converted to a bucket selection. This will involve some
function of the full hash value, reducing it to a value that can access buckets. This function can
potentially change as the dict grows. Once the bucket is selected, a number of keys may be found there.
Since the job is to find equality, the hash value can serve again, using all its bits, to select a subset
of the keys in the bucket that _may_ be equivalent. This is an optimization to avoid comparing
key objects with differing hashes -- which is fine for immutable keys, which can't be the same
if their hashes differ. IOW, immutable key candidates may be rejected immediately if their hash values
aren't equal to the lookup key's hash value.

The other side of this coin is that keys whose hash values _are_ equal to the lookup key's hash value
must all be compared in some order (probably whatever order is convenient in working through a bucket)
until an equal key is found, or not.
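As a rough sketch of that lookup order -- hash, then bucket, then a full-hash prefilter, then equality -- here's a toy chaining dict in modern Python (an illustration of the idea, not CPython's actual implementation):

```python
class ToyDict:
    """Toy bucket dict: full-hash check first, equality only on hash match."""
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def __setitem__(self, key, value):
        h = hash(key)
        bucket = self.buckets[h % len(self.buckets)]
        for i, (h2, k, _) in enumerate(bucket):
            if h2 == h and k == key:   # replace the existing equal key's value
                bucket[i] = (h2, k, value)
                return
        bucket.append((h, key, value)) # store the hash alongside key and value

    def __getitem__(self, key):
        h = hash(key)
        bucket = self.buckets[h % len(self.buckets)]
        for h2, k, v in bucket:
            # cheap full-hash comparison first; == only for hash matches
            if h2 == h and k == key:
                return v
        raise KeyError(key)

d = ToyDict()
d['a'] = 1
d[('x', 'y')] = 2
print(d['a'], d[('x', 'y')])
```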

This means that if we give all mutable keys the same hash value, they will all be in the same bucket,
and that bucket will be searched in some order for an equal key, and the first one found to be equal
will be what I called the effective unique key. With a constant hash value -- e.g., zero -- there
is no reason to re-hash: the hashes haven't changed ;-) There may be a reason to eliminate shadowed keys
with duplicate equal values, but it's not necessary for consistent operation.
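A quick sketch of that constant-hash idea in modern Python (the class name is mine): every instance hashes to 0, so they all land in one bucket and lookup degenerates to a linear equality scan -- still correct, just slower.

```python
class ZeroHash:
    """Every instance hashes to 0, so all share one bucket."""
    def __init__(self, data):
        self.data = data
    def __hash__(self):
        return 0                       # constant hash for all instances
    def __eq__(self, other):
        return isinstance(other, ZeroHash) and self.data == other.data

d = {ZeroHash([1]): 'one', ZeroHash([2]): 'two'}
print(d[ZeroHash([2])])                # found by equality scan, not by hash
```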

Based on this, I made a wrapper object for mutable keys, which returns a constant zero as the hash value,
and provides a __cmp__ of the wrapped object for the dict's key equality testing. This means if W is the
wrapper class, you can use W(mutable) as a key in an ordinary dict, and they all go to the same bucket
as 0 and 0.0 and whatever else hashes to 0.

I decided to modify my dict subclass to wrap unhashable key objects automatically, with very similar effect
to the original, except that it eliminates the parallel lists for the mutable keys and their values. The only
difference is that I knew the search order of list.index, but I don't know the search order within bucket 0,
so shadowing and the elimination of duplicate keys by re-instantiation will sometimes eliminate different
duplicates than before.

>
>> I think, for one thing, it would be more efficient to notice that some keys were guaranteed not
>> to need rehashing... um, maybe by partitioning the key set into immutables and mutables ;-)
>
>True, since otherwise rehash() would have to check every key, even the immutable 
>ones. However, that only affects the time required for a rehash(), rather than 
>general dictionary operation.
Given that you assume you know when "rehashing" is going to be necessary -- meaning you
know when keys mutate.

>
>> For another, it's not just a matter of "moving" keys that hash "incorrectly". If keys that
>> are a part of the state of the dict can mutate outside of dict methods, then keys can mutate
>> to equal each other, and "rehashing" will give equal hashes for previously unequal and distinct keys,
>> and you must choose which of the previous values is to be the value for the two keys that have now
>> become one by equality (however __eq__, __cmp__, __coerce__ and inheritance may define equality).
>
>Or resist the temptation to guess, and have rehash() raise an exception :)
Easy out ;-)

>
>> OTOH, if KeyError is frequent because of mutation, hashing may be next to useless overhead. IOW, DYFR ;-)
>
>Indeed. I defined my FR based on Antoon's analogy of the sorted list - have the 
>dict behave like sorted list, so that if you screw the invariant, it's the 
>programmer's job to tell the dictionary to fix it.
>
And how ;-)

>>>>The problem is also that you really can't make new immutable objects
>> 
>> I don't understand that one. What do you mean by "new immutable object"?
>
>Bad terminology on Antoon's part, I think - I believe he meant "new immutable type".
>
>I gave the Python 2.4's Decimal as an example of something which is 
>theoretically immutable, but doesn't actually enforce that immutability properly.
>
>Overriding __setattr__ can give true immutability. It just makes the class a 
>serious pain to write :)
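For illustration, a minimal modern-Python sketch of that __setattr__ approach (the class name is my own):

```python
class Frozen:
    """Attributes can be set during construction, never afterwards."""
    def __init__(self, value):
        # bypass our own guard while constructing the instance
        object.__setattr__(self, 'value', value)
    def __setattr__(self, name, val):
        raise AttributeError('Frozen instances are immutable')

f = Frozen(42)
print(f.value)                         # -> 42
```

(The pain Nick mentions: every attribute assignment anywhere in the class, including __init__, must go through object.__setattr__.)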
>
>> That is so different that I don't know how it could be used to debug an app that
>> uses a mutable key dict. I think Antoon was just asking for a key-mutation detector.
>> I think that's reasonable goal. A deep copy of all keys defined would permit that,
>> at least when dict methods had control. But you couldn't find the code that mutated
>> the originals without something outside the dict to catch it in the act.
>
>The identity keyed dict was based on Antoon's posted use case: he wanted to 
>classify mutable objects, and then check for membership in those groups.
I thought he just wanted to use mutables as if they were immutable, so that
equal mutable keys would look up the same value in a dict.

>
>You can use lists for this, but the lookup is slow. Conventional dictionaries 
>and sets give a fast lookup, but require that the items in use be hashable.
>
A constant hash works. But since it clogs a single bucket and gives no hash-based clue
of a mismatch, we'd have to time it to see how it compares with list.index.
Did the Timbot do both?
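Here's a hedged sketch of that comparison in modern Python (all the names are mine): both approaches do a linear equality scan, so only measurement would settle which is faster; this just shows they are functionally equivalent.

```python
import timeit

class W0:
    """Constant-hash wrapper, in the spirit of the W class below."""
    def __init__(self, obj):
        self.obj = obj
    def __hash__(self):
        return 0                       # every wrapper shares one bucket
    def __eq__(self, other):
        return isinstance(other, W0) and self.obj == other.obj

keys = [[i] for i in range(100)]       # 100 distinct mutable (list) keys
values = [i * 10 for i in range(100)]

wrapped = {W0(k): v for k, v in zip(keys, values)}

def by_index(key):                     # parallel-list approach: list.index
    return values[keys.index(key)]

def by_dict(key):                      # constant-hash dict approach
    return wrapped[W0(key)]

print(timeit.timeit(lambda: by_index([50]), number=1000))
print(timeit.timeit(lambda: by_dict([50]), number=1000))
```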

>An identity dictionary (or set) solves the 'object classification' problem 
>perfectly.
I missed the definition of those terms I'm afraid ;-)

>
>> You just need to DYFR ;-)
>
>I really do like this acronym :)
>
Me too. Should come in handy ;-)

BTW, here's a mutable key wrapper and the modified dict subclass that uses it:

 >>> class W(object):
 ...     """wrap an arbitrary object to make it usable as dict key"""
 ...     def __init__(self, obj): self.obj = obj
 ...     def __hash__(self):
 ...         try: return hash(self.obj)
 ...         except TypeError: return 0 # => same bucket for non-hashables
 ...     def __cmp__(self, other): return cmp(self.obj, other)
 ...     def __repr__(self): return 'W(%s)' % repr(self.obj)
 ...
 >>> class MKDW(dict):
 ...     """mutable-key dict with automatic wrapping of mutable keys"""
 ...     def __init__(self, arg=None, **kw):
 ...         if not arg and not kw : return
 ...         if arg is None: arg = []
 ...         if isinstance(arg, dict):
 ...             arg = arg.items()
 ...         arg = list(arg)
 ...         arg = arg + kw.items()
 ...         for k,v in arg:
 ...             try: self[k] = v
 ...             except TypeError: self[W(k)] = v
 ...     def __setitem__(self, key, value):
 ...         try: dict.__setitem__(self, key, value)
 ...         except TypeError: dict.__setitem__(self, W(key), value)
 ...     def __getitem__(self, key):
 ...         try: return dict.__getitem__(self, key)
 ...         except TypeError: return dict.__getitem__(self, W(key))
 ...     def __delitem__(self, key):
 ...         try: dict.__delitem__(self, key)
 ...         except (KeyError, TypeError): dict.__delitem__(self, W(key))
 ...     def __repr__(self):
 ...         return '<MKDW %s>' % dict.__repr__(self)
 ...
 >>> d = MKDW()
 >>> d
 <MKDW {}>
 >>> k1, k2, k3 = [1], [2], [3]
 >>> d[k1]=1
 >>> d
 <MKDW {W([1]): 1}>
 >>> d2 = MKDW((k,i+1) for i,k in enumerate([k1,k2,k3]))
 >>> d2
 <MKDW {W([1]): 1, W([2]): 2, W([3]): 3}>
 >>> d2 = MKDW(((k,i+1) for i,k in enumerate([k1,k2,k3])), a=1, b=2)
 >>> d2
 <MKDW {W([1]): 1, W([2]): 2, 'b': 2, 'a': 1, W([3]): 3}>
 >>> d2[[1]]
 1
 >>> d2[[2]]
 2
 >>> d2[[3]]
 3

 >>> k2[0]=3
Note how the modified k2 key shows up
 >>> d2
 <MKDW {W([1]): 1, W([3]): 2, 'b': 2, 'a': 1, W([3]): 3}>

It now shadows the k3 value
 >>> d2[k3]
 2

We can do this to "rehash" and clear out duplicates:
 >>> MKDW(d2)
 <MKDW {W([1]): 1, W([3]): 3, 'b': 2, 'a': 1}>
but the effect depends on how dict implements buckets.

Note that we can wrap a hashable too:
 >>> d3 = MKDW()
 >>> d3[W('a')] = 'eigh?'
 >>> d3
 <MKDW {W('a'): 'eigh?'}>

It winds up in the same hash bucket as the unwrapped object, so
key compares in that bucket find it:
 >>> d3['a']
 'eigh?'
 >>> d3['a'] = 'new a value'

Note that assignment via an equal-to-current key doesn't change
the dict's reference to the original key:
 >>> d3
 <MKDW {W('a'): 'new a value'}>

You could add extra parameters to W.__init__ and change __hash__
and __cmp__ to do what you wanted to do with graph nodes, I think,
even though I don't know quite what that was ;-)
 
Regards,
Bengt Richter


