Mutable objects which define __hash__ (was Re: Why are tuples immutable?)

Bengt Richter bokr at oz.net
Sat Dec 25 20:36:47 EST 2004


On Wed, 22 Dec 2004 22:57:01 +1000, Nick Coghlan <ncoghlan at iinet.net.au> wrote:

>OK, I think I need to recap a bit before starting my reply (for my own benefit, 
>even if nobody else's). (The rest of this post will also include a fair bit of 
>repeating things that have already been said elsewhere in the thread)
>
>====
>The actual rule dictionaries use when deciding whether or not to allow a key is 
>simply 'can I hash it?'. Lists don't provide a hash, and tuples only provide a 
>hash if every item they contain provides a hash.
I take it you are referring to regular python dictionaries, since the functionality
of a dictionary does not require hashing at all (as I think I demonstrated in the part
of my dict subclass which put unhashable keys and their values in another kind of
lookup mechanism (namely parallel lists of unhashable keys and their associated values).
More one this follows...
>
>For the dictionary to behave sanely, two keys which compare equal must also hash 
>equal. Otherwise, (key1 == key2) == (d[key1] is d[key2]) may not hold.
Hashing has nothing to do with it. Hashing is just one kind of optimization for implementing
what has to be done. Sane dictionaries depend on being able to take a lookup key object and
finding a unique dict key object that matches with some equality test. After finding the dict's
matching unique effective key object, we can get the object being looked up
(the "value" in d[key] = value), because the dict is representing the dict_key_object:value association
-- somehow.

"Somehow" _may_ be with a reference pair in a bucket found by a hashing mechanism, but it need not be.
Look again (or the first time ?;-) at what happens in my example dict for Antoon when you
use a mutable. The pairing there is maintained by key objects and associated value objects
being at the same index in parallel lists. The list.index method is used to find a unique
key (the first one found is thus by definition the effective unique key (even though other key
values can become equal through mutation, only one will be found equal to a lookup key's value,
so the _effective_ keys are unique. Hashing has nothing to do with it. Being able to find a
unique matching key object in the state of the dict is what provides the functionality. The dict's
effective key object must be comparable via __eq__ or __cmp__ or be coercible via __coerce__ to
something that can be compared one way or another, or the default equality test of
id(lookupkey)==id(dicteffectivekey) for newstyle user objects is applied (I guess ;-) by inheriting
object.__cmp__.

Anyway, what I implemented works for mutables as keys and solves the problem in the dict definition
instead of requiring that key objects be wrapped to conform to hashability requirements of the base
dict. But I did make a choice as to how to maintain a unique _effective_ set of keys for the dict state.
(i.e., prioritizing via list.__index__ when key representations were mutated. Note that any subset of keys
that was hashable was allowed to work as usual. Only the mutables were stored in the mutable key list.

>
>So the one absolute rule on hashes is that (x == y) must imply that (hash(x) == 
>hash(y)). In other words, the hash operation must not involve any object 
>properties that are not also involved in comparison for equality.
>
If you have decided to implement using hashes, yes, but that's not required. And as
Alex pointed out, returning an arbitrary integer constant all the time meets the above
requirement, since hash(x) == hash(y) will be true no matter what, so it will be true
for x == y) (albeit not really "implied" by the latter).

>The rule recommended for hashes is that a mutable object only define a hash if 
>both the comparison operation and the hash are based on immutable portions of 
>the object (such as it's ID). (The language reference currently recommends 
>something stricter - it suggests that you don't define hash at all if you define 
>a custom comparison method on a mutable object. That's overly strict, though)
>
>Now, and here's the key point: Python's builtin objects are written in 
>accordance with this recommendation, which means the mutable objects like dict 
>and list don't define __hash__ at all.
>
>Antoon Pardon wrote:
>> that it would be handy to use such an object as a key and he has some
>> assurance that those objects are stable while in use as a key, I see
>> nothing wrong with using such mutable objects as keys.
>
>Scarily enough, you're starting to persuade me that the idea isn't *completely* 
>insane. Maybe merely mostly insane ;)
>
>The interesting thing is that allowing mutable objects has keys doesn't require 
>*any* changes to dict. It only requires changes to the mutable objects (e.g. 
>having list adopt tuple's hash method).
>
>Anyway, what are the consequences of a type which has a 'variable hash'?
>
>The only real bad consequence is that the internal state of a dictionary can 
>break if someone alters the hash of one of its keys. The restriction to 
>'constant hashes' is aimed squarely at eliminating this class of programming 
>errors. It doesn't actually provide any performance gain.
Why not? It means you only have to compute the hash for any given object once,
and you can cache the hash value. That could be a biggie if you had megabyte trees
of nested tuples as keys and reused them frequently for lookup.
>
>A performance loss would only occur if the dictionary tried to be helpful in 
>detecting when a key mutates - and I see no reason why it should.
Depends on how you are really defining your dict's operation.
>
>>>If you are saying you are happy with "undefined behavior" as the outcome of mutating keys, ok,
>>>but if your implementation lets mutated-key errors pass silently, good luck debugging ;-)
>> 
>> But that problem is not limited to dictionaries. If you make a heapqueue
>> and mutate an element in it, such errors will pass silently too and 
>> will be just as hard to debug. Or suppose you create a class that
>> keeps a sorted list of a number of objects and the programmer accidently
>> mutates an object in it. That will probably pass silently too and
>> give you night mares debugging.
>
>The longer I consider it, the more this seems like a valid analogy. There is 
>nothing preventing dictionaries from having a rehash() method.
>
>Consider:
># Mutate some value in mylist
>mylist.sort()
>
># Mutate some key in mydict
>mydict.rehash()
>
>Rehash would work exactly as Antoon suggested - scan all the hash buckets, and 
>any key which hashes incorrectly gets moved to the right place. Forgetting to 
>rehash() would have consequences no worse than forgetting to sort().
I think, for one thing, it would be more efficient to notice that some keys were guaranteed not
to need rehashing... um, maybe by paritioning the key set into immutables and mutables ;-)

For another, it's not just a matter of "moving" keys that hash "incorrectly". If keys that
are a part of the state of the dict can mutate outside of dict methods, then keys can mutate
to equal each other, and "rehashing" will give equal hashes for previously unequal and distinct keys,
and you must choose which of the previous values is to be the value for the two keys that have now
become one by equality (however __eq__, __cmp__, __coerce__ and inheritance may define equality).

So, as I was trying to show (with somewhat acknowledged effort but scant effect, apparently ;-)
with my example implementation and discussion, you have to DYFR (to coin an acronym -- 
Define Your Fine Requirements ;-) and decide what to do when you find that a mutated
key has become equal to another key and they don't look up the same values.

You could decide that you want to test for non-uniqueness in the key set every time you look
something up, but which key1:value1 key2:value2 pair should survive when one or both key1 and key2
have mutated since the last dict use, and you can't tell what order it happened? I made a simple
choice in the way I implemented the example dict subclass for Antoon: The matching key for a mutable
lookup key is whatever key in the dict's mutable key list matches first via list.__index__.
This provides resolution of the ambiguities, but it _can be_ a little confusing to have d[[1,2]] look up
one thing one time and something else the next time, all because key mutables have been altered without
the dict "knowing." I was trying to encourange facing this effect with open eyes ;-)

(Re)hashing is a red herring, because it does not alter the key set. It just has to do with getting
from a lookup key to suitable comparison candidates in the effective key set (with which the so-called
values are ultimately associated). It is important for optimization purposes, but for the mutable
subset of keys, it could be useless or helpful, depending on how/when/whether keys could mutate, and
how you want to define what your effective unique key set is at lookup time. If hashing immediately
locates a matching key, you could define that to be _the_ effective key of that value, and just ignore
other potentially equal mutated keys, until a lookup hash got no match and then rehash everything and retry.
If key misses were very infrequent, that might be useful. You could cache hashes of mutables with particular
id's to avoid recomputing lookup key hashes if they got re-used a lot, as an optimization hint.
OTOH, if KeyError is frequent because of mutation, hashing may be next to useless overhead. IOW, DYFR ;-)

>
>> The problem is also that you really can't make new immutable objects
I don't understand that one. What do you mean by "new immutable object"?
 >>> a=(1,2)
 >>> b=(1,2)
 >>> a is b
 False
 >>> a==b
 True

>> AFAIK. I have once made a vector class that didn't have any mutating
>> method and I always used it as if it as immutable but because of
>> the "we are all consenting adults here" attitude of python I can't
>> assure that other people of my class won't directly change a vector.
>> 
>> Do you find I shouldn't use these vectors as keys?
>
>This is a fair point, actually:
>
>Py> from decimal import Decimal
>Py> x = Decimal()
>Py> hash(x)
>0
>Py> x._int = 1,
>Py> hash(x)
>1
>
>
>> But mutable objects is fraught with pitfall anyhow. Even mutating an
>> element in a list you are iterating over can give unexpected results
>> or just assigning a mutable to other names a few times and then mutating
>> it through one name and finding out an other name that was supposed to
>> stay stable, mutated too. Why else the warning against the use of
>> a default value of [] for an argument?
>
>Another fair point, although I'm not sure that pointing out that mutable values 
>already give you plenty of ways to shoot yourself in the foot is a point in 
>*favour* of your idea :)
>
>> A last idea, but this definetly is only usefull for debugging, is
>> having a dictionary that keeps a copy of the key, with the ability
>> to check the keys againt the copy, this might help in debugging.
>
>I have a different suggestion: an identity dictionary.
>
>It ignores __hash__, __cmp__ and __eq__, and instead uses id() and is.
>
That is so different that I don't know how it could be used to debug an app that
uses a mutable key dict. I think Antoon was just asking for a key-mutation detector.
I think that's reasonable goal. A deep copy of all keys defined would permit that,
at least when dict methods had control. But you couldn't find the code that mutated
the originals without something outside the dict to catch it in the act.

>Or even something like:
>
>override_dict(id, lambda x, y: x is y)(<normal dict contructor arguments go here>)
>
>> Personnaly I find the atmosphere in general here very defensive when
>> some kind of feature is questioned. There seems a definite "Love it
>> or leave it" attitude to be present.
I think, unfortunately, that emotional tones sometimes are either lost or (mis)acquired
in translation (even from English to English ;-). And what is plain talk in one language/culture
may feel near insulting or discourteously challenging of reputation and credentials in another.

>
>Generally depends on the feature, but yeah, I've seen that happen (and probably 
>been guilty of it, too)
>
>> I have usenet experience enough to know that newsgroups can vary wildly
>> and that you should evaluate each on its own merits. I do agree c.l.py
>> is very helpfull in case you need help in how to solve a particular
>> problem in the language. That niceness can disappear rather swiftly
>> if the pros and cons of python itself get discussed.
Again, I think the form of the "discussion" and the words chosen have something to
do with the outcome. E.g., the word "stupid" does not have much to do with "the pros
and cons of python itself." It might have an implication about the people who contribute
to the design ideas, or who have done the implementation work, or who has made the final
decision to include something in python, but it is not about "pros and cons."

There are other ways to say that you are not getting everything you'd like from python.
IOW, "stupid" is typically really a projection of one's own frustration onto some object
or person, and therefore works to inject emotion into an otherwise dispassionate discussion.

Therefore, e.g., "something is stupid" is not a good abbreviation for stating that something
_appears_ to _oneself_ to be inadequate for some purpose one imagines using it for -- unless one
intends the word "stupid" to stir things up a bit. In the latter case there is the problem of having
intentions misunderstood. I.e., is "stupid" a playful nip between puppies of the same litter,
or a territorial challenge between academic junk-yard dogs?

Clear communication about abstract issues is really difficult. Since we would probably not survive
if our animal instincts were unable to overwhelm our delusional misinterpretations and imaginings
about the world, these instincts sleep lightly when they sleep at all. So if we want to discuss
something that does not concern them, we are IMO well advised not to use words that make them prick up
their ears ;-)

>
>Usually because the ideas are ill-formed and poorly thought out. Unfortunately, 
>the attitude that spawns can bleed over into ideas which aren't without merit, 
>but aren't always expressed clearly.
>
IMO ill-formed and poorly thought out ideas can and often do serve a good purpose.
After all, that's the way most ideas start out ;-)

>> I'm currently not under the impression I'm able to. Sure I could
>> implement the above mentioned classes, but my feeling is that
>> should I present them here, nobody would be waiting for them.
Well, I thought I'd provide a prototype to prime the pump, and in the post I'm
replying to here it didn't even get mentioned. What do you think that does to _my_ ego?
Provides it yet another rich learning experience, that's what ;-)

>> Not because they would be unusfull, but because they go against
>> the current paradigm.
>
>As I see it, you have a couple of courses available:
>
>If you're into pain, consider further developing the 'rehash' idea, and the 
>option of adding hash methods to the mutable builtins. You've persuaded me that 
>the idea is not entirely unreasonable. However, I mention pain, because you'd 
>still have work to do persuading enough people (including me) that the 
>additional hard to detect bugs this adds to the language would be worth it.
I don't see why the language has to change to try these things out, unless
everyone thinks my prototype was completely off the mark. If it was, please DYFR ;-)
>
>I don't really recommend that option.
>
>A potentially more fruitful course is the dictionary variants identity_dict or 
>override_dict. The current dict() implementation forces the use of hash() for 
>sorting into hash buckets and "x == y" for key matching. An identity_dict 
>variant that uses id() (or object.__hash__ in Jython) and "x is y" would seem to 
>quite happily meet the use cases you have posted in this thread, without 
>silencing currently detected bugs.
IMO it's a matter of going back to what you really want your ideal dict to do,
which I presume is to maintain a set of unique defined keys, by whatever means,
and have arbitrary values associated with those unique keys, and be able to use an
arbitrary object to see if that object is equal in some sense to one of the defined
and unique (in the same equality sense) keys, and if so to retrieve the associated
value object, and raise KeyError or give some error indication if not. Then you can
see about optimizing an implementation to suit your intended usage patterns.
You just need to DYFR ;-)

>
>override_dict is probably just a case of compulsive overgeneralisation that can 
>be ignored for the time being :)
At least until requirements are defined ;-)

Regards,
Bengt Richter



More information about the Python-list mailing list