Sets in Python

Steve Holden steve at holdenweb.com
Thu Sep 20 05:59:06 EDT 2007


Karthik Gurusamy wrote:
> On Sep 19, 7:17 pm, Steven D'Aprano <st... at REMOVE-THIS-
> cybersource.com.au> wrote:
>> On Wed, 19 Sep 2007 20:58:03 +0000, Karthik Gurusamy wrote:
>>> While it's easy to explain the behavior, I think the decision to dis-
>>> allow mutable items as keys is a bit arbitrary. There is no need for
>>> dict to recompute hash
>> What???
>>
>> Of course it does. How else can it look up the key? Because it (somehow)
>> just recognizes that it has seen the key before? How? By magic?
> 
> You answered it yourself later. For a mapping service, hash is just
> one way to do things. What you need is for each item in the
> collection, a unique key.
> How you go from the key to the value is not something a programmer
> needs to know.
> Your mind is set on thinking on hash alone and hence you don't see
> beyond it.
> 
There's a reason for that: the developers (and particularly Tim Peters) 
have, to use a phrase Tim is fond of "optimized the snot" out of dict, 
and the mechanism is fundamental to much of Python's internals. So don't 
expect to be able to tamper wiht it without adversely affectinf performance.

>>> (first of all, a user doesn't even need to know
>>> if underneath 'hashing' is used -- the service is just a mapping between
>>> one item to another item).
>> The user doesn't need to know the mechanism, but the dict does. Dicts are
>> implemented as hash tables. I suppose they could be implemented as
>> something else (what? linear lists? some sort of tree?) but the same
>> considerations must be made:
> 
> Oh yes. If the keys are all integers (or any set of items that can be
> ordered), why not an avl. It has guaranteed O(log N) while a hash in
> worst case is O(N). Why you want to tie yourself to the drawbacks of
> one datastructure? Understand your goal is not to provide a hash; but
> to provide a mapping service.
> 
Possibly so, but the goals also include "excellent speed" and "as wide a 
set of keys as is (practically) possible".

How would you suggest Python avoids "[tying itself] to the drawbacks of 
one data structure"? Implement different strategies according to the key 
values used? That'll surely speed things up, not.

Python provides you with plenty of tools to implement optimized data 
structures for your own applications. Arguing for making them language 
primitives merely reveals your design inexperience.
> 
>  the dict must be able to find keys it has
>> seen before. How is the dict supposed to recognise the key if the key has
>> changed?
>>
>>> Since we know hashing is used, all that is needed is, a well-defined way
>>> to construct a hash out of a mutable. "Given a sequence, how to get a
>>> hash" is the problem.
>> Nonsense. That's not the problem. The problem is how to get exactly the
>> same hash when the sequence has changed.
> 
> Yes, if you keep thinking hash is the only tool you got.
> 
>> In other words, if you have two mutable objects M1 and M2, then you
>> expect:
>>
> 
> No. I don't expect. I expect the hash to be different. Why do you keep
> thinking it's the mappings responsibility to take care of a changing
> key.
> 
Because mappings must do precisely that. Do you actually know what a 
mapping *is*?

>> hash(M1) == hash(M2) if and only if M1 and M2 are equal
>> hash(M1) != hash(M2) if M1 and M2 are unequal
>>
>> but also:
>>
>> if M1 mutates to become equal to M2, hash(M1) must remain the same while
>> still being different from hash(M2).
>>
>> That means that hash() now is a non-deterministic function. hash([1,2,3])
>> will vary according to how the list [1,2,3] was constructed!
>>
>> Obviously something has to give, because not all of these things are
>> mutually compatible.
>>
>>> If later the given sequence is different, that's
>>> not the dict's problem.
>> Data structures don't have problems. Programmers do. And language
>> designers with sense build languages that minimize the programmers
>> problems, not maximize them.
> 
> Yes, here you talk about a different goal altogether. Here comes the
> 'arbitrary' part I mentioned.
> 
>>> So if the list changes, it will result in a different hash and we will
>>> get a hash-miss. I doubt this is in anyway less intuitive than dis-
>>> allowing mutable items as keys.
>> The choices for the language designer are:
>>
>> (1) Invent some sort of magical non-deterministic hash function which
>> always does the Right Thing.
> 
> Nope, just say if the new sequence is different, you don't find the
> item in the dict.
> 
>> (2) Allow the hash of mutable objects to change, which means you can use
>> mutable objects as keys in dicts but if you change them, you can no
>> longer find them in the dict. They'll still be there, using up memory,
>> but you can't get to them.
> 
> In the new model, at the time of addition, you need to remember the
> key at that time. If it's a list, you make a copy of the items.
> 
Right. Simple. And completely impractical.

>> (3) Simply disallow mutable objects as keys.
>>
>> Alternative 1 is impossible, as we've seen, because the requirements for
>> the Right Thing are not mutually compatible.
>>
>> Alternative (2) leads to hard-to-find, hard-to-diagnose bugs where you
>> store objects in a dict only for them to mysteriously disappear later.
>> Worse, it could lead to bugs like the following hypothetical:
> 
> Of course they can be reached with.. for k in dict...

But that hardly provides the required mapping features. What on earth 
are you thinking?.

>>>>> M = [1, 2, 3]
>>>>> D = {M: 'parrot'} # pretend this works
>>>>> D
>> {[1, 2, 3]: 'parrot'}>>> M.append(4)
>>>>> D
>> {[1, 2, 3, 4]: 'parrot'}>>> D[[1, 2, 3, 4]]
> 
> No, in the new way, the key still remains [1, 2, 3]
> What was changed is M. Not the  key given to dict at the time of
> addition.
> Again I'm not describing today's behavior; it's in the new way.
> 
I doubt this new way, whatever it is, is going to have many disciples.

>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> KeyError: [1, 2, 3, 4]
>>
>> Try explaining that one to programmers: they can SEE the key in the dict
>> when they print it, but they can't get it or delete it because the hash
>> has changed.
> 
> No they don't. They see the key at the time of addition ([1,2,3])
>> Alternative 3 is easy to deal with: simply don't use mutable objects as
>> keys. That's what Python does. Sure, the programmer sometimes needs to
>> work around the lack (convert the list into a tuple, or a string, or
>> pickle it, whatever...) which on rare occasions is hard to do, but at
>> least there are no mysterious, hard to track down bugs.
> 
> When I first saw key's must'be be mutable, it did appear to me to be
> mysterious. There was unnecessary tighter coupling between
> implementation details and the service exposed to the programmer. (As
> I see it, the root cause of all this is, the dict does not make a copy
> of the key at the time of item addition, it just makes a new reference
> to the same object)
> 
Light dawns. Dicts are optimized for PERFORMANCE! And for very good reasons.

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden

Sorry, the dog ate my .sigline




More information about the Python-list mailing list