Sets in Python

Thu Sep 20 07:46:29 EDT 2007

On Thu, 20 Sep 2007 04:02:03 +0000, Karthik Gurusamy wrote:

> On Sep 19, 7:17 pm, Steven D'Aprano <st... at REMOVE-THIS-
> cybersource.com.au> wrote:
>> On Wed, 19 Sep 2007 20:58:03 +0000, Karthik Gurusamy wrote:
>> > While it's easy to explain the behavior, I think the decision to dis-
>> > allow mutable items as keys is a bit arbitrary. There is no need for
>> > dict to recompute hash
>>
>> What???
>>
>> Of course it does. How else can it look up the key? Because it
>> (somehow) just recognizes that it has seen the key before? How? By
>> magic?
> 
> You answered it yourself later. For a mapping service, hash is just one
> way to do things. What you need is for each item in the collection, a
> unique key.

And does the mapping get access to that unique key (the key's key)? It 
can't keep a mapping of object:key, because if it could do that, it 
wouldn't need the key, it could just keep object:payload.

Since the mapping can't know the key's key, it has to ask the key, and 
that's what the __hash__ method does.

> How you go from the key to the value is not something a programmer needs
> to know.

You are correct. All you need to know is that in Python, you can't use 
lists and sets as keys directly.

You only need to know about the details of the way dicts look up keys if 
you are writing your own class, and you aren't happy with Python's 
default hash for instance objects. It isn't a common task.

> Your mind is set on thinking on hash alone and hence you don't see
> beyond it.

No, these issues exist regardless of the implementation of the mapping. 
Whether you use a hash table or a binary tree of some sort, or a linear 
linked list, or a database, or folders in a filing cabinet.

The behaviour of mutable objects in a mapping is always problematic, 
regardless of the mapping implementation.

>> > (first of all, a user doesn't even need to know if underneath
>> > 'hashing' is used -- the service is just a mapping between one item
>> > to another item).
>>
>> The user doesn't need to know the mechanism, but the dict does. Dicts
>> are implemented as hash tables. I suppose they could be implemented as
>> something else (what? linear lists? some sort of tree?) but the same
>> considerations must be made:
> 
> Oh yes. If the keys are all integers (or any set of items that can be
> ordered), why not an avl. It has guaranteed O(log N) while a hash in
> worst case is O(N).

But both the best and average cases are O(1), which beats AVL trees by a 
lot. Since dicts are used for Python's internals, they are HIGHLY 
optimized and VERY fast. Their O(1) will beat the O(log N) of AVL trees 
easily.

Hash tables can also use keys even if they can't be ordered: {1+2j: None}

> Why you want to tie yourself to the drawbacks of one
> datastructure? Understand your goal is not to provide a hash; but to
> provide a mapping service.

No, the goal of a good language designer is to provide the fastest, most 
lightweight, simplest, most flexible mapping as a built-in. Any old slow, 
heavyweight, complicated, inflexible mapping will not do the job. That 
goal is best provided by a hash table.

If people want additional mappings as well, they can be added as 
libraries. If those libraries are very useful, they can be added to the 
standard library. If they are HUGELY useful, they will become built-ins.

AVL trees are not as flexible or fast as hash tables, and even if they 
were, you would *still* need to resolve the difficulties that occur if 
the keys mutate.

>> the dict must be able to find keys it has
>> seen before. How is the dict supposed to recognise the key if the key
>> has changed?
>>
>> > Since we know hashing is used, all that is needed is, a well-defined
>> > way to construct a hash out of a mutable. "Given a sequence, how to
>> > get a hash" is the problem.
>>
>> Nonsense. That's not the problem. The problem is how to get exactly the
>> same hash when the sequence has changed.
> 
> Yes, if you keep thinking hash is the only tool you got.

Fine, let me re-word it in terms of an AVL.

Suppose you have two lists in an AVL:

L1 = [1, 2, 3]
L2 = [4, 5, 6]
M = avl((L1, True), (L2, False))

The tree M (for mapping) has L1 at the top of the tree, and L2 as the 
right node.

But now, EVERY time ANY mutable object changes, Python has to check 
whether it is a key in EVERY avl, and if so, re-built the tree. Otherwise 
the tree can become corrupt because the AVL invariants are no longer true.

(Consider what happens if we say L1[0] = 999. Now L1 > L2. If you don't 
reorder the avl, M[L2] cannot be reached except by an exhaustive search 
of every node. That means it is no longer an AVL tree, just an inordered 
tree. Might as well save memory and use a linear linked list.)

Needless to say, this will slow down Python just a tad.

Look, you can go back to the archives of 1993 when this was first 
discussed. Nothing has changed since then. Mutable keys are still 
problematic, regardless of the implementation, and the simplest solution 
is to simply prohibit them.

http://www.python.org/search/hypermail/python-1993/0044.html

If you want to use a mutable object as a key, by object identity rather 
than value, then the easy way is to wrap it in an instance:

class Wrapper: # don't even need new-style classes
    pass # or even an __init__

key = Wrapper()
key.payload = [1, 2, 3]
D = {key: "value"}

But of course you can't look up the dict by value, only by identity. But 
that's what you wanted.

Another way is to use this class:

class HashableList(list):
    def __hash__(self):
        return hash(tuple(self))

Have fun, and remember us when you're debugging your code, because you'll 
be doing a lot of it.

-- 
Steven.