[Python-3000] set literals

Tue Jul 11 01:21:45 CEST 2006

On 7/10/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
> Guido van Rossum wrote:
>
> >I've also sometimes thought of unifying dicts and sets by implementing
> >set operations on the keys of dicts only. When the values aren't the
> >same, we could make an arbitrary decision e.g. the left operand wins.
> >You get quite far. E.g.
> >
> >a = {1: "a", 2: "b"}
> >b = {1: "c", 3: "d"}
> >
> ># These already work:
> >assert 1 in a
> >assert 1 in b
> >assert 3 not in a
> ># etc.
> >
> ># These would be new (the equivalent augmented assignments would work too):
> >assert a|b == {1: "a", 2: "b", 3: "c"}

(That should be 3: "d" BTW.)

> >assert a&b == {1: "a"}
> >assert a^b == {2: "b", 3: "d"}
> >assert a-b == {2: "b"}
> >assert b-a == {3: "d"}
> >
> ># We could use these equivalencies:
> >assert {1} == {1: None} == set({1: "a"})   # I.e. canonical sets have
> >all None values
> >
> ># And of course it would solve the empty set notation problem nicely:
> >assert dict() == {} == set()
> >
> >Unfortunately we couldn't redefine <=, <, >=, > to mean various
> >subset/superset tests without backwards incompatibilities (but does
> >anyone care?), and == and != would of course take the values into
> >account which would occasionally sting. Also, sets would grow some
> >operations that don't make a lot of sense (e.g. __getitem__, get,
> >setdefault) but that's minor once you know the same implementation is
> >used.
> >
> >I still expect there's a fatal flaw in the scheme, but I can't think
> >of it right now...
>
> I don't think there is a fatal flaw in terms of implementation -- a
> little modification of sets.py would demonstrate that with certainly.
>
> There would be some implementation consequences in terms of speed and
> memory usage (we would lose the memory savings and some of the speed-ups
> gained in the Py2.5 implementation of sets).  The storage size would
> grow 50%.  All set building operations would incur the overhead of
> copying in values as well as keys.

All of those could probably be eliminated through sheer willpower.
E.g. a flag "are any values non-None" would go a long way. Of course,
the 50% growth is only in terms of memory for the dict itself; if we
take the space for the keys into account the percentage would be much
smaller.

> The biggest issues are on the conceptual side -- do we think about set
> operations and mapping operations in the same way or in different ways?
> Guido listed the basic conflicts 1) arbitrary decisions as to which
> values to keep, 2) unexpected stings from != and == taking values into
> account, 3) operations that don't make sense  (__setitem__, setdefault,
> etc).
>
> The deeper problem is that decisions on which values to keep will
> inevitability break set invariants (see a long list of these in
> test.test_set.py):
>
>    assert a|b == b|a, 'commutative property'
>    assert (a-b) | (a&b) | (b-a) == a|b, 'whole is the sum of the parts'

Yeah, I think when I was cosidering this earlier (it's been a shower
thought for quite a while on and off) I considered those fatal flaws.
Today, I think the invariants would simply change to

  assert set(a|b) == set(b|a) == set(a)|set(b) == set(b)|set(a)

with set() being an alias for dict.fromkeys(). (I wonder if we need a
similar thing that doesn't make a copy if the input is already a set,
i.e. has all-None values.)

> The notion of sets carries so much mathematical significance that the
> loss of expected invariants would be a recurring surprise.

But only when mixing sets with non-sets.

> On the positive side, I've occasionally wanted some set style operations
> for dictionaries (i.e. give me all the entries in d1 that aren't in d2,
> etc.)

Right. Plus of course there's a long history of pre-2.3 code that
(ab)uses dicts as sets. (We could get rid of the "ab". :-)

> The LUA programming language demonstrated that even unifying lists and
> dicts is possible and that people can get used to just about anything.

And ABC has lists and tables that were quite different from Python's
lists and dicts...

> The question boils down to what is best for the user in terms of
> learnability, clarity, reliability, and performance.

Where the unified data type doesn't necessarily score poorly -- one
less type means less to learn, less code, and fewer choices to make.

I'm still very much undecided but I don't want to rule this out for
py3k. Perhaps I'll write up a PEP and see how it flies.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)