[Python-ideas] PEP: Dict addition and subtraction

Tue Mar 5 19:46:57 EST 2019

On Wed, Mar 6, 2019 at 12:08 AM Guido van Rossum <guido at python.org> wrote:

> On Tue, Mar 5, 2019 at 3:50 PM Josh Rosenberg <
> shadowranger+pythonideas at gmail.com> wrote:
>
>>
>> On Tue, Mar 5, 2019 at 11:16 PM Steven D'Aprano <steve at pearwood.info>
>> wrote:
>>
>>> On Sun, Mar 03, 2019 at 09:28:30PM -0500, James Lu wrote:
>>>
>>> > I propose that the + sign merge two python dictionaries such that if
>>> > there are conflicting keys, a KeyError is thrown.
>>>
>>> This proposal is for a simple, operator-based equivalent to
>>> dict.update() which returns a new dict. dict.update has existed since
>>> Python 1.5 (something like a quarter of a century!) and never grown a
>>> "unique keys" version.
>>>
>>> I don't recall even seeing a request for such a feature. If such a
>>> unique keys version is useful, I don't expect it will be useful often.
>>>
>>
>> I have one argument in favor of such a feature: It preserves
>> concatenation semantics. + means one of two things in all code I've ever
>> seen (Python or otherwise):
>>
>> 1. Numeric addition (including element-wise numeric addition as in
>> Counter and numpy arrays)
>> 2. Concatenation (where the result preserves all elements, in order,
>> including, among other guarantees, that len(seq1) + len(seq2) == len(seq1 +
>> seq2))
>>
>> dict addition that didn't reject non-unique keys wouldn't fit *either*
>> pattern; the main proposal (making it equivalent to left.copy(), followed
>> by .update(right)) would have the left hand side win on ordering, the
>> right hand side on values, and wouldn't preserve the length invariant of
>> concatenation. At least when repeated keys are rejected, most concatenation
>> invariants are preserved; order is all of the left elements followed by all
>> of the right, and no elements are lost.
>>
>
> I must by now have seen dozens of posts complaining about this aspect of
> the proposal. I think this is just making up rules (e.g. "+ never loses
> information") to deal with an aspect of the design where a *choice* must be
> made. This may reflect the Zen of Python's "In the face of ambiguity,
> refuse the temptation to guess." But really, that's a pretty silly rule
> (truly, they aren't all winners). Good interface design constantly makes
> choices in ambiguous situations, because the alternative is constantly
> asking, and that's just annoying.
>
> We have a plethora of examples (in fact, almost all alternatives
> considered) of situations related to dict merging where a choice is made
> between conflicting values for a key, and it's always the value further to
> the right that wins: from d[k] = v (which overrides the value when k is
> already in the dict) to d1.update(d2) (which lets the values in d2 win),
> including the much lauded {**d1, **d2} and even plain {'a': 1, 'a': 2} has
> a well-defined meaning where the latter value wins.
>

Yeah. And I'm fine with the behavior for update because the name itself is
descriptive; we're spelling out, in English, that we're update-ing the
thing it's called on, so it makes sense to have the thing we're sourcing
for updates take precedence.

Similarly, for dict literals (and by extension, unpacking), it's following
an existing Python convention which doesn't contradict anything else.
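
To make the "rightmost value wins" precedent concrete, here is a minimal
sketch of the four cases mentioned above. The dict names and values are
invented purely for illustration.

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 20, "c": 30}

# Item assignment overrides the value of a key that is already present.
assigned = dict(d1)
assigned["b"] = 20

# update() lets the values from its argument win.
updated = d1.copy()
updated.update(d2)

# Unpacking into a literal: the later mapping wins.
unpacked = {**d1, **d2}

# A literal with a repeated key keeps the last value.
literal = {"a": 1, "a": 2}

print(assigned)  # {'a': 1, 'b': 20}
print(updated)   # {'a': 1, 'b': 20, 'c': 30}
print(unpacked)  # {'a': 1, 'b': 20, 'c': 30}
print(literal)   # {'a': 2}
```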

Overloading + lacks the clear descriptive aspect of update that describes
the goal of the operation, and contradicts conventions (in Python and
elsewhere) about how + works (addition or concatenation, and a lot of
people don't even like it doing the latter, though I'm not that pedantic).

A couple of "rules" from C++ on overloading are "*Whenever the meaning of an
operator is not obviously clear and undisputed, it should not be
overloaded. Instead, provide a function with a well-chosen name.*" and
"*Always stick to the operator’s well-known semantics.*" (Source:
https://stackoverflow.com/a/4421708/364696 , though the principle is
restated in many other places). Obviously the C++ community isn't perfect
on this (see iostream and <</>> operators), but they're otherwise pretty
consistent. + means addition, and in many languages including C++ strings,
concatenation, but I don't know of any languages outside the "esoteric"
category that use it for things that are neither addition nor
concatenation. You've said you don't want the whole plethora of set-like
behaviors on dicts, but dicts are syntactically and semantically much more
like sets than sequences, and if you add + (with semantics differing from
both sets and sequences), the language becomes less consistent.
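
As a rough illustration of the mismatch (example values are made up, and
the copy-then-update merge below is only a stand-in for the proposed +,
which does not exist on dicts as of this writing):

```python
# Concatenation preserves every element, so lengths add up.
seq1, seq2 = [1, 2], [2, 3]
concat_lengths_add_up = len(seq1 + seq2) == len(seq1) + len(seq2)

# Sets, the closest relative of dicts, spell union as | and reject +.
try:
    {1, 2} + {2, 3}
    sets_support_plus = True
except TypeError:
    sets_support_plus = False

# Stand-in for the proposed d1 + d2: copy the left, update with the right.
d1 = {"a": 1, "b": 2}
d2 = {"b": 20, "c": 30}
merged = d1.copy()
merged.update(d2)

merge_lengths_add_up = len(merged) == len(d1) + len(d2)

print(concat_lengths_add_up)  # True
print(sets_support_plus)      # False
print(merged)                 # {'a': 1, 'b': 20, 'c': 30}
print(merge_lengths_add_up)   # False: 3 != 2 + 2
```

Note that "b" keeps its position from the left operand but takes its value
from the right one, which is the ordering/value split described earlier.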

I'm not against making it easier to merge dictionaries. But people seem to
be arguing that {**d1, **d2} is bad because of magic punctuation that
obscures meaning, when IMO:

     d3 = d1 + d2

is obscuring meaning by adding yet a third rule for what + means,
inconsistent with both existing rules (from both Python and the majority of
languages I've had cause to use). A named method (class or instance) or
top-level function (a la sorted) is more explicit and easier to look up (after
all, the major complaint about ** syntax is the difficulty of finding the
documentation on it). It's also easier to make it do the right thing; d1 +
d2 + d3 + ... dN is inefficient (makes many unnecessary temporaries),
{**d1, **d2, **d3, ..., **dN} is efficient but obscure (and not subclass
friendly), but a varargs method like dict.combine(d1, d2, d3, ..., dN) (or
merge, or whatever; I'm not trying to bikeshed) is correct, efficient, and
most importantly, easy to look up documentation for.
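
A minimal sketch of what such a varargs function could look like. The name
combine and its signature are hypothetical (nothing like it exists on dict
today); this only shows that a single result dict can be built without
intermediate temporaries.

```python
def combine(*mappings):
    """Hypothetical helper: merge any number of mappings into one new dict.

    Later mappings win on conflicting keys, matching dict.update().
    Only one result dict is allocated, however many inputs are given.
    """
    result = {}
    for mapping in mappings:
        result.update(mapping)
    return result


d1 = {"a": 1}
d2 = {"a": 2, "b": 3}
d3 = {"c": 4}

print(combine(d1, d2, d3))  # {'a': 2, 'b': 3, 'c': 4}
print(combine())            # {}
```

A subclass-friendly variant could be written as a classmethod that starts
from cls() instead of {}, but that is a design choice beyond this sketch.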

I occasionally find it frustrating that concatenation exists given the
wealth of Schlemiel the Painter's algorithms it encourages, and the
"correct" solution for combining sequences (itertools.chain for general
cases, str.join/bytes.join for special cases) being less obvious means my
students invariably use the "wrong" tool out of convenience (and it's not
really wrong in 90% of code where the lengths are always short, but then
they use it where lengths are often huge and suffer for it). If we're going
to make dict merging more convenient, I'd prefer we make the obvious,
convenient solution also the one that doesn't encourage non-scalable
anti-patterns.
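
For anyone unfamiliar with the pattern, a small sketch of the difference
(the data is invented; with inputs this short either approach is fine, the
cost only shows up when the accumulated result gets large):

```python
from itertools import chain

chunks = [[i, i + 1] for i in range(0, 10, 2)]

# Schlemiel the Painter: every + builds a brand-new list and re-copies
# everything accumulated so far, so total work grows quadratically.
slow = []
for chunk in chunks:
    slow = slow + chunk

# chain.from_iterable walks each chunk once, so total work is linear.
fast = list(chain.from_iterable(chunks))

# The string equivalent: join instead of repeated + in a loop.
words = ["a", "b", "c"]
joined = "".join(words)

print(slow == fast)  # True
print(joined)        # abc
```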

> As to why raising is worse: First, none of the other situations I listed
> above raises for conflicts. Second, there's the experience of str+unicode
> in Python 2, which raises if the str argument contains any non-ASCII bytes.
> In fact, we disliked it so much that we changed the language incompatibly
> to deal with it.
>

Agreed, I don't like raising. It's consistent with + (the only argument in
favor of it really), but it's a bad idea, for all the reasons you mention.

- Josh Rosenberg