[Python-Dev] Proposal: dict.with_values(iterable)

Steve Dower steve.dower at python.org
Tue Apr 23 11:33:07 EDT 2019


On 23Apr2019 0008, Inada Naoki wrote:
> On Tue, Apr 23, 2019 at 2:54 PM Steve Dower <steve.dower at python.org> wrote:
>>
>> On 22Apr2019 2143, Inada Naoki wrote:
>>> On Tue, Apr 23, 2019 at 11:30 AM Steve Dower <steve.dower at python.org> wrote:
>>>>
>>>> Or possibly just "dict(existing_dict).update(new_items)".
>>>>
>>>
>>> Do you mean .update accepts values tuple?
>>> I can't think it's
>>
>> Not sure what you were going to go on to say here, but why not?
> 
> Sorry, I sent mail without finishing.
> 
> dict.update() has too many overloads already.
> Adding a values_tuple form is impossible without breaking backward
> compatibility.
> 
> But I think you're talking about an items_sequence, not values.

Right. I'm specifically trying to avoid changing public APIs at all 
(including adding anything new, if possible) by identifying suitable 
patterns that we can handle specially to provide a transparent speed 
improvement.

>> If it's a key-sharing dict, then all the keys are strings. We know that
>> when we go to do the update, so we can intern all the strings (going to
>> do that anyway) and then it's a quick check if it already exists. If
>> it's a regular dict, then we calculate hashes as normal. Updating the
>> value is just a decref, incref and assignment.
> 
> There are some problems.
> 
> 1. Searching a hash table is not zero-cost compared to appending to a sequence.
>     This cost is very close to the cost of building a new hash table.

If we know that you're sharing keys with the new items, then we can skip 
the search. This was my point about the d2 = copy(d1); d2.update(zip(d2, 
values)) idea:

def update(self, items):
    if isinstance(items, ZipObject):  # whatever the type is called
        if are_sharing_keys(self, items.sequence_1):
            # fast update from iter(items.sequence_2)
            return
    # regular update from iter(items)
    ...
Totally transparent, and it encourages composition of existing builtins. 
It's a bit of a trick and may not be as obvious as a new method, but it's 
backwards compatible at least as far as ordered dicts (which is a 
requirement of any of these approaches anyway, yes?).
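
To be clear about what users would write, here is the pattern in plain 
Python (a minimal sketch - it already works today, and the hypothetical 
fast path would only make the update cheaper when the keys are shared):

d1 = {"id": None, "name": None}
values = (1, "Guido")

d2 = dict(d1)               # copy; ideally shares the key table
d2.update(zip(d2, values))  # zip pairs d2's keys with the new values
assert d2 == {"id": 1, "name": "Guido"}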

> 2. In my proposal, the main users are csv.DictReader and sql.DictCursor.
>     They parse only the values on each row, so they need to map those
>     values onto existing keys.

In that case, use a private helper. The csv module already has a native 
accelerator module (_csv). We don't need to add new public APIs for 
internal optimisations, provided there is a semantically equivalent way 
to do it without the internal API.
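
For illustration only (a simplified sketch of the pattern, not the 
actual _csv internals), the per-row work that would benefit looks like:

import csv
import io

data = io.StringIO("id,name\n1,Guido\n2,Inada\n")
reader = csv.reader(data)
fieldnames = next(reader)

# One dict per row, all sharing the same keys - exactly the case a
# private fast path could target.
rows = [dict(zip(fieldnames, row)) for row in reader]
assert rows[0] == {"id": "1", "name": "Guido"}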

> 3. (CPython only) dict.copy(), dict(dict), and dict.update() are general-purpose
>     methods.  There is no obvious place to start using a key-sharing dict.

See my reply to Glenn, but potentially fromkeys() could start with the 
key-sharing dict and then copy()/dict() could continue sharing it 
(hopefully they already do?).
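
Roughly this flow, hypothetically (whether the copy actually keeps 
sharing the key table is the "hopefully they already do?" part):

# fromkeys() could build the key table once...
template = dict.fromkeys(("id", "name", "email"))

# ...and each copy could keep sharing it instead of rebuilding it.
row = dict(template)
row.update(zip(row, (1, "Guido", "guido at python.org")))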

> That's why I proposed a specific method / function for this specific purpose.
> 
>>
>> Note that .update() would probably require a dict or key/value tuples
>> here - but if you have the keys in a tuple already then zip() is going
>> to be good enough for setting it (in fact, zip(existing_dict,
>> new_values) should be fine, and we can internally special-case that
>> scenario, too).
> 
> If *CPython* specialized dict(zip(dict, values)), it would still be a
> CPython implementation detail.
> Do you want to recommend using such a hacky CPython-only optimization?
> Should we use such an optimization in the stdlib, even if it will be slower
> than dict(zip(keys_tuple, values)) on some other Python implementations?

We do "hacky" optimisations everywhere :) The point of the runtime is to 
let users write code that works and we do the effort behind the scenes 
to make it efficient. We're not C - we're here to help our users.

The point is that it will work on other implementations - including 
previous versions of CPython - and those are free to optimise it however 
they like.

> Or do you propose making the dict(zip(dict, values)) optimization part
> of the language specification?

Definitely not! It's just a pattern that we have the ability to 
recognize and optimize at runtime, so why not do it?

> One obvious advantage of having DictBuilder is that it is for a specific
> purpose.  It has at least the same performance as dict(zip(keys, values))
> on all Python implementations.
> Libraries like the csv parser can use it without worrying about its
> performance on Pythons other than CPython.

A singular purpose isn't necessarily an obvious advantage. We're better 
off with generic building blocks that our users can compose in ways that 
were originally non-obvious (and then as patterns emerge we can look at 
ways to simplify or formalise them).

>> (Randomizing side note: is this scenario enough to make a case for a
>> built-in data frame type?)
> 
> https://xkcd.com/927/

Yep. The difference is that as the language team, our standard wins by 
default ;)

(For those who don't click links, it's pointing at the "let's make a new 
standard" XKCD comic)

>> * when d2.update() only sets existing keys, there's no need to rebuild
>> the table
>> * a duplicated key overwrites multiple times - what else are you going
>> to do?
> 
> But all keys still have to be looked up.  That is very similar in overhead
> to rebuilding the hash table.

See my suggestion above - when we know the keys are shared, we can skip 
the lookup, and there are ways we can detect that they're shared. 
(Perhaps it is also faster to start by assuming they are shared and test 
each one, rather than assuming they are unshared? That might be worth 
testing.)
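
Something like this would be a starting point for measuring the 
copy-and-update pattern against a full rebuild (an illustrative sketch 
only - I'm not claiming any numbers):

import timeit

keys = tuple("col%d" % i for i in range(20))
values = tuple(range(20))
template = dict.fromkeys(keys)

def rebuild():
    # hash and insert every key from scratch
    return dict(zip(keys, values))

def copy_and_update():
    # copy the template, then overwrite values for existing keys
    d = dict(template)
    d.update(zip(d, values))
    return d

print(timeit.timeit(rebuild, number=100000))
print(timeit.timeit(copy_and_update, number=100000))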

Cheers,
Steve

