[Python-Dev] Proposal: dict.with_values(iterable)

Inada Naoki songofacandy at gmail.com
Tue Apr 23 03:08:01 EDT 2019


On Tue, Apr 23, 2019 at 2:54 PM Steve Dower <steve.dower at python.org> wrote:
>
> On 22Apr2019 2143, Inada Naoki wrote:
> > On Tue, Apr 23, 2019 at 11:30 AM Steve Dower <steve.dower at python.org> wrote:
> >>
> >> Or possibly just "dict(existing_dict).update(new_items)".
> >>
> >
> > Do you mean .update accepts values tuple?
> > I can't think it's
>
> Not sure what you were going to go on to say here, but why not?

Sorry, I sent mail without finishing.

dict.update() has too many overloading.
Adding values_tuple is impossible without breaking backward compatibility.

But I think you're saying about items_sequence, not values.

>
> If it's a key-sharing dict, then all the keys are strings. We know that
> when we go to do the update, so we can intern all the strings (going to
> do that anyway) and then it's a quick check if it already exists. If
> it's a regular dict, then we calculate hashes as normal. Updating the
> value is just a decref, incref and assignment.

There are some problem.

1. Searching hash table is not zero-cost, comparing to appending to sequence.
   This cost is very near to building new hash tables.

2. In my proposal, main user is csv.DictReader or sql.DictCursor.
   They parse only values on each rows.   So they need to use map.

3. (CPython only) dict.copy(), dict(dict), and dict.update() are general purpose
   methods.  There is no obvious place to start using key-sharing dict.

That's why I proposed specific method / function for specific purpose.

>
> Note that it .update() would probably require a dict or key/value tuples
> here - but if you have the keys in a tuple already then zip() is going
> to be good enough for setting it (in fact, zip(existing_dict,
> new_values) should be fine, and we can internally special-case that
> scenario, too).

If *CPython* specialized dict(zip(dict, values)),  it still be CPython
implementation detail.
Do you want recommend using such CPython hacky optimization?
Should we use such optimization in stdlib, even if it will be slower
than dict(zip(keys_tuple, values)) on some other Python implementations?

Or do you propose making dict(zip(dict, values)) optimization as
language specification?

One obvious advantage of having DictBuilder is it is for specific
purpose.  It has at least same performance to dict(zip(keys, values))
on all Python implementations.
Libraries like csv parser can use it without worrying about its performance
on Python other than CPython.


> I'd assumed the benefit was in memory usage after
> construction, rather than speed-to-construct, since everyone keeps
> talking about "key-sharing dictionaries" and not "arrays" ;)

Both is important.  I had talked about non key-sharing dict.

> (Randomizing side note: is this scenario enough to make a case for a
> built-in data frame type?)

https://xkcd.com/927/


> >> My primary concern is still to avoid making CPython performance
> >> characteristics part of the Python language definition. That only makes
> >> it harder for alternate implementations.
> >
> > Note that this proposal is not only for key sharing dict:
> >
> > * We can avoid rebuilding hash table again and again.
> > * We can avoid checking duplicated keys again and again.
> >
> > These characteristics are not only for Python, but for all mapping
> > implementations using hash table.
>
> I believe all of these are met by making d2=dict(d1) construct a dict d2
> that shares keys with d1 by default. Can you show how they are not?

If you want only copy, it's same.

>
> * when you only d2.update existing keys, no need to rebuild the table
> * a duplicated key overwrites multiple times - what else are you going
> to do?

But all keys should be looked up.  It is very similar overhead to rebuilding
hash table.

> This is already easiest, fastest, uses the least memory and is
> most consistent with every other form of setting dict items. Why
> complicate things by checking them? Let the caller do it

As I wrote above, it is:

* slower than my proposal.
* no obvious place to start using key sharing dict.


-- 
Inada Naoki  <songofacandy at gmail.com>


More information about the Python-Dev mailing list