[Python-ideas] grouping / dict of lists

Mon Jul 2 03:43:17 EDT 2018

I made some heavy revisions to the PEP. Linking again for convenience.
https://github.com/selik/peps/blob/master/pep-9999.rst

Replying to Guido, Nick, David, Chris, and Ivan in 4 sections below.

[Guido]
On Fri, Jun 29, 2018 at 11:25 PM Guido van Rossum <guido at python.org> wrote:

> On Fri, Jun 29, 2018 at 3:23 PM Michael Selik <mike at selik.org> wrote:
>
>> On Fri, Jun 29, 2018 at 2:43 PM Guido van Rossum <guido at python.org>
>> wrote:
>>
>>> On a quick skim I see nothing particularly objectionable or
>>> controversial in your PEP, except I'm unclear why it needs to be a class
>>> method on `dict`.
>>>
>>
>> Since it constructs a basic dict, I thought it belongs best as a dict
>> constructor like dict.fromkeys. It seemed to match other classmethods like
>> datetime.now.
>>
>
> It doesn't strike me as important enough. Surely not every stdlib function
> that returns a fresh dict needs to be a class method on dict!
>

Thinking back, I may have chosen the name "groupby" first, following
`itertools.groupby`, SQL, and other languages, and I wanted to make a clear
distinction from `itertools.groupby`. Putting it on the `dict` namespace
clarified that it's returning a dict.

However, naming it `grouping` allows it to be a stand-alone function.

> But I still think it is much better off as a helper function in itertools.
>
>> I considered placing it in the itertools module, but decided against
>> because it doesn't return an iterator. I'm open to that if that's the
>> consensus.
>>
>
> You'll never get consensus on anything here, but you have my blessing for
> this without consensus.
>

That feels like a success, but I'm going to be a bit more ambitious and try
to persuade you that `grouping` belongs in the built-ins. I revised my
draft to streamline the examples and make a clearer comparison with
existing tools.

[Nick]
On Sat, Jun 30, 2018 at 2:01 AM Nick Coghlan <ncoghlan at gmail.com> wrote:

> I'm not sure if the draft was updated since [Guido] looked at it, but it
> does mention that one benefit of the collections.Grouping approach is
> being able to add native support for mapping a callable across every
> individual item in the collection (ignoring the group structure), as
> well as for applying aggregate functions to reduce the groups to
> single values in a standard dict.
>
> Delegating those operations to the container API that way then means
> that other libraries can expose classes that implement the grouping
> API, but with a completely different backend storage model.
>

While it'd be nice to create a standard interface as you point out, my
primary goal is to create an "obvious" way for both beginners and experts
to group, classify, categorize, bucket, demultiplex, taxonomize, etc. I
started revising the PEP last night and found myself getting carried away
with adding methods to the Grouping class that were more distracting than
useful. Since the most important thing is to make this as accessible and
easy as possible, I re-focused the proposal on the core idea of grouping.

[Ivan, Chris, David]
On Sun, Jul 1, 2018 at 7:29 PM David Mertz <mertz at gnosis.cx> wrote:

> {k:set(v) for k,v in deps.items()}
> {k:Counter(v) for k,v in deps.items()}
>

I had dropped those specific examples in favor of generically "func(g)",
but added them back. Your discussion with Ivan and Chris showed that it was
useful to be specific.
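
As a minimal sketch of those two comprehensions, assuming `deps` is a dict of
lists such as a grouping step would produce (the sample data here is made up
for illustration):

```python
from collections import Counter

# Hypothetical dict of lists, standing in for the result of a grouping step.
deps = {"app": ["requests", "numpy", "requests"], "cli": ["click"]}

# Collapse each group to its unique members.
unique = {k: set(v) for k, v in deps.items()}

# Tally how often each member occurs within its group.
counts = {k: Counter(v) for k, v in deps.items()}
```

Both start from the same dict of lists, so the list form loses nothing that
the set or Counter form needs.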

[Chris]
On Sat, Jun 30, 2018 at 10:18 PM Chris Barker <chris.barker at noaa.gov> wrote:

> I'm really warming to the:
> Alternate: collections.Grouping
> version -- I really like this as a kind of custom mapping, rather than
> "just a function" (or alternate constructor) -- and I like your point that
> it can have a bit of functionality built in other than on construction.
>

I moved ``collections.Grouping`` to the "Rejected Alternatives" section,
but it's more of a "personal second choice" than "rejected".

> [...]
> __init__ and update would take an iterable of (key, value) pairs, rather
> than a single sequence.
>

I added a better demonstration in the PEP for handling that kind of input.
You have one of two strategies with my proposed function.

Either create a reverse lookup dict:
    d = {v: k for k, v in items}  # maps each value back to its key
    grouping(d, key=lambda v: d[v])  # iterating d yields the values

Or discard the keys after grouping:
    groups = grouping(items, key=lambda t: t[0])
    groups = {k: [v for _, v in g] for k, g in groups.items()}
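
Both strategies can be tried with a stand-in for the proposed function. The
`grouping` below is only a sketch built on `defaultdict`, with its signature
assumed from the calls above; it is not the PEP's implementation, and the
sample `items` are made up.

```python
from collections import defaultdict

def grouping(iterable, key):
    """Hypothetical stand-in for the proposed function: returns a dict of lists."""
    groups = defaultdict(list)
    for item in iterable:
        groups[key(item)].append(item)
    return dict(groups)

# Made-up (key, value) pairs.
items = [("fruit", "apple"), ("veg", "kale"), ("fruit", "pear")]

# Strategy 1: reverse lookup dict. Needs unique, hashable values.
d = {v: k for k, v in items}
by_lookup = grouping(d, key=lambda v: d[v])

# Strategy 2: group the pairs, then discard the keys.
pairs = grouping(items, key=lambda t: t[0])
by_discard = {k: [v for _, v in g] for k, g in pairs.items()}
```

The two results match for this data. The reverse lookup silently drops
duplicates if the same value appears under more than one key, so the second
strategy is the safer default.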

While thinking of examples for this PEP, it's tempting to use
overly-simplified data. In practice, instead of (key, value) pairs, it's
usually either individual values or n-tuple rows. In the latter case,
sometimes the key should be dropped from the row when grouping, sometimes
kept in the row, and sometimes the key must be computed from multiple
values within the row.
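
The three row-handling cases can be sketched by hand with `defaultdict`. The
row layout (city, year, month, amount) and the data are assumptions chosen
only for illustration:

```python
from collections import defaultdict

# Hypothetical n-tuple rows: (city, year, month, amount).
rows = [
    ("Oslo", 2018, 6, 10),
    ("Oslo", 2018, 7, 20),
    ("Rome", 2018, 6, 5),
]

kept = defaultdict(list)      # key kept in the row
dropped = defaultdict(list)   # key dropped from the row
computed = defaultdict(list)  # key computed from several fields

for city, year, month, amount in rows:
    kept[city].append((city, year, month, amount))
    dropped[city].append((year, month, amount))
    computed[(year, month)].append((city, amount))
```

With a key function, only the third case changes the key itself; the first
two differ in what is done to each row after grouping.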

> [...] building up a data structure with word pairs, and a list of all the
> words that follow the pair in a piece of text. [...example code...]
>

I provided a similar example in my first draft, showing the creation of a
Markov chain data structure. A few folks gave the feedback that it was more
distracting from the PEP than useful. It's still there in the "stateful
key-function" example, but it's now just a few lines.

> [...] if you are teaching, say data analysis with Python -- it might be
> nice to have this builtin, but if you are teaching "programming with
> Python" I'd probably encourage them to do it by hand first anyway :-)
>

I agree, but users in both cases will appreciate the proposed built-in.

On Sun, Jul 1, 2018 at 10:35 PM Chris Barker <chris.barker at noaa.gov> wrote:

> Though maybe list, set and Counter are the [aggregation collections] you'd
> want to use?
>

I've been searching the standard library and popular community libraries
for use of setdefault, defaultdict, groupby, and the word "group" or
"groups" periodically over the past year or so. I admit I haven't been as
systematic as maybe I should have been, but I feel like I've been pretty
thorough.

The majority of grouping uses a list. A significant portion use a set. A
handful use a Counter. And that's basically it. Sometimes there's a
specialized container class, but they are generally composed of a list,
set, or Counter. There may have been other types, but if it was
interesting, I think I'd have written down an example of it in my notes.

Most other languages with a similar tool have decided to return a mapping
of lists or the equivalent for that language. If we make that choice, we're
in good company.

> [...]
> before making any decisions about the best API, it would probably be a
> good idea to collect examples of the kind of data that people really do
> need to group like this. Does it come in (key, value) pairs naturally? or
> in one big sequence with a key function that's easy to write? who knows
> without examples of real world use cases.
>

It may not come across in the PEP how much research I've put into this.
I'll need some time to compile the evidence, but I'm confident that it's more
common to need a key-function than to have (key, value) pairs. I'll get
back to you soon(ish) with data.

-- Michael

PS. Not to bikeshed, but a Grouper is a kind of fish. :-)