[Pandas-dev] GroupBy Overhaul Proposal

Tue Jul 17 19:10:47 EDT 2018

> In fact, my preference for keeping apply is pretty weak as long as
> there are alternatives that cover each of its use cases. But again, I'm
> not sure this is true.

Just to clarify my position:

	1. .apply() + UDF reducing to a scalar should be replaceable with .agg() + same UDF (even though there are differences today…)
	2. .apply() + UDF returning Series / DataFrame / collection doesn’t have anything else to cover it
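
To make point 1 concrete, here is a minimal sketch using a made-up frame and a made-up UDF called spread (neither comes from the proposal). It only shows that a scalar-reducing UDF can give the same values through .agg() as through .apply() on a SeriesGroupBy; it does not cover the differences mentioned above.

```python
import pandas as pd

# Toy data, assumed purely for illustration.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 4]})


def spread(values):
    """Hypothetical UDF that reduces each group to a single scalar."""
    return values.max() - values.min()


# Same UDF, two entry points; both return one value per group.
via_apply = df.groupby("a")["b"].apply(spread)
via_agg = df.groupby("a")["b"].agg(spread)

print(via_apply.tolist())
print(via_agg.tolist())
```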

But with #2 above I think it’s dangerous to assume that .apply can always do the “right thing” with those types of return values. We don’t make any assertions about the indexing / labeling of returned Series and DataFrames. As far as collections are concerned, I’m not sure there will be a clear answer on how to handle them, assuming we start getting EAs (ExtensionArrays) that add first-class support for them.

> Unless I'm wrong, #18366 is orthogonal to what we are discussing:
> unnamed lambdas would remain unnamed lambdas.
> (And the obvious solution to my eyes is to use named methods instead)

I don’t think this is orthogonal. Your concern about lambdas is valid, and I don’t know what the solution is there (perhaps some kind of keyword argument). Without getting tripped up on that, though, I don’t think it’s immediately apparent that the returned object for a DataFrame with columns 'a', 'b', 'c' will have a single level of columns when called as follows:

 - df.groupby('a').agg(sum)
 - df.groupby('a').agg({'b': sum, 'c': min})

Yet the following will yield a MultiIndex column:

 - df.groupby('a').agg([sum])
 - df.groupby('a').agg({'b': [sum], 'c': min})
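
A quick way to see the difference is to compare the number of column levels for the four calls. This is only a sketch: the frame is made up, and I am passing the string names "sum" and "min" rather than the builtins, which is an assumption on my part to keep the example independent of how a given pandas version treats builtins.

```python
import pandas as pd

# Toy frame with columns 'a', 'b', 'c', assumed for illustration.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3], "c": [4, 5, 6]})
grouped = df.groupby("a")

# Scalar aggregations: the result keeps a single column level.
flat_1 = grouped.agg("sum")
flat_2 = grouped.agg({"b": "sum", "c": "min"})

# As soon as a list appears, an extra column level is inserted.
multi_1 = grouped.agg(["sum"])
multi_2 = grouped.agg({"b": ["sum"], "c": "min"})

for name, result in [("flat_1", flat_1), ("flat_2", flat_2),
                     ("multi_1", multi_1), ("multi_2", multi_2)]:
    print(name, result.columns.nlevels, list(result.columns))
```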

If you reduce the returned columns to “‘sum’ of ‘b’” and “‘min’ of ‘c’” you can ensure that the returned columns have the same number of levels regardless of call signature, AND have the added bonus of not obfuscating what type of aggregation was performed with the former two examples. Of course the end user may ultimately decide that they don’t like those labels at all and completely override them, but that effort becomes much easier if they can make guarantees around the number of levels of the returned object (especially if it’s just one!).

> - if, after creating all my columns, I want to e.g. select all columns
> that contain sums, I need to do some sort of "df[[col if
> col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"

Unless I am mistaken, you would have to do something like "df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]" to get that to work, since the function name sits in the second column level. I don’t think that syntax is really that clean, and it starts taking us down the path of advanced indexing for what may start off to the end user as a very simple aggregation exercise.
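
For comparison, here is a rough sketch of both selection styles side by side. The frame is made up, the aggregations are passed as strings, and the flat "sum of b" style labels are built by hand here since nothing in pandas produces them today.

```python
import pandas as pd

# Toy frame, assumed for illustration.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3], "c": [4, 5, 6]})

# List-style agg: columns are (column name, function name) pairs.
multi = df.groupby("a").agg(["sum", "min"])

# Selecting every "sum" column means slicing across the first level.
sums_multi = multi.loc[:, (slice(None), "sum")]

# Hand-built flat labels in the proposed "<function> of <column>" style.
flat = multi.copy()
flat.columns = [f"{func} of {col}" for col, func in multi.columns]

# Selecting every "sum" column is then a plain string filter.
sums_flat = flat[[col for col in flat.columns if col.startswith("sum of")]]

print(list(sums_multi.columns))
print(list(sums_flat.columns))
```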

> - it would be the only case in pandas in which we decide how to call a
> column on behalf of the user

Well, we have to do something to reduce ambiguity. I think a consistent naming convention and dimension for the columns across all invocations is strongly preferable to inserting a column level some of the time.

> - if one wants to allow the user to name the columns according to her
> taste, it's pretty simple to introduce an argument which takes a string
> to be .format()ted with the name of the column (or even of the method),
> e.g. name="Sum of {}"

Agreed. In my head this defaults to something like f"{fname} of {colname}" but gives the user the option to override it. By default, keep the same number of levels as what is being passed in, though maybe None as an argument reverts to the old-style behavior of simply inserting a new column index level.
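
As a rough sketch of what I have in mind: the helper agg_flat and its name argument below are hypothetical and do not exist in pandas. It wraps the existing dict-style .agg(), always formats the labels with the template, and falls back to the two-level result when name is None.

```python
import pandas as pd


def agg_flat(grouped, spec, name="{fname} of {colname}"):
    """Hypothetical wrapper: aggregate, then label columns from a template.

    spec maps column name -> function name or list of function names.
    name=None keeps the two-level (column, function) labels.
    """
    # Wrap every entry in a list so the raw result always has two levels.
    wrapped = {col: funcs if isinstance(funcs, list) else [funcs]
               for col, funcs in spec.items()}
    result = grouped.agg(wrapped)
    if name is None:
        return result
    # Collapse (column, function) pairs into one formatted label each.
    result.columns = [name.format(fname=fname, colname=colname)
                      for colname, fname in result.columns]
    return result


# Toy frame, assumed for illustration.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3], "c": [4, 5, 6]})
spec = {"b": "sum", "c": ["min", "max"]}

default_names = agg_flat(df.groupby("a"), spec)
custom_names = agg_flat(df.groupby("a"), spec, name="{colname}_{fname}")
old_style = agg_flat(df.groupby("a"), spec, name=None)

print(list(default_names.columns))
print(list(custom_names.columns))
print(old_style.columns.nlevels)
```

The point of the default is that the result has one column level no matter whether the caller passed scalars or lists.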

> By the way, despite some related issues, I still think tuples can be
> first class citizens of flat indexes. So if one doesn't like
> MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be
> a label in a regular index.

You know better than I do here, but again I don’t think it makes for a good user experience to convert columns with one level into multiple levels after a GroupBy operation regardless of how you could subsequently access those values.

William Ayd
william.ayd at icloud.com
