[Pandas-dev] GroupBy Overhaul Proposal

Mon Jul 16 22:45:24 EDT 2018

Thanks Pietro for your feedback - very much appreciated!

> I would not worry too much about the fact that apply's performance on
> user-provided functions is bad (as long as it's documented), or that
> sum() returns different results from .apply(sum) (again, as long as
> it's documented in sum()'s docstring).

Perfectly fine to ignore performance for now, but I disagree on the second point you make. To a core developer it makes perfect sense that .sum() and .apply(sum) may return different results, but I don’t think that is as apparent to newcomers or just casual users of pandas. In fact I’d worry about a newcomer thinking “why bother with other methods when I can just send everything through apply?” Even if I’m overly concerned about that, I don’t think there’s a simple explanation of when these should differ, which is again why I think it’s a mistake to offer two very similar but subtly different ways of going about that calculation.
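
To make the concern concrete, here is a minimal sketch of one way the two spellings can diverge. The column names and data are made up for illustration, and the lambda is there only so that Python's builtin sum is what actually runs on each group; how a bare .apply(sum) behaves may depend on the pandas version.

```python
import pandas as pd

# Hypothetical data: group "a" contains a missing value.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, float("nan"), 3.0]})
grouped = df.groupby("key")["val"]

# The dedicated reduction skips NaN by default.
via_method = grouped.sum()

# Python's builtin sum, wrapped in a lambda, adds the values one by one,
# so the NaN propagates into the result for group "a".
via_apply = grouped.apply(lambda s: sum(s))

print(via_method)
print(via_apply)
```

Both calls read as "sum each group", yet they disagree as soon as a group has a missing value, which is the kind of difference I don't think a casual user would anticipate.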

> So my (not very informed) opinion is that we could just simplify
> .apply() a lot, reducing it to few simple rules/cases on the kind of
> output returned by the function, to be clearly documented, without
> suppressing it.

Totally agree here. I think there’s a lot of overlap between .apply and other methods, which obscures the need for the former. One example was cited above, but another thing to consider is its overlap with .agg. IMO the “cleanest” use of apply is sending it a function which reduces to a scalar, but in that case you could arguably just use .agg. The other use cases would cover functions returning a Series, DataFrame, collection, etc., and apply kind of “just works” with those, but I think it’s impossible to make guarantees about how to properly piece those types of objects back together. DataFrames of differing dimensions can easily create sparse objects (which may or may not be the intention), and for things like collections I’d question how we will walk the tightrope of expectations if / when we get some Extension Arrays in place that support those as first-class objects in pandas.
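
As a rough illustration of the overlap with .agg, the sketch below passes the same scalar-reducing function to both methods. The data and the value_range helper are hypothetical and exist only for this example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, 4.0, 2.0, 7.0]})
grouped = df.groupby("key")["val"]


def value_range(s):
    """Reduce one group's values to a single scalar."""
    return s.max() - s.min()


# For a function that reduces each group to a scalar, both spellings
# produce one value per group.
via_agg = grouped.agg(value_range)
via_apply = grouped.apply(value_range)

print(via_agg)
print(via_apply)
```

In this scalar case apply adds nothing over .agg, which is why I'd call it redundant there.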

> I miss the technical details, but I don't think we should force the
> output of a DataFrame.groupby()[col].anything() to be a DataFrame...
> 
> ... However there might be a lot of scope for
> code simplification by having the above case _implemented_ as a
> DataFrameGroupBy (or just code in .groupby()), of which we then extract
> the column.

Yeah, that’s a valid point. The suggestion here may be extreme, but with your last statement there I think we are aligned at a high level on the intention and how it could simplify the code.
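
From the user's side, the "compute on the frame, then extract the column" idea might look like the sketch below. This only compares results through the public API with made-up data; it says nothing about how the internals are or would be written.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "A": [1, 2, 3], "B": [10, 20, 30]})

# Select the column first, then reduce (this goes through SeriesGroupBy today).
column_first = df.groupby("key")["A"].sum()

# Reduce the whole frame, then extract the column from the result.
frame_first = df.groupby("key").sum()["A"]

print(column_first.equals(frame_first))
```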

> I understand the concern about the sum of A being called A, which is
> bad, but I would never want "Sum of A" to appear in my DataFrame. I
> think this is the typical task to be solved through a MultiIndex,
> consistently with .agg().

The problem with that is we have a variety of issues that are trying to work around the MultiIndex columns being returned. #18366 is probably the main issue (with 15 upvotes!), but you’ll also see this loosely manifested in #20241, #19978 and potentially quite a few more. 

What is the apprehension with something like “Sum of A”? I’m not tied to that naming per se, but it at least mimics Excel and therefore isn’t that far-fetched of a solution. From an end-user perspective I can see the big gripe being that we kind of force a MultiIndex column on them when they often don’t have one to begin with, and it just adds more complexity and method chaining to their pipeline. Something like “Sum of A” (or whatever else really) could maintain the original dimensions of the columns being used while also being a solution that might work across all of the various aggregation / transformation methods and acceptable arguments.
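
A small sketch of the trade-off, assuming made-up data: passing a list of functions to .agg returns MultiIndex columns, and the flattening step shows what flat names in the spirit of “Sum of A” could look like. The naming scheme is purely hypothetical and is not an existing pandas option.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "A": [1, 2, 3]})

# A list of functions yields MultiIndex columns of (column, function) pairs.
multi = df.groupby("key").agg(["sum", "mean"])

# Hypothetical flat naming; users currently have to do this step themselves.
flat = multi.copy()
flat.columns = [f"{func.capitalize()} of {col}" for col, func in multi.columns]

print(multi.columns.tolist())
print(flat.columns.tolist())
```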

> I never used it indeed... but if it's really just a matter of transpose
> -> operate -> transpose, couldn't we just do this under the hood (and
> maybe warn the user in the docs about performance/dtypes mess)?

That could be an option as well. Curious to hear what others think.
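
For reference, the transpose -> operate -> transpose pattern can already be spelled by hand, roughly as in this sketch. The frame and the column_groups mapping are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x1": [1, 2], "x2": [10, 20], "y1": [100, 200]})

# Hypothetical mapping from column label to group label.
column_groups = {"x1": "x", "x2": "x", "y1": "y"}

# Transpose so the columns become rows, group those rows, then transpose back.
result = df.T.groupby(column_groups).sum().T

print(result)
```

The columns here all share one dtype, so the round trip is clean; with mixed dtypes the transpose upcasts to a common dtype, which is the dtypes mess mentioned above.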

- Will

> On Jul 16, 2018, at 4:11 PM, Pietro Battiston <me at pietrobattiston.it> wrote:
> 
> Hi Will,
> 
> there might be parts of your document I don't entirely understand, but
> I definitely appreciate the desire to clean up the groupby module, and
> have some comments on what I (think I) understood.
> 
> 
> On Mon, 09/07/2018 at 17.18 -0700, William Ayd via Pandas-dev
> wrote:
>> Hi All,
>> 
>> I’ve been thinking through what a redesigned GroupBy module could
>> look like in 1.0. The main problems I am trying to address are:
>> 
>>   - The current module is relatively convoluted, making contribution
>> and debugging challenging
>>   - Behavior is sometimes non-obvious and buggy
>> (see here, here and here as some examples) AND
>>   - We violate the mantra of there being “only one obvious way to do
>> things”
>> 
>> Along those lines, here were four things I thought could be of
>> immense value:
>> • Removal of apply method
> 
> I would not worry too much about the fact that apply's performance on
> user-provided functions is bad (as long as it's documented), or that
> sum() returns different results from .apply(sum) (again, as long as
> it's documented in sum()'s docstring).
> 
> What I think we should definitely avoid is _any_ case of
> 
> apply(a_func)
> 
> giving different results from
> 
> apply(lambda x : a_func(x))
> 
> ... that is, inference can be made on the result on a function, but not
> on the function itself.
> 
> However, this is an issue not specifically related to .apply() - see
> #17035.
> 
> So my (not very informed) opinion is that we could just simplify
> .apply() a lot, reducing it to few simple rules/cases on the kind of
> output returned by the function, to be clearly documented, without
> suppressing it.
> 
> By the way, assuming we keep apply(), I really think it shouldn't be
> too hard to avoid evaluating the first chunk twice.
> 
> Apart from this, I was assuming that apply() covered some cases that no
> other aggregation method covers (e.g. when func returns df but of
> different shape than original chunk)... but I might be wrong.
> 
> 
>> • Removal of DataFrameGroupBy and SeriesGroupBy classes
> 
> I miss the technical details, but I don't think we should force the
> output of a DataFrame.groupby()[col].anything() to be a DataFrame; and
> most importantly, force the "func" in
> 
> DataFrame.groupby()[col].apply(func)
> 
> to accept DataFrame chunks. However there might be a lot of scope for
> code simplification by having the above case _implemented_ as a
> DataFrameGroupBy (or just code in .groupby()), of which we then extract
> the column.
> 
>> • Explicit default column naming
> I understand the concern about the sum of A being called A, which is
> bad, but I would never want "Sum of A" to appear in my DataFrame. I
> think this is the typical task to be solved through a MultiIndex,
> consistently with .agg().
> 
>> • Removal of axis argument
> 
> I never used it indeed... but if it's really just a matter of transpose
> -> operate -> transpose, couldn't we just do this under the hood (and
> maybe warn the user in the docs about performance/dtypes mess)?
> 
> Pietro
