[Pandas-dev] GroupBy Overhaul Proposal

Pietro Battiston me at pietrobattiston.it
Tue Jul 17 03:59:33 EDT 2018


Quick reply on a couple of points.

On Mon, 16/07/2018 at 19:45 -0700, William Ayd wrote:
> [...]
> Even
> if I’m overly concerned about that, I don’t think there’s a simple
> explanation to when these should differ, which is again why I think
> it’s a mistake to offer two very similar but actually slightly
> different ways of going about that calculation.

In fact, my preference for keeping apply is pretty weak as long as
there are alternatives that cover each of its use cases. But again, I'm
not sure this is true.


> > I understand the concern about the sum of A being called A, which
> > is
> > bad, but I would never want "Sum of A" to appear in my DataFrame. I
> > think this is the typical task to be solved through a MultiIndex,
> > consistently with .agg().
> 
> The problem with that is we have a variety of issues that are trying
> to work around the MultiIndex columns being returned. #18366 is
> probably the main issue (with 15 upvotes!),


Unless I'm wrong, #18366 is orthogonal to what we are discussing:
unnamed lambdas would remain unnamed lambdas.
(And the obvious solution, to my eyes, is to use named functions instead.)
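
To illustrate the named-function point, here is a minimal sketch on a toy frame. The frame, the "key" column and the helper name "spread" are made up for the example and are not taken from #18366.

```python
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({"key": ["a", "a", "b"], "A": [1, 5, 3]})

def spread(s):
    # A named function: its __name__ becomes the column label.
    return s.max() - s.min()

# Mixing a built-in aggregation name with a named function.
out = df.groupby("key")["A"].agg(["sum", spread])
labels = list(out.columns)
```

The result is labelled by the function names, so no extra renaming step is needed for this case.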

> but you’ll also see this loosely manifested in #20241, #19978 and
> potentially quite a few more. 

I think we do need a better ability to do in-line renaming of
MultiIndexed DataFrames, regardless of whether they come from
groupby().

> What is the apprehension with something like “Sum of A”? I’m not tied
> to that naming per se, but it at least mimics Excel and therefore
> isn’t that farfetched of a solution.

Problems I have with "Sum of A":
- if, after creating all my columns, I want to e.g. select all columns
that contain sums, I need to do some sort of "df[[col for col in
df.columns if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
- it would be the only case in pandas in which we decide what to name a
column on behalf of the user
- ... and this unexpected behavior is introduced to solve a relatively
specific case of aggregation (1 column -> 1 scalar)
- if one wants to allow the user to name the columns according to her
taste, it's pretty simple to introduce an argument which takes a string
to be .format()ted with the name of the column (or even of the method),
e.g. name="Sum of {}"
- ... although it is actually pretty simple to just do
df.columns = "Sum of " + df.columns
- agg already returns MultiIndexes (when passed multiple functions)
- we would be following Excel as example :-D
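
The points above can be sketched roughly as follows, on a made-up frame. Two assumptions to flag: in the pandas versions I am aware of, agg() puts the function name on the inner column level, so the selection is written with xs() rather than the exact .loc form above; and the name="Sum of {}" argument is only a suggestion, so it is emulated here with rename().

```python
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({"key": ["a", "a", "b"],
                   "A": [1, 2, 3],
                   "B": [10, 20, 30]})

# agg with several functions already returns MultiIndex columns.
multi = df.groupby("key").agg(["sum", "mean"])

# Selecting every "sum" column is one call on the inner level.
sums = multi.xs("sum", axis=1, level=1)

# The flat alternative: rename by hand, then filter by prefix.
flat = df.groupby("key").sum()
flat.columns = "Sum of " + flat.columns
sum_cols = [c for c in flat.columns if c.startswith("Sum of")]

# Emulating the suggested format-string argument with rename().
template = "Sum of {}"
named = df.groupby("key").sum().rename(columns=template.format)
```

With flat string labels, selecting by kind of aggregation relies on string matching, while the MultiIndex version selects on a level.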

By the way, despite some related issues, I still think tuples can be
first class citizens of flat indexes. So if one doesn't like
MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be
a label in a regular index.
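
A small sketch of what I mean by tuples as labels in a flat index, using only the Index constructor and MultiIndex.to_flat_index(); the labels are made up for the example.

```python
import pandas as pd

labels = [("sum", "A"), ("mean", "A")]

# tupleize_cols=False keeps the tuples as plain labels
# instead of building a MultiIndex from them.
flat = pd.Index(labels, tupleize_cols=False)

# For comparison, the same labels as a MultiIndex,
# and the conversion back to a flat index of tuples.
multi = pd.MultiIndex.from_tuples(labels)
back = multi.to_flat_index()
```

Here flat and back are one-level indexes whose labels happen to be tuples.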

Pietro

