[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Thu Dec 7 10:53:12 EST 2017

On Mon, Dec 4, 2017 at 7:52 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> To Jeff's point re this example:
>
> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})
>
> In [3]: df
> Out[3]:
>     A    B
> 0 NaN  NaN
> 1 NaN  0.0
>
> In [4]: df.sum()
> Out[4]:
> A    NaN
> B    0.0
> dtype: float64
>
> By adding a function which behaves in this way, but with a different
> name, we keep the behavior available to the discerning user for whom
> this distinction is meaningful. For other users, for whom this is not
> meaningful, we give df.sum() the same meaning as df.sum().fillna(0).
>
> It's hard to predict which choice will cause the most or least harm to
> users. In either case, we cannot spare our users the expectation of
> some education about the behavior in the presence of missing (or no)
> data. My guess is that the all-NA -> 0 behavior does the least harm by
> default to the average user, because aggregates used in computations
> like weighted sums will not propagate NaNs.
>
> If we need to bump to 0.22.0 to resolve the matter and add the new
> function for Option 2 (in the event that we make Option 1 the behavior
> of sum, which is my preference), that seems OK. If there are users
> that are unsatisfied with the new behavior, we can at least defend
> ourselves with the example set by NumPy's np.nansum and R's sum with
> na.rm=T. Having the alternative method available for Option 2 IMHO
> should be sufficient to satisfy such demanding users.
>
> - Wes
>
> On Mon, Dec 4, 2017 at 5:17 PM, Nathaniel Smith <njs at pobox.com> wrote:
> > On Mon, Dec 4, 2017 at 9:12 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >>> If I understand correctly, you have in mind a replacement for groupby
> >>> such that obj.REPLACEMENT(a_categorical).sum() will have NaN for
> >>> non-observed categories
> >>
> >> No, I am proposing to add a new aggregation method (an alternative to
> >> "sum"). So something like
> >>
> >> s.groupby(...).total()
> >>
> >> or
> >>
> >> s.groupby(...).null_sum()
> >>
> >> (names are hard)
> >
> > Another spelling to consider would be something like sum(skipna="if_any_valid")
> >
> > -n
> >
> > --
> > Nathaniel J. Smith -- https://vorpus.org

In an effort to get things rolling on this, here's an attempt to summarize.

The majority (though not unanimous) preference is for Option 1: empty and
all-NA sums return 0, i.e. SUM([]) = SUM([NA]) = 0. IIUC, Jeff prefers
Option 2 or 3, Jon and Chris prefer Option 2, and Nathaniel prefers Option 1.
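
To make the difference concrete, here is a minimal pure-Python sketch of the
two behaviors. The helper names sum_option1 and sum_option2 are hypothetical
and not part of pandas, and I'm assuming Option 2 means returning NA for both
empty and all-NA input, which isn't spelled out in this message.

```python
import math


def sum_option1(values):
    """Option 1: skip NA values; empty or all-NA input sums to 0."""
    return sum(v for v in values if not math.isnan(v))


def sum_option2(values):
    """Assumed Option 2: skip NA values, but return NaN if none are valid."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) if valid else math.nan


nan = math.nan
# Compare both behaviors on empty, all-NA, and partially-NA input.
for data in ([], [nan], [nan, 0.0], [1.0, nan, 2.0]):
    print(data, sum_option1(data), sum_option2(data))
```

The two helpers only disagree on the first two inputs (empty and all-NA),
which is exactly the case under discussion.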

This means we have two things to sort out before we can make a release:

1. Design and implement Option 1 (including the alternative for returning NA)
2. Decide on the next release's version.

I've opened https://github.com/pandas-dev/pandas/issues/18678 for the first
item, if anyone wants to weigh in there.

For the second item, see
https://github.com/pandas-dev/pandas/issues/18244#issuecomment-350000655

Thanks,

Tom