[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Wes McKinney wesmckinn at gmail.com
Mon Dec 4 20:52:38 EST 2017


To Jeff's point re this example:

In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})

In [3]: df
Out[3]:
    A    B
0 NaN  NaN
1 NaN  0.0

In [4]: df.sum()
Out[4]:
A    NaN
B    0.0
dtype: float64

By adding a function that behaves this way under a different name, we
keep the behavior available to the discerning user for whom the
NaN-versus-0 distinction is meaningful. For users for whom it is not,
df.sum() would give the same result as df.sum().fillna(0) does in the
example above.
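
For illustration only, here is a minimal sketch of the two behaviors side
by side. It assumes a pandas release from 0.22.0 onward, where sum accepts
a min_count argument; that argument is not part of the proposal in this
message and is used here only to reproduce the output shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, np.nan], "B": [np.nan, 0]})

# Option 2 semantics: a column with no valid values sums to NaN.
# min_count=1 requires at least one non-NA value per column.
strict = df.sum(min_count=1)

# Option 1 semantics, spelled as in the message: fill the NaN results with 0.
lenient = strict.fillna(0)

print(strict)
print(lenient)
```

Column A is the only place the two results differ; column B has one valid
value, so both give 0.0 there.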

It's hard to predict which choice will cause the least harm to users.
Either way, users will need some education about how sum behaves in
the presence of missing (or no) data. My guess is that the all-NA -> 0
behavior does the least harm by default to the average user, because
aggregates that feed into computations like weighted sums will then
not propagate NaNs.
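
A small sketch of the propagation concern, in plain Python. The group
totals and weights are made-up numbers chosen for illustration, not data
from this thread.

```python
import math

# Hypothetical per-group totals; the second group was all-NA and summed to NaN.
group_totals = [10.0, math.nan, 5.0]
weights = [0.5, 0.3, 0.2]

# With NaN for the all-NA group, the whole weighted sum becomes NaN.
weighted_nan = sum(w * t for w, t in zip(weights, group_totals))

# With 0 for the all-NA group, the other groups still contribute.
totals_zero = [0.0 if math.isnan(t) else t for t in group_totals]
weighted_zero = sum(w * t for w, t in zip(weights, totals_zero))

print(weighted_nan, weighted_zero)
```

One NaN total is enough to turn the whole result into NaN, which is the
harm the all-NA -> 0 default avoids.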

If we need to bump to 0.22.0 to resolve the matter and add the new
function for Option 2 (in the event that we make Option 1 the behavior
of sum, which is my preference), that seems OK. If there are users
who are unsatisfied with the new behavior, we can at least defend
ourselves with the example set by NumPy's np.nansum and R's sum with
na.rm=T, both of which return 0 for all-NA input. Having the
alternative method available for Option 2 IMHO should be sufficient to
satisfy such demanding users.
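
As a quick check of the NumPy precedent, a sketch assuming a NumPy version
from 1.9 onward (earlier releases treated the all-NaN case differently):

```python
import numpy as np

all_na = np.array([np.nan, np.nan])

# np.nansum treats NaN as zero, so an all-NaN or empty input sums to 0.0.
nansum_all_na = np.nansum(all_na)
nansum_empty = np.nansum(np.array([]))

# The plain sum propagates NaN instead.
plain_sum = np.sum(all_na)

print(nansum_all_na, nansum_empty, plain_sum)
```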

- Wes

On Mon, Dec 4, 2017 at 5:17 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Mon, Dec 4, 2017 at 9:12 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>> If I understand correctly, you have in mind a replacement for groupby
>>> such that obj.REPLACEMENT(a_categorical).sum() will have NaN for
>>> non-observed categories
>>
>> No, I am proposing to add a new aggregation method (an alternative to
>> "sum"). So something like
>>
>> s.groupby(...).total()
>>
>> or
>>
>> s.groupby(...).null_sum()
>>
>> (names are hard)
>
> Another spelling to consider would be something like sum(skipna="if_any_valid")
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
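
To make the sum(skipna="if_any_valid") idea from the quoted message
concrete, here is a rough plain-Python sketch of what such a mode might
compute. The function name sum_if_any_valid is hypothetical, and this is
not an existing pandas option or API.

```python
import math

def sum_if_any_valid(values):
    """Hypothetical helper: skip NaNs, but return NaN when nothing valid remains."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        # Empty or all-NA input: no valid observations, so report NaN.
        return math.nan
    return sum(valid)

print(sum_if_any_valid([math.nan, math.nan]))
print(sum_if_any_valid([math.nan, 0.0]))
```

Under this reading, the mode matches the Option 2 output in the example at
the top of the message: NaN for column A, 0.0 for column B.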

