[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Mon Dec 4 11:05:19 EST 2017

We have been discussing this amongst the pandas core developers for
some time, and the general consensus is to adopt Option 1 (sum of
all-NA or empty is 0) as the behavior for sum with skipna=True.

In a groupby setting, and with categorical group keys, the issue
becomes a bit more nuanced -- if you group by a categorical, and one
of the categories is not observed at all in the dataset, e.g.:

s.groupby(some_categorical).sum()

With Option 1, this will necessarily yield a Series containing no
nulls -- if a category contains no data, then the sum for that
category is 0.
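
To make the groupby case concrete, here is a minimal sketch. It assumes
a pandas release in which Option 1 is in effect and in which groupby()
accepts the `observed` keyword; the data and category labels are made
up for illustration.

```python
import pandas as pd

# Illustrative data: category "c" is declared but never observed.
s = pd.Series([1.0, 2.0, 3.0])
some_categorical = pd.Categorical(
    ["a", "b", "a"], categories=["a", "b", "c"]
)

# observed=False keeps unobserved categories in the result
# (assumes a pandas version that supports this keyword).
result = s.groupby(some_categorical, observed=False).sum()

# Under Option 1 the unobserved category "c" sums to 0, not NA.
print(result)
```

Under Option 2 the entry for "c" would instead be NA, which is the
distinction the rest of this message is about.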

For the sake of algebraic completeness, I believe we should introduce
a new aggregation method that performs Option 2 (equivalent to what
pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
yields NA.

So the TL;DR is:

* We should prepare a 0.21.1 release in short order with Option 1
implemented for sum() (always 0 for empty/all-null) and prod()
(always 1 for empty/all-null)
* Add a new method for Option 2, either in 0.21.1 or in a later release
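
As a rough illustration of the two options with skipna=True, here is a
pure-Python sketch. The function names (sum_option1, prod_option1,
sum_option2) are hypothetical and are not pandas API; NaN stands in
for NA.

```python
import math

NA = float("nan")  # stand-in for a missing value in this sketch


def _valid(values):
    """Drop missing entries, mimicking skipna=True."""
    return [v for v in values
            if not (isinstance(v, float) and math.isnan(v))]


def sum_option1(values):
    """Option 1: empty or all-NA input sums to 0."""
    return sum(_valid(values))


def prod_option1(values):
    """Option 1 for products: empty or all-NA input yields 1."""
    result = 1
    for v in _valid(values):
        result *= v
    return result


def sum_option2(values):
    """Option 2: empty or all-NA input yields NA."""
    valid = _valid(values)
    return sum(valid) if valid else NA


print(sum_option1([NA, NA]), prod_option1([]), sum_option2([NA, NA]))
```

The two options agree whenever at least one valid value is present;
they differ only for empty or all-NA input.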

We should probably post a notice on the long GitHub thread that this
discussion is taking place before we cut the release. Since GitHub
comments can be permanently deleted at any time, I think it's better
for discussions about significant issues like this to take place on
the permanent public record.

Thanks
Wes

On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston <ml at pietrobattiston.it> wrote:
> Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto:
>> [...]
>
> I think Nathaniel just expressed my thoughts better than I was/would be
> able to!
>
> Pietro
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev