[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Mon Dec 4 13:11:50 EST 2017

> We have been discussing this amongst the pandas core developers for
> some time, and the general consensus is to adopt Option 1 (sum of
> all-NA or empty is 0) as the behavior for sum with skipna=True.

Actually, no, there has not been general consensus among the core developers.

Everyone loves to say that sum([NA]) == 0 makes a ton of sense, but then
you have my simple example from the original issue, which Nathaniel did
quote and which I'll repeat here (with a small modification):

In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})

In [3]: df
Out[3]:
    A    B
0 NaN  NaN
1 NaN  0.0

In [4]: df.sum()
Out[4]:
A    NaN
B    0.0
dtype: float64

Option 1 is de facto making [4] have A AND B == 0.0. This loses the fact
that you have a 0 present in B. If you conflate these, you end up in a
situation where I do not know that I had a valid value in B.

Option 2 (and Option 3, for that matter) preserves [4]. This DOES NOT lose
information. No argument has been presented at all for why this should not
hold.
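
To make the two outcomes concrete, here is a minimal sketch. It assumes a
later pandas release (0.22 or newer), where sum() accepts a min_count
argument; that argument did not exist when this message was written and is
used here only as a convenient way to reproduce the Option 1 and Option 2
results for the frame above side by side.

```python
import numpy as np
import pandas as pd

# Same frame as in [2]: A is all-NA, B holds one NA and one real 0.
df = pd.DataFrame({"A": [np.nan, np.nan], "B": [np.nan, 0]})

# Option 1 style: no valid values are required, so an all-NA column sums to 0.
option1 = df.sum(min_count=0)

# Option 2 style: at least one valid value is required, otherwise the result is NA.
option2 = df.sum(min_count=1)

print(option1)
print(option2)
```

Under these assumptions the first result cannot tell A from B, while the
second keeps A as NA and B as 0.0, which is the distinction [4] relies on.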

From [4] it follows that sum([NA]) must be NA.

I am indifferent as to whether sum([]) == 0 or NA, though I would argue
that NA is more consistent with the rest of pandas (IOW *every* other
operation on an empty Series returns NA).
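
As a rough illustration of the empty case, the sketch below compares a
couple of other reductions on an empty Series with sum(). It again assumes
pandas 0.22 or newer for the min_count argument, and the variable names are
only illustrative.

```python
import numpy as np
import pandas as pd

# An empty float Series; the dtype is given explicitly to avoid ambiguity.
empty = pd.Series([], dtype="float64")

# Other reductions on an empty Series have nothing to return but NA.
mean_of_empty = empty.mean()
max_of_empty = empty.max()

# sum() with no minimum count returns the additive identity ...
sum_zero = empty.sum(min_count=0)
# ... while requiring at least one valid value returns NA, like mean/max.
sum_na = empty.sum(min_count=1)

print(mean_of_empty, max_of_empty, sum_zero, sum_na)
```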

> * We should prepare a 0.21.1 release in short order with Option 1
> implemented for sum() (always 0 for empty/all-null) and prod() (1,
> respectively)

I can certainly understand pandas reverting back to the de facto state of
affairs prior to 0.21.0, which would be Option 3, but a radical change in a
minor release is not warranted at all. Frankly, we have (and are likely to
get) only a small fraction of users' opinions on this whole matter.

Jeff

On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney <wesmckinn at gmail.com> wrote:

> We have been discussing this amongst the pandas core developers for
> some time, and the general consensus is to adopt Option 1 (sum of
> all-NA or empty is 0) as the behavior for sum with skipna=True.
>
> In a groupby setting, and with categorical group keys, the issue
> becomes a bit more nuanced -- if you group by a categorical, and one
> of the categories is not observed at all in the dataset, e.g:
>
> s.groupby(some_categorical).sum()
>
> This change will necessarily yield a Series containing no nulls -- so
> if there is a category containing no data, then the sum for that
> category is 0.
>
> For the sake of algebraic completeness, I believe we should introduce
> a new aggregation method that performs Option 2 (equivalent to what
> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
> yields NA.
>
> So the TL;DR is:
>
> * We should prepare a 0.21.1 release in short order with Option 1
> implemented for sum() (always 0 for empty/all-null) and prod() (1,
> respectively)
> * Add a new method for Option 2, either in 0.21.1 or in a later release
>
> We should probably alert the long GitHub thread that this discussion
> is taking place before we cut the release. Since GitHub comments can
> be permanently deleted at any time, I think it's better for
> discussions about significant issues like this to take place on the
> permanent public record.
>
> Thanks
> Wes
>
> On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston <ml at pietrobattiston.it>
> wrote:
> > Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto:
> >> [...]
> >
> > I think Nathaniel just expressed my thoughts better than I was/would be
> > able to!
> >
> > Pietro
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev