[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Jeff Reback jreback at yahoo.com
Fri Dec 8 19:17:37 EST 2017


From Stephan Hoyer <shoyer at gmail.com>:
> As many of us have argued, it is quite surprising for sum([], skipna=True) and sum([NaN], skipna=True) to differ.
I agree wholeheartedly with this point. However, these should simply be NaN and not 0.
Otherwise you have an inconsistency with the other reduction operations, e.g. .min(), .mean(), and so on.
> Yes, in most cases. But this isn't what skipna=True does, which is explicitly an indication to skip NaNs.
Here's where we differ. skipna=True does not mean "remove the NaNs and then compute the operation"; rather, it means "ignore the NaNs in computing the operation". These are distinct, and the distinction is the crux of NaN propagation. This is simply a practical view of things.
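For reference, a quick illustration of the two calls under discussion. The behavior shown is that of pandas >= 0.22, which ultimately settled on returning 0 in both cases, so at least the two agree with each other (Jeff's position is that both should instead be NaN):

```python
import numpy as np
import pandas as pd

# The two calls under debate; in pandas >= 0.22 both return 0.0:
empty_sum = pd.Series([], dtype=float).sum()   # sum of an empty Series
all_na_sum = pd.Series([np.nan]).sum()         # NaN skipped, skipna=True is the default
```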

From Tom's response above:
> In [3]: pd.Series([1, np.nan]).sum()
> Out[3]: 1.0

This is of course exactly the purpose of pandas: ignoring NaNs (skipna=True) is a very sensible default. Sure, one could always mask the NaNs oneself and do anything, but again I WILL belabor the point. Pandas
is meant to be obvious and sensible.

Making all-NaN columns do something different from mostly NaN columns would be a completely odd state of affairs.
This would be special casing all-NaN. Why would we want to add special cases?
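To make the special-casing concrete (behavior shown is that of pandas >= 0.22): with the skipping sum, a mostly-NaN column and an all-NaN column are treated uniformly, whereas under the proposal being argued against the all-NaN column alone would flip to NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mostly_nan": [1.0, np.nan, np.nan],
                   "all_nan":    [np.nan, np.nan, np.nan]})

# Uniform treatment: NaNs are skipped in both columns.
sums = df.sum()   # mostly_nan -> 1.0, all_nan -> 0.0
```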
Finally, we have a very, very limited response from users/developers here (in this thread). I could be completely wrong, but I suspect many users have been *relatively* happy with pandas's choices over the years. Sure, we sometimes make decisions that turn out to be wrong, and we do change them.
In this case I am raising my hand for all of the happy users, many of whom may not have commented here.

Jeff






    On Friday, December 8, 2017, 2:24:55 PM EST, Stephan Hoyer <shoyer at gmail.com> wrote:  
 
 On Fri, Dec 8, 2017 at 4:20 AM Jeff Reback via Pandas-dev <pandas-dev at python.org> wrote:

> Pandas is all about propagating NaNs in a reliable and predictable way.
> Folks do a series of calculations, preserving NaNs.

Yes, in most cases. But this isn't what skipna=True does, which is explicitly an indication to skip NaNs.
As many of us have argued, it is quite surprising for sum([], skipna=True) and sum([NaN], skipna=True) to differ.

> I would argue that the folks who want a guaranteed zero for all-NaN can simply fill first. The reverse operation is simply not possible, nor desired in any actual real-world scenario.

I think this is a little strong. As Tom points out, we could add another keyword option to sum, but even without that there are plenty of one-liners to achieve the version of sum() where all-NaN/empty inputs result in NaN.
For example: df.count() * df.mean() 
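Stephan's one-liner, written out: `df.count() * df.mean()` equals the skipping sum wherever at least one value is present, and is NaN for empty/all-NA columns (since count is 0 and mean is NaN). As a historical note, pandas 0.22 ultimately added a `min_count` keyword to `sum` for exactly this purpose, e.g. `df.sum(min_count=1)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# A sum that propagates NaN for all-NA columns:
# count * mean == sum where data exists; 0 * NaN == NaN where it doesn't.
nan_propagating_sum = df.count() * df.mean()   # a -> 1.0, b -> NaN
```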
> As for a decent compromise, option 3 is almost certainly the best option, where we revert sum([]) == NA back to 0.

Yes, we could choose this if we wanted to defer breaking changes until a later release. But I think it is strictly inferior to either option (1) or (2), both of which are consistent in their own way.
> Making a radical departure from the status quo (e.g. option 1) should be given considered debate and not be 'rushed' in as a quick 'fix' to a supposed problem.

We have been debating this for quite some time already (weeks, months?). Nearly everyone who cares has chimed in, including all active core developers. I think it is fair to say that most (but not all) of us think option 1 is the most sensible choice.  

