[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Sat Dec 2 04:02:34 EST 2017

Hi Joris,

On Fri, 01/12/2017 at 02:09 +0100, Joris Van den Bossche wrote:
> [...] We see three different options for the default behaviour of sum
> for those two cases of empty and all-NA series:
> 
> Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
> 
> Behaviour of pandas < 0.21 + bottleneck installed
> Consistent with NumPy, R, MATLAB, etc. (given you use the variant
> that is NA aware: nansum for numpy, na.rm=TRUE for R, ...)
> 
> Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
> 
> The behaviour that is introduced in 0.21.0
> Consistent with SQL (although often (rightly or not) complained
> about)
> 
> Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
>    
> Behaviour of pandas < 0.21 (without bottleneck installed)
> A practicable compromise (having SUM([NA]) keep the information of
> NA, while SUM([]) = 0 does not introduce NAs when there were none in
> the data)
> But somewhat inconsistent and unique to pandas?

I'm 100% sure I want sum([]) to return 0, not NA. It's not just more
elegant, it also makes much more sense in my daily workflow (e.g. if I
randomly split the data into samples, groupby, sum, and then sum the
results for each group across samples, I want the same result as if I
just groupby and sum, without splitting).
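
To make the splitting argument concrete, here is a minimal sketch in
plain Python (not pandas, so it does not depend on which pandas default
is in force). The group names and values are made up for illustration.
It only works out because the sum of an empty list is 0:

```python
# Hypothetical data: two groups, "a" and "b".
data = {"a": [1, 2, 3], "b": [4]}

# Split every group into two samples; group "b" is empty in sample2.
sample1 = {"a": [1], "b": [4]}
sample2 = {"a": [2, 3], "b": []}

# Group sums computed directly on the full data.
direct = {k: sum(v) for k, v in data.items()}

# Group sums computed per sample, then added across samples.
via_samples = {k: sum(sample1[k]) + sum(sample2[k]) for k in data}

print(direct == via_samples)  # True, because sum([]) == 0
```

If the empty sum were NA instead, the "b" total computed via samples
would become NA even though no NA was present in the data.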

I'm also 99% sure I want sum([NA]) to return the same as sum([0, NA])
returns, for the same reasons, and possibly more.
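
A small sketch of why these two should agree once NAs are skipped,
again in plain Python. The helper name sum_skipna is hypothetical (it is
not a pandas or numpy function), and NaN stands in for NA here:

```python
import math

def sum_skipna(values):
    """Hypothetical helper: sum the values, ignoring NaN entries."""
    return sum(v for v in values if not math.isnan(v))

nan = float("nan")

# After dropping NaN, [nan] becomes [] and [0, nan] becomes [0],
# so both reduce to a sum of 0.
print(sum_skipna([nan]) == sum_skipna([0, nan]))  # True
```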

I would probably like to have sum([0, NA]) and sum([NA]) both return
NA (I was initially surprised by this deviation from numpy), but since
you don't mention this, it's apparently not open for discussion (which
I understand, given that it would break a lot of code).

So assuming my interpretation is correct, I am enormously in favour of
your option 1.

While we're talking about this: I guess the same applies to
pd.Series([]).prod(), which should (in my view) return 1?
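
The analogous sketch for products, in plain Python rather than pandas
(math.prod needs Python 3.8 or later; the values are made up). The
empty product being 1 is what keeps "product of the parts" equal to
"product of the whole":

```python
import math

values = [2, 3, 5]          # made-up data
part1, part2 = values, []   # a split where the second part is empty

# The empty product is the multiplicative identity.
print(math.prod([]))  # 1

# Multiplying the per-part products recovers the overall product.
print(math.prod(part1) * math.prod(part2) == math.prod(values))  # True
```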

Pietro