[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Tom Augspurger tom.augspurger88 at gmail.com
Sat Dec 2 08:59:36 EST 2017


On Sat, Dec 2, 2017 at 3:02 AM, Pietro Battiston <ml at pietrobattiston.it>
wrote:

> Hi Joris,
>
> On Fri, 1 Dec 2017 at 02:09 +0100, Joris Van den Bossche wrote:
> > [...] We see three different options for the default behaviour of sum
> > for those two cases of empty and all-NA series:
> >
> > 1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
> >
> >    - Behaviour of pandas < 0.21 + bottleneck installed
> >    - Consistent with NumPy, R, MATLAB, etc. (given you use the variant
> >      that is NA aware: nansum for numpy, na.rm=TRUE for R, ...)
> >
> > 2. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
> >
> >    - The behaviour that is introduced in 0.21.0
> >    - Consistent with SQL (although often (rightly or not) complained
> >      about)
> >
> > 3. Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
> >
> >    - Behaviour of pandas < 0.21 (without bottleneck installed)
> >    - A practical compromise (having SUM([NA]) keep the information of
> >      NA, while SUM([]) = 0 does not introduce NAs when there were none
> >      in the data)
> >    - But somewhat inconsistent and unique to pandas?
>
>
> I'm 100% sure I want sum([]) to return 0, not NA. It's not just more
> elegant, it also makes much more sense in my daily workflow (e.g. if I
> randomly split into samples, groupby, sum, and then sum the results for
> each group across samples, I want the same result as if I just groupby
> and sum, without splitting).
>
> I'm also 99% sure I want sum([NA]) to return the same as sum([0, NA]),
> following the same arguments, and possibly more.
>
> I would probably like to have sum([0, NA]) (and sum([NA])) both return
> NA (I was initially surprised by this deviation from numpy), but since
> you don't mention this, it's apparently not open for discussion (and I
> understand, given that it would break a lot of code).
>
> So assuming my interpretation is correct, I am enormously in favour of
> your option 1.
>
> While we're talking about this: I guess the same applies to
> pd.Series([]).prod(), which should (in my view) return 1?
>


Yes, I think there's agreement that if we go with option 1, we would also
want to make prod behave similarly, with 1 as the unit instead of 0.
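
As a rough illustration of why 0 and 1 are the natural results under
option 1, here is a minimal sketch in plain Python rather than pandas. The
helpers nan_aware_sum and nan_aware_prod are hypothetical names made up for
this example; they only mimic the option 1 rule (skip NA, return the
identity element when nothing is left) and say nothing about how pandas
implements it. The data and the way it is split are also just assumptions
for the demonstration.

```python
import math

NA = float("nan")  # stand-in for a missing value in this sketch


def nan_aware_sum(values):
    """Hypothetical helper: skip NA; empty or all-NA input gives 0."""
    return sum(v for v in values if not math.isnan(v))


def nan_aware_prod(values):
    """Hypothetical helper: skip NA; empty or all-NA input gives 1."""
    return math.prod(v for v in values if not math.isnan(v))


data = [1.0, NA, 2.5, 4.0, NA]

# Split the data into samples; one sample is empty and one is all-NA.
samples = [[1.0], [], [NA], [NA, 2.5, 4.0, NA]]

# Reducing everything at once ...
total_direct = nan_aware_sum(data)
prod_direct = nan_aware_prod(data)

# ... matches reducing each sample and then combining the partial results,
# because the empty and all-NA samples contribute the identity (0 or 1).
total_split = nan_aware_sum(nan_aware_sum(s) for s in samples)
prod_split = nan_aware_prod(nan_aware_prod(s) for s in samples)

assert total_direct == total_split
assert prod_direct == prod_split
```

If the empty or all-NA sample returned NA instead, the combined result
would depend on whether the combining step skips NA or propagates it, which
is the inconsistency Pietro describes for the split / groupby / sum case.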

Tom



> Pietro