[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Nathaniel Smith njs at pobox.com
Sat Dec 2 20:32:38 EST 2017


On Thu, Nov 30, 2017 at 5:09 PM, Joris Van den Bossche
<jorisvandenbossche at gmail.com> wrote:
> Options
>
> We see three different options for the default behaviour of sum for those
> two cases of empty and all-NA series:
>
>
> Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
>
> Behaviour of pandas < 0.21 + bottleneck installed
>
> Consistent with NumPy, R, MATLAB, etc. (given you use the variant that is NA
> aware: nansum for numpy, na.rm=TRUE for R, ...)
>
>
> Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
>
>
> The behaviour that is introduced in 0.21.0
>
> Consistent with SQL (although often (rightly or not) complained about)
>
>
> Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
>
>
>
> Behaviour of pandas < 0.21 (without bottleneck installed)
>
> A practicable compromise (having SUM([NA]) keep the information of NA, while
> SUM([]) = 0 does not introduce NAs when there were none in the data)
>
> But somewhat inconsistent, and unique to pandas?
>
>
> We have to stress that each of those choices can be preferable depending on
> the use case, and each has its advantages and disadvantages. Some might be more
> mathematically sound, others might preserve more information about having
> missing data, and each can be more consistent with a certain ecosystem, … It
> is clear that there is no ‘best’ option for all cases.
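For concreteness: option 1 is exactly what NumPy's nansum does today (the pandas behaviors differ by version and by whether bottleneck is installed, as described above):

```python
import numpy as np

# Option 1: empty and all-NA sums are both zero, as in np.nansum
print(np.nansum([]))        # 0.0 -- empty input
print(np.nansum([np.nan]))  # 0.0 -- all-NA input
```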

I understand you want to try to avoid bikeshedding here, but it's hard
to discuss without any rationales at all :-).

I am baffled by the idea that sum([]) would return NaN. I'm sure
there are some benefits, I just can't think of any. (OK, SQL does it,
but SQL contains all kinds of indefensible things...)

I am baffled by the idea that sum([]) and sum([NaN], skipna=True)
would return different values. I'm sure there are some benefits, I
just can't think of any.

Can someone who does understand the trade-offs explain?
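One way to see why these two cases are naturally identical under option 1: after skipping NAs, [NaN] literally becomes [], so any skipna sum implemented as "drop, then reduce" cannot tell them apart. A hypothetical sketch (not pandas internals):

```python
import math

def skipna_sum(values):
    """Drop NaNs, then sum what's left -- the 'option 1' behavior.
    After dropping, an all-NaN input is indistinguishable from an
    empty input, so both return 0."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept)  # sum([]) == 0 by Python's own convention

print(skipna_sum([]))                 # 0
print(skipna_sum([float('nan')]))     # 0
print(skipna_sum([float('nan'), 1]))  # 1
```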

The email says that sum([NaN], skipna=True) returning NaN is
"preserving information", and briefly skimming github issue #9422 I
see some arguments that "NA should propagate", but I don't understand
why it's crucial that NaN should propagate when you have [NaN] but
that it shouldn't propagate for [NaN, 1]. I was particularly confused
by this comment from Jeff, which I will paraphrase in a rude way to
make my point (click the link for the original text):

https://github.com/pandas-dev/pandas/issues/9422#issuecomment-169508202
> see the point is pandas [is intentionally designed not to propagate NaNs in sum unless you specifically propagate them], so we basically use nansum-like behavior for everything. The issue is if you ONLY have NaNs then numpy says it should be 0, because nansum 'skips' NaNs.
> But in pandas that is completely misleading and lossy, because NaNs by definition propagate (unless you specifically don't propagate them).

So... pandas chooses not to propagate nans by default, and this is
misleading and lossy because nans should propagate by default? I
actually agree with this, but this is an argument that skipna=False
should be the default, not that there should be a special case where
NaN propagation gets flipped on and off depending on the values in an
array.

I guess one argument for option 3 could be that skipna=True was a
mistake -- it makes it easier to get *some* result but also increases
the chance of silently getting garbage -- but now we're stuck with it,
and at least option 3 lets us inch towards the skipna=False behavior?
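To spell out what that option-3 special case amounts to (a hypothetical sketch, not pandas code): it behaves like skipping NaNs, except that a nonempty input whose values were *all* skipped returns NaN instead of 0 -- i.e. NaN propagation flips on depending on the data:

```python
import math

def mixed_sum(values):
    """Option 3: SUM([]) = 0 but SUM([all-NA]) = NA.
    Identical to a skipna sum, except for the all-NA special case."""
    kept = [v for v in values if not math.isnan(v)]
    if values and not kept:
        return float('nan')  # nonempty but everything was NA
    return sum(kept)

print(mixed_sum([]))                    # 0
print(mixed_sum([float('nan')]))        # nan
print(mixed_sum([float('nan'), 1.0]))   # 1.0
```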

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


More information about the Pandas-dev mailing list