[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Jeff Reback jreback at yahoo.com
Fri Dec 8 07:19:59 EST 2017


Using Tom's example
In [1]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]})

In [2]: df
Out[2]:
     A    B  C
0  NaN  NaN  0
1  NaN  0.0  0

In [3]: df.sum()
Out[3]:
A    NaN
B    0.0
C    0.0
dtype: float64

Pandas is all about propagating NaNs in a reliable and predictable way. Folks do a series of calculations, preserving NaNs. For example:
In [5]: df.sum() + 1Out[5]: A    NaNB    1.0C    1.0dtype: float64
makes it perfectly obvious that we have NaN preserving operations
Option 1 is essentially:
In [4]: df.fillna(0).sum()
Out[4]:
A    0.0
B    0.0
C    0.0
dtype: float64

Using the same operation as [5], but now with all-NaN summing to 0, we have a situation where we are no longer NaN-preserving. In any actual real-world calculation this is a disaster and the worst possible scenario.

In [6]: df.fillna(0).sum() + 1
Out[6]:
A    1.0
B    1.0
C    1.0
dtype: float64

Changing this behavior shakes the core tenets of pandas: suddenly we have a special case where NaN propagation is not important anymore and, worse, you may get wrong answers.
We have always consistently allowed reduction operations to return NaN (with the exception of count, which is actually counting non-NaNs).
I would argue that the folks who want a guaranteed zero for all-NaN can simply fill first. The reverse operation is simply not possible, nor desired in any actual real-world scenario.
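Jeff's fill-first point can be sketched directly. This is a minimal sketch against current pandas, where plain .sum() now returns 0 for all-NA and the NaN-preserving behavior he describes is requested explicitly with min_count=1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0.0]})

# Anyone who wants a guaranteed zero can fill first:
filled = df.fillna(0).sum()        # A -> 0.0, B -> 0.0

# The NaN-preserving sum keeps the all-NA column distinct.
# (In 0.21.0 plain .sum() did this; later versions need min_count=1.)
kept = df.sum(min_count=1)         # A -> NaN, B -> 0.0

# The reverse operation does not exist: once both columns read 0.0,
# nothing in `filled` records that A had no valid values at all.
print(filled)
print(kept)
```

The asymmetry is the point: filling is a one-liner for those who want zeros, while un-filling is information that is simply gone.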
Pandas is not about strict mathematical purity, but rather about real-world utility.
As for a decent compromise, option 3 is almost certainly the best option, where we revert sum([]) == NA to be 0. This would put us back to pre-0.21.0 pandas without bottleneck, likely the biggest installed population. This option would cause the least friction, while maintaining consistency and practicality.
Making a radical departure from status quo (e.g. option 1) should have considered debate and not be 'rushed' in as a quick 'fix' to a supposed problem.

Jeff    On Monday, December 4, 2017, 1:27:33 PM EST, Tom Augspurger <tom.augspurger88 at gmail.com> wrote:  
 
 

On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback <jeffreback at gmail.com> wrote:

> We have been discussing this amongst the pandas core developers for
some time, and the general consensus is to adopt Option 1 (sum of
all-NA or empty is 0) as the behavior for sum with skipna=True.

Actually, no, there has not been general consensus among the core developers.

I think that's the preference of the majority though.
 
Everyone loves to say that sum([NA]) == 0 makes a ton of sense, but then you have my simple example from the original issue, which Nathaniel did quote and I'll repeat here (with a small modification):
In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})
In [3]: df
Out[3]:
     A    B
0  NaN  NaN
1  NaN  0.0

In [4]: df.sum()
Out[4]:
A    NaN
B    0.0
dtype: float64

Option 1 is de facto making [4] have A AND B == 0.0. This loses the fact that you have a 0 present in B. If you conflate these, you then have a situation where I do not know that I had a valid value in B.
Option 2 (and 3, for that matter) preserves [4]. This DOES NOT lose information. No argument has been presented at all why this should not hold.
From [4] it follows that sum([NA]) must be NA.

Extending that slightly:


In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]})

In [5]: df.sum()
Out[5]:
A    NaN
B    0.0
C    0.0
dtype: float64

This is why I don't think the "preserving information" argument is correct. Taking "preserving information"
to its logical conclusion would return NaN for "B", since that distinguishes between the sum of all
valid values and the sum with some NaNs.


I am indifferent whether sum([]) == 0 or NA. Though I would argue that NA is more consistent with the rest of pandas (IOW *every* other operation on an empty Series returns NA).
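That consistency claim is easy to spot-check. A quick sketch against current pandas, with count as the one exception flagged earlier in the thread:

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype=float)

# Most reductions on an empty Series return NaN...
print(empty.mean())   # nan
print(empty.max())    # nan
print(empty.min())    # nan

# ...while count (which counts non-NaN values) returns 0.
print(empty.count())  # 0
```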
> * We should prepare a 0.21.1 release in short order with Option 1
implemented for sum() (always 0 for empty/all-null) and prod() (1,
respectively)

I can certainly understand pandas reverting back to the de-facto state of affairs prior to 0.21.0, which would be option 3, but a radical change on a minor release is not warranted at all. Frankly, we have (and are likely to get) only a small fraction of users' opinions on this whole matter.

Yeah, agreed that bumping to 0.22 is for the best.
 
Jeff

On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney <wesmckinn at gmail.com> wrote:

We have been discussing this amongst the pandas core developers for
some time, and the general consensus is to adopt Option 1 (sum of
all-NA or empty is 0) as the behavior for sum with skipna=True.

In a groupby setting, and with categorical group keys, the issue
becomes a bit more nuanced -- if you group by a categorical, and one
of the categories is not observed at all in the dataset, e.g:

s.groupby(some_categorical).sum()

This change will necessarily yield a Series containing no nulls -- so
if there is a category containing no data, then the sum for that
category is 0.
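A minimal sketch of the groupby case described above; the data and names are illustrative, and observed=False is passed explicitly so the unobserved category gets a row (matching the default of that era):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])
cat = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])

# 'c' never occurs in the data, yet it still gets a group row.
result = s.groupby(cat, observed=False).sum()
print(result)
# a    3.0
# b    0.0   <- its only value was NaN
# c    0.0   <- no data at all
```

Note the result contains no nulls at all: the all-NaN group and the empty group are indistinguishable from a group that genuinely summed to zero.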

For the sake of algebraic completeness, I believe we should introduce
a new aggregation method that performs Option 2 (equivalent to what
pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
yields NA.

So the TL;DR is:

* We should prepare a 0.21.1 release in short order with Option 1
implemented for sum() (always 0 for empty/all-null) and prod() (1,
respectively)
* Add a new method for Option 2, either in 0.21.1 or in a later release
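For context, what pandas ultimately shipped in 0.22.0 was not a separate method but a min_count keyword on sum/prod; a short sketch of that behavior as I understand current pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())              # 0.0  -- Option 1: empty/all-NA sum is 0
print(s.sum(min_count=1))   # nan  -- Option 2: require >= 1 valid value
print(s.prod())             # 1.0  -- the empty-product identity
print(s.prod(min_count=1))  # nan
```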

We should probably alert the long GitHub thread that this discussion
is taking place before we cut the release. Since GitHub comments can
be permanently deleted at any time, I think it's better for
discussions about significant issues like this to take place on the
permanent public record.

Thanks
Wes

On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston <ml at pietrobattiston.it> wrote:
> Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto:
>> [...]
>
> I think Nathaniel just expressed my thoughts better than I was/would be
> able to!
>
> Pietro
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev






  