[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Mon Dec 4 13:27:09 EST 2017

On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback <jeffreback at gmail.com> wrote:

> > We have been discussing this amongst the pandas core developers for
> some time, and the general consensus is to adopt Option 1 (sum of
> all-NA or empty is 0) as the behavior for sum with skipna=True.
>
> Actually, no there has not been general consensus among the core
> developers.
>

I think that's the preference of the majority though.

> Everyone loves to say that
>
> s.sum([NA]) == 0 makes a ton of sense, but then you have my simple example
> from original issue, which Nathaniel did quote and I'll repeat here (with
> a small modification):
>
> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})
>
> In [3]: df
> Out[3]:
>     A    B
> 0 NaN  NaN
> 1 NaN  0.0
>
> In [4]: df.sum()
> Out[4]:
> A    NaN
> B    0.0
> dtype: float64
>
>
> Option 1 is de facto making [4] have A AND B == 0.0. This loses the fact
> that you have 0 present in B. If you conflate these, you then have a
> situation where I do not know that I had a valid value in B.
>
> Option 2 (and 3, for that matter) preserves [4]. This DOES NOT lose
> information. No argument has been presented at all why this should not
> hold.
>
> From [4] it follows that sum([NA]) must be NA.
>

Extending that slightly:

In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]})

In [5]: df.sum()
Out[5]:
A    NaN
B    0.0
C    0.0
dtype: float64

This is why I don't think the "preserving information" argument is correct.
Taking "preserving information" to its logical conclusion would return NaN
for "B", since that distinguishes between the sum of all-valid values and
the sum with some NaNs.
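
To make that concrete, here is a minimal plain-Python sketch. It is not
pandas' implementation: `sum_skipna` and `valid_count` are hypothetical
helpers, and the fill value passed for the empty/all-NA case stands in for
Option 1 (0) or Option 2 (NA). It runs both on the A/B/C columns above.

```python
import math

nan = math.nan


def is_na(x):
    # Treat float NaN as the missing-value marker (an assumption of this sketch).
    return isinstance(x, float) and math.isnan(x)


def sum_skipna(values, empty_result):
    # Sum the non-NA entries; fall back to `empty_result` when none are valid.
    valid = [v for v in values if not is_na(v)]
    return sum(valid) if valid else empty_result


def valid_count(values):
    # Number of non-NA entries, i.e. the information a bare sum cannot carry.
    return sum(1 for v in values if not is_na(v))


columns = {"A": [nan, nan], "B": [nan, 0], "C": [0, 0]}

option1 = {k: sum_skipna(v, 0.0) for k, v in columns.items()}  # all-NA -> 0
option2 = {k: sum_skipna(v, nan) for k, v in columns.items()}  # all-NA -> NA
counts = {k: valid_count(v) for k, v in columns.items()}
```

With either fill value, B and C both sum to 0; only the valid count (1
versus 2) tells them apart, so the sum alone does not preserve that
information under any of the options.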

> I am indifferent whether sum([]) == 0 or NA, though I would argue that NA
> is more consistent with the rest of pandas (IOW *every* other operation on
> an empty Series returns NA).
>
> > * We should prepare a 0.21.1 release in short order with Option 1
> implemented for sum() (always 0 for empty/all-null) and prod() (1,
> respectively)
>
> I can certainly understand pandas reverting back to the de facto state of
> affairs prior to 0.21.0, which would be option 3, but a radical change on
> a minor release is not warranted at all. Frankly, we have (and are likely
> to get) only a small fraction of users' opinions on this whole matter.
>

Yeah, agreed that bumping to 0.22 is for the best.

> Jeff
>
>
> On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>
>> We have been discussing this amongst the pandas core developers for
>> some time, and the general consensus is to adopt Option 1 (sum of
>> all-NA or empty is 0) as the behavior for sum with skipna=True.
>>
>> In a groupby setting, and with categorical group keys, the issue
>> becomes a bit more nuanced -- if you group by a categorical, and one
>> of the categories is not observed at all in the dataset, e.g:
>>
>> s.groupby(some_categorical).sum()
>>
>> This change will necessarily yield a Series containing no nulls -- so
>> if there is a category containing no data, then the sum for that
>> category is 0.
>>
>> For the sake of algebraic completeness, I believe we should introduce
>> a new aggregation method that performs Option 2 (equivalent to what
>> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
>> yields NA.
>>
>> So the TL;DR is:
>>
>> * We should prepare a 0.21.1 release in short order with Option 1
>> implemented for sum() (always 0 for empty/all-null) and prod() (1,
>> respectively)
>> * Add a new method for Option 2, either in 0.21.1 or in a later release
>>
>> We should probably alert the long GitHub thread that this discussion
>> is taking place before we cut the release. Since GitHub comments can
>> be permanently deleted at any time, I think it's better for
>> discussions about significant issues like this to take place on the
>> permanent public record.
>>
>> Thanks
>> Wes
>>
>> On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston <ml at pietrobattiston.it>
>> wrote:
>> > On Sat, 02/12/2017 at 17:32 -0800, Nathaniel Smith wrote:
>> >> [...]
>> >
>> > I think Nathaniel just expressed my thoughts better than I was/would be
>> > able to!
>> >
>> > Pietro
>> > _______________________________________________
>> > Pandas-dev mailing list
>> > Pandas-dev at python.org
>> > https://mail.python.org/mailman/listinfo/pandas-dev