[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Tom Augspurger tom.augspurger88 at gmail.com
Fri Dec 8 11:02:30 EST 2017


On Fri, Dec 8, 2017 at 9:38 AM, Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

>
>
> On Fri, Dec 8, 2017 at 6:19 AM, Jeff Reback <jreback at yahoo.com> wrote:
>
>> Using Tom's example
>>
>> In [1]: df = DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0], 'C': [0, 0]})
>>
>> In [2]: df
>> Out[2]:
>>     A    B  C
>> 0 NaN  NaN  0
>> 1 NaN  0.0  0
>>
>> In [3]: df.sum()
>> Out[3]:
>> A    NaN
>> B    0.0
>> C    0.0
>> dtype: float64
>>
>>
>> Pandas is all about propagating NaNs in a reliable and predictable way.
>> Folks do a series of calculations, preserving NaNs. For example:
>>
>> In [5]: df.sum() + 1
>> Out[5]:
>> A    NaN
>> B    1.0
>> C    1.0
>> dtype: float64
>>
>> makes it perfectly obvious that we have NaN-preserving operations.
>>
>
> We don't always, though. Aggregations explicitly skip NaNs:
>
> In [3]: pd.Series([1, np.nan]).sum()
> Out[3]: 1.0
>
> I don't think "how aggregations handle NA" need be consistent with "how
> binops handle NA".
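>
> (For contrast, a quick illustration of the elementwise side, which does
> propagate:
>
> >>> (pd.Series([1, np.nan]) + 1).tolist()
> [2.0, nan]
>
> so the two already diverge today.)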
>
>
>> Option 1 is essentially:
>>
>> In [4]: df.fillna(0).sum()
>> Out[4]:
>> A    0.0
>> B    0.0
>> C    0.0
>> dtype: float64
>>
>> Using the same operation as [5], but with all-NaN columns summing to 0,
>> we have the situation
>> where we are no longer NaN-preserving. In any actual real-world
>> calculation this is a disaster
>> and the worst possible scenario.
>>
>> In [6]: df.fillna(0).sum() + 1
>> Out[6]:
>> A    1.0
>> B    1.0
>> C    1.0
>> dtype: float64
>>
>>
>> Changing this behavior shakes the core tenets of pandas: suddenly we
>> have a special case
>> where NaN propagation is not important anymore and, worse, you may get
>> wrong answers.
>>
>> We have always consistently allowed reduction operations to return NaN
>> (with the exception of count, which is actually
>> counting non-NaNs).
>>
>> I would argue that the folks who want a guaranteed zero for all-NaN can
>> simply fill first. The reverse operation is simply
>> not possible, nor desired in any actual real-world scenario.
>>
>
> If we pursue option 1, we would add a keyword to make the reverse
> operation possible.
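>
> (Even without a new keyword, the reverse is already expressible with a
> mask; a minimal sketch using only the existing API, for a `df` like the
> one below:
>
> >>> totals = df.sum()                        # option-1 semantics: all-NA -> 0
> >>> totals = totals.where(df.notna().any())  # put NaN back for all-NA columns
>
> so neither direction is out of reach.)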
>
> I think the best analogy here is to `skipna`. The argument "people should
> fill first" applies equally well to people who
> say `skipna` should be False by default, because that propagates NaNs (not
> that anyone *is* arguing that). If we add
> a keyword to sum like `all_na_is_na`, analogous to `skipna`, then
> we have:
>
>
> >>> df = DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0], 'C': [0, 0]})
> >>> df.sum(skipna=True, all_na_is_na=False)  # the default
> A    0.0
> B    0.0
> C    0.0
> dtype: float64
>
> >>> df.sum(skipna=True, all_na_is_na=True)
> A    NaN
> B    0.0
> C    0.0
> dtype: float64
>
> >>> df.sum(skipna=False, all_na_is_na=True)
> A    NaN
> B    NaN
> C    0.0
> dtype: float64
>
> >>> df.sum(skipna=False, all_na_is_na=False)  # ValueError?
>
> So we shouldn't be discussing which one is possible. Both will be; it's a
> matter of choosing the defaults.
>
>
>
>> Pandas is not about strictly mathematical purity, rather about real world
>> utility.
>>
>> As for a decent compromise, option 3 is almost certainly the best option,
>> where we revert sum([]) == NA back to 0.  This would put
>> us back to pre-0.21.0 pandas without bottleneck, likely the biggest
>> installed population. This option
>> would cause the least friction, while maintaining consistency and
>> practicality.
>>
>> Making a radical departure from the status quo (e.g. option 1) should have
>> considered debate and not be 'rushed' in as a
>> quick 'fix' to a supposed problem.
>>
>
> I don't think we're rushing things. I'm not holding out hope for unanimous
> agreement, but at some point we will
> need to do a release. I have a slight preference for getting things done
> sooner, so that 0.21.0 is used by as
> few people as possible. But getting things right for the next release is
> the most important thing.
>

In case email is too low-bandwidth for this discussion (and for how it
affects the next release's naming and timing), I'm
free to do a video chat any time today and post a summary of what we cover
to the mailing list. How about
17:30 UTC (1.5 hours from now)? I'm flexible, though that's 5:30 PM in
Europe, so the sooner the better for them.

Tom

>
>
>> Jeff
>> On Monday, December 4, 2017, 1:27:33 PM EST, Tom Augspurger <
>> tom.augspurger88 at gmail.com> wrote:
>>
>>
>>
>>
>> On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback <jeffreback at gmail.com>
>> wrote:
>>
>> > We have been discussing this amongst the pandas core developers for
>> some time, and the general consensus is to adopt Option 1 (sum of
>> all-NA or empty is 0) as the behavior for sum with skipna=True.
>>
>> Actually, no, there has not been general consensus among the core
>> developers.
>>
>>
>> I think that's the preference of the majority though.
>>
>>
>> Everyone loves to say that
>>
>> s.sum([NA]) == 0 makes a ton of sense, but then you have my simple
>> example
>> from the original issue, which Nathaniel did quote and I'll repeat here (with
>> a small modification):
>>
>> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})
>>
>> In [3]: df
>> Out[3]:
>>     A    B
>> 0 NaN  NaN
>> 1 NaN  0.0
>>
>> In [4]: df.sum()
>> Out[4]:
>> A    NaN
>> B    0.0
>> dtype: float64
>>
>>
>> Option 1 is de facto making [4] have A AND B == 0.0. This loses the fact
>> that you have 0
>> present in B. If you conflate these, you then have a situation where I do
>> not
>> know that I had a valid value in B.
>>
>> Option 2 (and 3, for that matter) preserves [4]. This DOES NOT lose
>> information. No argument has been presented at all for why this should not
>> hold.
>>
>> From [4] it follows that sum([NA]) must be NA.
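>>
>> (Concretely: the scalar and the frame reductions should agree, so
>>
>> >>> df['A'].sum()   # the all-NA column from [4]
>> nan
>>
>> matches df.sum()['A'] above.)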
>>
>>
>> Extending that slightly:
>>
>>
>> In [4]: df = DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0], 'C': [0, 0]})
>>
>> In [5]: df.sum()
>> Out[5]:
>> A    NaN
>> B    0.0
>> C    0.0
>> dtype: float64
>>
>> This is why I don't think the "preserving information" argument is
>> correct. Taking "preserving information"
>> to its logical conclusion would return NaN for "B", since that
>> distinguishes between the sum of all
>> valid values and the sum with some NaNs.
>>
>> I am indifferent as to whether sum([]) == 0 or NA, though I would argue
>> that NA is more consistent with
>> the rest of pandas (IOW *every* other operation on an empty Series
>> returns NA).
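>>
>> (A quick check of that claim on 0.21.x:
>>
>> >>> Series([]).mean(), Series([]).max()
>> (nan, nan)
>>
>> mean, max, min, etc. on an empty Series all return NaN; sum and prod were
>> the exceptions before 0.21.0.)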
>>
>> > * We should prepare a 0.21.1 release in short order with Option 1
>> implemented for sum() (always 0 for empty/all-null) and prod() (1,
>> respectively)
>>
>> I can certainly understand pandas reverting back to the de facto state of
>> affairs prior
>> to 0.21.0, which would be option 3, but a radical change on a minor
>> release is
>> not warranted at all. Frankly, we have (and are likely to get) only a small
>> fraction of users' opinions on this whole matter.
>>
>>
>> Yeah, agreed that bumping to 0.22 is for the best.
>>
>>
>> Jeff
>>
>>
>> On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney <wesmckinn at gmail.com>
>> wrote:
>>
>> We have been discussing this amongst the pandas core developers for
>> some time, and the general consensus is to adopt Option 1 (sum of
>> all-NA or empty is 0) as the behavior for sum with skipna=True.
>>
>> In a groupby setting, and with categorical group keys, the issue
>> becomes a bit more nuanced -- if you group by a categorical, and one
>> of the categories is not observed at all in the dataset, e.g:
>>
>> s.groupby(some_categorical).sum()
>>
>> This change will necessarily yield a Series containing no nulls -- so
>> if there is a category containing no data, then the sum for that
>> category is 0.
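>>
>> A small sketch of that nuance ('b' is an unobserved category; output
>> shown under the proposed Option 1 semantics):
>>
>> >>> s = pd.Series([1.0, 2.0])
>> >>> cat = pd.Categorical(['a', 'a'], categories=['a', 'b'])
>> >>> s.groupby(cat).sum()
>> a    3.0
>> b    0.0
>> dtype: float64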
>>
>> For the sake of algebraic completeness, I believe we should introduce
>> a new aggregation method that performs Option 2 (equivalent to what
>> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
>> yields NA.
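>>
>> (One hypothetical spelling, name and signature not decided, could be a
>> count threshold:
>>
>> >>> s.sum(min_count=1)   # NaN if fewer than 1 valid value
>>
>> but the exact API is part of what's under discussion.)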
>>
>> So the TL;DR is:
>>
>> * We should prepare a 0.21.1 release in short order with Option 1
>> implemented for sum() (always 0 for empty/all-null) and prod() (1,
>> respectively)
>> * Add a new method for Option 2, either in 0.21.1 or in a later release
>>
>> We should probably alert the long GitHub thread that this discussion
>> is taking place before we cut the release. Since GitHub comments can
>> be permanently deleted at any time, I think it's better for
>> discussions about significant issues like this to take place on the
>> permanent public record.
>>
>> Thanks
>> Wes
>>
>> On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston <ml at pietrobattiston.it>
>> wrote:
>> > On Sat, 02/12/2017 at 17:32 -0800, Nathaniel Smith wrote:
>> >> [...]
>> >
>> > I think Nathaniel just expressed my thoughts better than I was/would be
>> > able to!
>> >
>> > Pietro