[Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)

Sat Dec 2 20:53:34 EST 2017

Ok, thanks for the clarification. I missed the fact that these proposals
all assume skipna=True.

If we stick with option 2, should Series([]).sum(skipna=False) also equal
NaN? This seems to be the behavior of version 0.21, but it is no longer
consistent with the non-NaN-skipping version of sum in NumPy/MATLAB (which
returns 0 for empty input).

-Jon

On Sat, Dec 2, 2017 at 7:46 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

> We have no plans to change the default value of skipna. All of these
> proposals concern the behavior of skipna=True.
>
> skipna=False is what corresponds to sum in NumPy/R/Matlab, and pandas is
> already fully consistent there. I don't see consistency between pandas with
> skipna=True and the non-NaN-skipping sum in these other languages as
> relevant or desirable.
>
> On Sat, Dec 2, 2017 at 6:26 AM Jon Mease <jon.mease at gmail.com> wrote:
>
>> Hi all,
>>      I'd just like to chime in to say that I find option 3 to be the most
>> intuitive. Also, I just checked and option 3 is the behavior of
>> MATLAB/Octave with the default sum function (not nansum).  Below is a
>> console session demonstrating this behavior from
>> https://octave-online.net/
>>
>> octave:7> sum([])
>> ans = 0
>> octave:8> sum([nan])
>> ans = NaN
>> octave:9> sum([nan, 0])
>> ans = NaN
>>
>>
>> Also, to echo Pietro's comment on prod, I find the behavior analogous to
>> option 3 to be intuitive as well.  This is also the behavior of
>> MATLAB/Octave (see below)
>>
>> octave:10> prod([])
>> ans = 1
>> octave:11> prod([nan])
>> ans = NaN
>> octave:12> prod([nan, 0])
>> ans = NaN
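
For comparison, Python's own built-in reductions follow the same convention
as the Octave sessions above: an empty sum is the additive identity, an empty
product is the multiplicative identity, and a NaN propagates when nothing
skips it. A minimal sketch using only the standard library (math.prod
requires Python 3.8 or later; the variable names are illustrative only):

```python
import math

nan = float("nan")

# Empty reductions return the identity element of the operation.
empty_sum = sum([])
empty_prod = math.prod([])

# With no NaN skipping, a single NaN propagates to the result.
nan_sum = sum([nan])
nan_zero_sum = sum([nan, 0])
nan_prod = math.prod([nan])
nan_zero_prod = math.prod([nan, 0])

print("empty:", empty_sum, empty_prod)
print("with NaN:", nan_sum, nan_zero_sum, nan_prod, nan_zero_prod)
```

This says nothing about what pandas should do with skipna=True; it only
shows that the non-skipping convention is shared outside MATLAB/Octave.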
>>
>>
>> If this behavior were the default, then option 1 behavior (and
>> MATLAB/Octave nansum behavior) could be attained by setting skipna=True.
>> I would personally prefer for skipna to default to False so that
>> ignoring null values is a conscious choice.
>>
>> Thanks for the consideration,
>>
>> -Jon
>>
>> On Thu, Nov 30, 2017 at 8:09 PM, Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> *[Note for those reading it on the pydata mailing list, please answer to
>>> pandas-dev at python.org <pandas-dev at python.org> to keep discussion
>>> centralised there]*
>>>
>>>
>>> Hi list,
>>>
>>> In pandas 0.21.0 we changed the behaviour of the sum method for empty or
>>> all-NaN Series (to consistently return NaN), see the what's new note
>>> <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>.
>>> This change led to some discussion on github about whether this was the
>>> right choice.
>>>
>>> But the reach of github is of course limited, and therefore we wanted to
>>> solicit some more feedback on the mailing list. Below is an overview
>>> of the background of the issue and the different options.
>>>
>>> Please keep in mind that we are not really interested in theoretical
>>> reasons why one or the other option is better or more correct. Each of the
>>> options has its advantages / disadvantages in practice. But it would be very
>>> interesting to hear the consequences in actual example analysis pipelines.
>>>
>>> Best,
>>> Joris
>>>
>>> Background
>>>
>>> Before pandas 0.21.0, the behaviour of the sum of an all-NA Series
>>> depended on whether the optional bottleneck dependency was installed. This
>>> inconsistency was in place since the bottleneck 1.0.0 release (February
>>> 2015), and you can read more background on it in the github issue #9422
>>> <https://github.com/pandas-dev/pandas/issues/9422>. With bottleneck,
>>> the sum of all-NA was zero; without bottleneck, the sum was NaN.
>>>
>>> In [2]: pd.__version__
>>>
>>> Out[2]: '0.20.3'
>>>
>>> In [3]: pd.options.compute.use_bottleneck = True
>>>
>>> In [4]: Series([np.nan]).sum()
>>>
>>> Out[4]: 0.0
>>>
>>> In [5]: pd.options.compute.use_bottleneck = False
>>>
>>> In [6]: Series([np.nan]).sum()
>>>
>>> Out[6]: nan
>>>
>>> The sum of an empty series was always 0, with or without bottleneck.
>>>
>>> In [7]: Series([]).sum()
>>>
>>> Out[7]: 0
>>>
>>> For pandas 0.21, we wanted to fix this inconsistency. The return value
>>> should not depend on whether an optional dependency is installed. After a
>>> lengthy discussion, we opted for the original pandas behaviour to return
>>> NaN. As a result, the sum of an empty Series was also changed to return NaN
>>> (see the what’s new notice here
>>> <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>
>>> ):
>>>
>>> In [2]: pd.__version__
>>>
>>> Out[2]: '0.21.0'
>>>
>>> In [3]: pd.Series([np.nan]).sum()
>>>
>>> Out[3]: nan
>>>
>>> In [4]: pd.Series([]).sum()
>>>
>>> Out[4]: nan
>>>
>>> However, after the 0.21.0 release more feedback was received about cases
>>> where this choice is not desirable, and due to this feedback, we are
>>> reconsidering the decision.
>>>
>>> Options
>>>
>>> We see three different options for the default behaviour of sum for
>>> those two cases of empty and all-NA series:
>>>
>>>
>>>    1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
>>>
>>>       - Behaviour of pandas < 0.21 + bottleneck installed
>>>       - Consistent with NumPy, R, MATLAB, etc. (given you use the variant
>>>         that is NA aware: nansum for numpy, na.rm=TRUE for R, ...)
>>>
>>>    2. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
>>>
>>>       - The behaviour that is introduced in 0.21.0
>>>       - Consistent with SQL (although often (rightly or not) complained
>>>         about)
>>>
>>>    3. Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
>>>
>>>       - Behaviour of pandas < 0.21 (without bottleneck installed)
>>>       - A practicable compromise (having SUM([NA]) keep the information of
>>>         NA, while SUM([]) = 0 does not introduce NAs when there were none in
>>>         the data)
>>>       - But somewhat inconsistent and unique to pandas?
>>>
>>> We have to stress that each of those choices can be preferable depending
>>> on the use case and has its advantages and disadvantages. Some might be
>>> more mathematically sound, others might preserve more information about
>>> having missing data, each can be more consistent with a certain
>>> ecosystem, … It is clear that there is no ‘best’ option for all cases.
>>>
>>> While we can only choose one of those options as the default behaviour,
>>> each choice can be accompanied by new features that can make it easier for
>>> the user to opt for a different behaviour:
>>>
>>>
>>>    - When choosing option 1 or 2, we can introduce a new method (eg
>>>      .total()) or a keyword to .sum() (eg min_count) to obtain the other
>>>      behaviour.
>>>    - When choosing option 2, we could provide a pd.zeroifna(..) to be
>>>      able to convert NaN values from aggregation results into zeros if desired
>>>      (similar to COALESCE(expr, 0) in SQL)
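
A rough pure-Python sketch of how these two opt-in helpers might behave. The
semantics are an assumption on my part: the proposal above only names a
min_count keyword and a zeroifna function, so sum_min_count and zero_if_na
below are hypothetical stand-ins, with min_count read as "the minimum number
of non-NA values required for a non-NA result".

```python
import math

NA = float("nan")  # stand-in for a missing value


def sum_min_count(values, min_count=0):
    """Sum the non-NA values; give NA if fewer than min_count are present.

    Hypothetical model of a min_count keyword, not the pandas implementation.
    """
    valid = [v for v in values if not math.isnan(v)]
    if len(valid) < min_count:
        return NA
    return sum(valid)


def zero_if_na(result):
    """Turn an NA aggregation result into zero, like COALESCE(expr, 0)."""
    return 0 if math.isnan(result) else result


# min_count=0 reproduces option 1, min_count=1 reproduces option 2.
print(sum_min_count([NA], min_count=0), sum_min_count([NA], min_count=1))

# Starting from an option 2 result, zero_if_na recovers the option 1 result.
print(zero_if_na(sum_min_count([], min_count=1)))
```

Under this reading a single keyword is enough to switch between options 1
and 2, whichever of the two ends up being the default.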
>>>
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>
>