[Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)

Jon Mease jon.mease at gmail.com
Sat Dec 2 09:30:39 EST 2017


Seems the plain text version of the Octave code I posted has some artifacts
in the archive.  Here is a cleaner version.

octave:7> sum([])
    ans = 0

octave:8> sum([nan])
    ans = NaN

octave:9> sum([nan, 0])
    ans = NaN

octave:10> prod([])
    ans =  1

octave:11> prod([nan])
    ans = NaN

octave:12> prod([nan, 0])
    ans = NaN
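
For comparison, NumPy's default sum/prod behave the same way, while its
nan-aware variants (nansum / nanprod) give the NA-skipping, option 1 style
result. A quick Python session for reference:

In [1]: import numpy as np

In [2]: np.sum([])
Out[2]: 0.0

In [3]: np.sum([np.nan])
Out[3]: nan

In [4]: np.nansum([np.nan])
Out[4]: 0.0

In [5]: np.prod([])
Out[5]: 1.0

In [6]: np.nanprod([np.nan])
Out[6]: 1.0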


On Sat, Dec 2, 2017 at 9:26 AM, Jon Mease <jon.mease at gmail.com> wrote:

> Hi all,
>      I'd just like to chime in to say that I find option 3 to be the most
> intuitive. Also, I just checked and option 3 is the behavior of
> MATLAB/Octave with the default sum function (not nansum).  Below is a
> console session demonstrating this behavior from
> https://octave-online.net/
>
> octave:7> sum([])
> ans = 0
> octave:8> sum([nan])
> ans = NaN
> octave:9> sum([nan, 0])
> ans = NaN
>
>
> Also, to echo Pietro's comment on prod, I find the behavior analogous to
> option 3 to be intuitive as well.  This is also the behavior of
> MATLAB/Octave (see below).
>
> octave:10> prod([])
> ans = 1
> octave:11> prod([nan])
> ans = NaN
> octave:12> prod([nan, 0])
> ans = NaN
>
>
> If this behavior were the default, then option 1 behavior (and
> MATLAB/Octave nansum behavior) could be attained by setting skipna=True.
> I would personally prefer for skipna to default to False so that ignoring
> null values is a conscious choice.
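>
> For reference, this is how the skipna keyword behaves in a pandas session
> today (skipna=True being the current default):
>
> In [1]: import numpy as np; import pandas as pd
>
> In [2]: pd.Series([1.0, np.nan]).sum()              # skipna=True (default)
> Out[2]: 1.0
>
> In [3]: pd.Series([1.0, np.nan]).sum(skipna=False)  # NaN propagates
> Out[3]: nan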
>
> Thanks for the consideration,
>
> -Jon
>
> On Thu, Nov 30, 2017 at 8:09 PM, Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> *[Note for those reading it on the pydata mailing list, please answer to
>> pandas-dev at python.org <pandas-dev at python.org> to keep discussion
>> centralised there]*
>>
>>
>> Hi list,
>>
>> In pandas 0.21.0 we changed the behaviour of the sum method for empty or
>> all-NaN Series (to consistently return NaN); see the what's new note
>> <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>.
>> This change led to some discussion on GitHub about whether this was the
>> right choice.
>>
>> But the reach of GitHub is of course limited, and therefore we wanted to
>> solicit some more feedback on the mailing list. Below is an overview of the
>> background of the issue and the different options.
>>
>> Please keep in mind that we are not really interested in theoretical
>> reasons why one or the other option is better or more correct. Each of the
>> options has its advantages and disadvantages in practice. But it would be
>> very interesting to hear the consequences in actual example analysis
>> pipelines.
>>
>> Best,
>> Joris
>>
>> Background
>>
>> Before pandas 0.21.0, the behaviour of the sum of an all-NA Series
>> depended on whether the optional bottleneck dependency was installed. This
>> inconsistency had been in place since the bottleneck 1.0.0 release (February
>> 2015), and you can read more background on it in GitHub issue #9422
>> <https://github.com/pandas-dev/pandas/issues/9422>. With bottleneck, the
>> sum of all-NA was zero; without bottleneck, the sum was NaN.
>>
>> In [2]: pd.__version__
>>
>> Out[2]: '0.20.3'
>>
>> In [3]: pd.options.compute.use_bottleneck = True
>>
>> In [4]: Series([np.nan]).sum()
>>
>> Out[4]: 0.0
>>
>> In [5]: pd.options.compute.use_bottleneck = False
>>
>> In [6]: Series([np.nan]).sum()
>>
>> Out[6]: nan
>>
>> The sum of an empty series was always 0, with or without bottleneck.
>>
>> In [7]: Series([]).sum()
>>
>> Out[7]: 0
>>
>> For pandas 0.21, we wanted to fix this inconsistency. The return value
>> should not depend on whether an optional dependency is installed. After a
>> lengthy discussion, we opted for the original pandas behaviour of returning
>> NaN. As a result, the sum of an empty Series was also changed to return NaN
>> (see the what’s new notice here
>> <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>
>> ):
>>
>> In [2]: pd.__version__
>>
>> Out[2]: '0.21.0'
>>
>> In [3]: pd.Series([np.nan]).sum()
>>
>> Out[3]: nan
>>
>> In [4]: pd.Series([]).sum()
>>
>> Out[4]: nan
>>
>> However, after the 0.21.0 release we received more feedback about cases
>> where this choice is not desirable, and based on that feedback we are
>> reconsidering the decision.
>>
>> Options
>>
>> We see three different options for the default behaviour of sum for
>> those two cases of empty and all-NA series:
>>
>>
>>    1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
>>
>>       - Behaviour of pandas < 0.21 with bottleneck installed
>>       - Consistent with NumPy, R, MATLAB, etc. (provided you use the
>>         NA-aware variant: nansum for numpy, na.rm=TRUE for R, ...)
>>
>>    2. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
>>
>>       - The behaviour introduced in 0.21.0
>>       - Consistent with SQL (although often, rightly or not, complained about)
>>
>>    3. Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
>>
>>       - Behaviour of pandas < 0.21 without bottleneck installed
>>       - A practicable compromise (SUM([NA]) keeps the information that there
>>         was missing data, while SUM([]) = 0 does not introduce NAs when there
>>         were none in the data)
>>       - But somewhat inconsistent and unique to pandas
>>
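>> To make the three options concrete, here is a small illustrative sketch
>> (hypothetical helper functions that only spell out the semantics on top of
>> NumPy; none of this is existing pandas API):
>>
>> import numpy as np
>>
>> def sum_option1(values):
>>     # Option 1: skip NAs; empty / all-NA sum is 0 (like np.nansum).
>>     return np.nansum(np.asarray(values, dtype=float))
>>
>> def sum_option2(values):
>>     # Option 2: empty or all-NA sum is NA; otherwise NAs are skipped.
>>     arr = np.asarray(values, dtype=float)
>>     if arr.size == 0 or np.isnan(arr).all():
>>         return np.nan
>>     return np.nansum(arr)
>>
>> def sum_option3(values):
>>     # Option 3: empty sum is 0, but an all-NA sum stays NA.
>>     arr = np.asarray(values, dtype=float)
>>     if arr.size == 0:
>>         return 0.0
>>     if np.isnan(arr).all():
>>         return np.nan
>>     return np.nansum(arr)
>>
>> # sum_option1([]) -> 0.0    sum_option1([np.nan]) -> 0.0
>> # sum_option2([]) -> nan    sum_option2([np.nan]) -> nan
>> # sum_option3([]) -> 0.0    sum_option3([np.nan]) -> nan
>>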
>> We have to stress that each of those choices can be preferable depending
>> on the use case, and each has its advantages and disadvantages. Some might
>> be more mathematically sound, others might preserve more information about
>> having missing data, and each can be more consistent with a certain
>> ecosystem, … It is clear that there is no ‘best’ option for all cases.
>>
>> While we can only choose one of those options as the default behaviour,
>> each choice can be accompanied by new features that can make it easier for
>> the user to opt for a different behaviour:
>>
>>
>>    - When choosing option 1 or 2, we can introduce a new method (e.g.
>>      .total()) or a keyword to .sum() (e.g. min_count) to obtain the other
>>      behaviour (a rough sketch of such a keyword is given below).
>>    - When choosing option 2, we could provide a pd.zeroifna(..) to convert
>>      NaN values from aggregation results into zeros if desired (similar to
>>      COALESCE(expr, 0) in SQL).
>>
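>> As a rough illustration of those two ideas, a sketch with purely
>> hypothetical helpers (neither sum_min_count nor zeroifna exists in pandas;
>> they only spell out the proposed semantics):
>>
>> import numpy as np
>> import pandas as pd
>>
>> def sum_min_count(s, min_count=0):
>>     # Hypothetical min_count-style keyword: require at least `min_count`
>>     # non-NA values; otherwise return NaN instead of summing.
>>     if s.count() < min_count:
>>         return np.nan
>>     return s.sum(skipna=True)
>>
>> def zeroifna(value):
>>     # Hypothetical pd.zeroifna-style helper: turn an NA aggregation result
>>     # into 0, similar to COALESCE(expr, 0) in SQL.
>>     return 0 if pd.isnull(value) else value
>>
>> # sum_min_count(pd.Series([1.0, np.nan]), min_count=1)      -> 1.0
>> # sum_min_count(pd.Series([np.nan]), min_count=1)           -> nan
>> # zeroifna(sum_min_count(pd.Series([np.nan]), min_count=1)) -> 0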
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>>
>