[Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)

Joris Van den Bossche jorisvandenbossche at gmail.com
Thu Nov 30 20:09:10 EST 2017


*[Note for those reading this on the pydata mailing list: please reply to
pandas-dev at python.org to keep the discussion centralised there]*


Hi list,

In pandas 0.21.0 we changed the behaviour of the sum method for empty or
all-NaN Series (to consistently return NaN), see the what's new note
<http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>.
This change led to some discussion on github about whether it was the right
choice.

But the reach of github is of course limited, and therefore we wanted to
solicit some more feedback on the mailing list. Below is an overview of the
background of the issue and the different options.

Please keep in mind that we are not really interested in theoretical
reasons why one or the other option is better or more correct. Each of the
options has its advantages and disadvantages in practice. But it would be
very interesting to hear about the consequences in actual example analysis
pipelines.

Best,
Joris

Background

Before pandas 0.21.0, the behaviour of the sum of an all-NA Series depended
on whether the optional bottleneck dependency was installed. This
inconsistency had been in place since the bottleneck 1.0.0 release (February
2015), and you can read more background on it in the github issue #9422
<https://github.com/pandas-dev/pandas/issues/9422>. With bottleneck, the
sum of all-NA was zero; without bottleneck, the sum was NaN.

In [2]: pd.__version__
Out[2]: '0.20.3'

In [3]: pd.options.compute.use_bottleneck = True

In [4]: Series([np.nan]).sum()
Out[4]: 0.0

In [5]: pd.options.compute.use_bottleneck = False

In [6]: Series([np.nan]).sum()
Out[6]: nan

The sum of an empty series was always 0, with or without bottleneck.

In [7]: Series([]).sum()
Out[7]: 0

For pandas 0.21, we wanted to fix this inconsistency. The return value
should not depend on whether an optional dependency is installed. After a
lengthy discussion, we opted for the original pandas behaviour of returning
NaN. As a result, the sum of an empty Series was also changed to return NaN
(see the what’s new notice here
<http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>
):

In [2]: pd.__version__
Out[2]: '0.21.0'

In [3]: pd.Series([np.nan]).sum()
Out[3]: nan

In [4]: pd.Series([]).sum()
Out[4]: nan

However, after the 0.21.0 release more feedback was received about cases
where this choice is not desirable, and due to this feedback, we are
reconsidering the decision.

Options

We see three different options for the default behaviour of sum for those
two cases of empty and all-NA series:


   1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0

      - Behaviour of pandas < 0.21 + bottleneck installed
      - Consistent with NumPy, R, MATLAB, etc. (given you use the variant that
        is NA aware: nansum for numpy, na.rm=TRUE for R, ...)

   2. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA

      - The behaviour that is introduced in 0.21.0
      - Consistent with SQL (although often (rightly or not) complained about)

   3. Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA

      - Behaviour of pandas < 0.21 (without bottleneck installed)
      - A practicable compromise (SUM([NA]) keeps the information of NA,
        while SUM([]) = 0 does not introduce NAs when there were none in the
        data)
      - But somewhat inconsistent and unique to pandas?

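To make the three options concrete, here is a minimal pure-Python sketch of
each rule. It is only an illustration and not the pandas implementation: the
function names sum_option1/2/3 and the NA stand-in are made up for this
example, and a float NaN is assumed to represent a missing value.

```python
import math

NA = float("nan")  # assumption: NaN stands in for a missing value


def _valid(values):
    """Return only the values that are not NA."""
    return [v for v in values if not math.isnan(v)]


def sum_option1(values):
    """Option 1: skip NA; empty and all-NA input both give 0."""
    return sum(_valid(values))


def sum_option2(values):
    """Option 2: skip NA; empty and all-NA input both give NA."""
    valid = _valid(values)
    return sum(valid) if valid else NA


def sum_option3(values):
    """Option 3: empty input gives 0, all-NA input gives NA."""
    if not values:
        return 0
    valid = _valid(values)
    return sum(valid) if valid else NA


# Compare the three rules on the two contested inputs and a mixed one.
for name, func in [("1", sum_option1), ("2", sum_option2), ("3", sum_option3)]:
    print("option", name, func([]), func([NA]), func([1.0, NA]))
```

All three rules agree as soon as at least one non-NA value is present; they
differ only on the empty and all-NA inputs.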

We have to stress that each of those choices can be preferable depending on
the use case and has its advantages and disadvantages. Some might be more
mathematically sound, others might preserve more information about having
missing data, each can be more consistent with a certain ecosystem, … It
is clear that there is no ‘best’ option for all cases.

While we can only choose one of those options as the default behaviour,
each choice can be accompanied by new features that can make it easier for
the user to opt for a different behaviour:


   - When choosing option 1 or 2, we can introduce a new method (eg .total())
     or a keyword to .sum() (eg min_count) to obtain the other behaviour.
   - When choosing option 2, we could provide a pd.zeroifna(..) to be
     able to convert NaN values from aggregation results into zeros if
     desired (similar to COALESCE(expr, 0) in SQL)
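
As a rough illustration of how such opt-in features could behave, the sketch
below models a min_count-style keyword and a zeroifna-style helper in plain
Python. Both functions are hypothetical: the names, signatures and semantics
are assumptions made for this example, not an existing or agreed pandas API.

```python
import math

NA = float("nan")  # assumption: NaN stands in for a missing value


def sum_min_count(values, min_count=0):
    """Sum the non-NA values; give NA if fewer than min_count are non-NA.

    Hypothetical semantics: min_count=0 reproduces option 1 and
    min_count=1 reproduces option 2.
    """
    valid = [v for v in values if not math.isnan(v)]
    if len(valid) < min_count:
        return NA
    return sum(valid)


def zero_if_na(value):
    """Turn a NA aggregation result into 0, like COALESCE(expr, 0) in SQL."""
    return 0 if math.isnan(value) else value


print(sum_min_count([NA], min_count=0))              # option 1 behaviour
print(sum_min_count([NA], min_count=1))              # option 2 behaviour
print(zero_if_na(sum_min_count([NA], min_count=1)))  # NA converted to 0
```

With a keyword like this, either default could be chosen while the other
behaviour stays one argument away.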