[Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)

Sam Steingold sds at gnu.org
Sun Dec 10 15:55:54 EST 2017


Hi,

> * Joris Van den Bossche [2017-12-01 02:09:10 +0100]:
>
> In pandas 0.21.0 we changed the behaviour of the sum method for empty or
> all-NaN Series (to consistently return NaN), see the what's new note
> <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>.
> This change led to some discussion on github about whether we made the
> right choice.

I am afraid I must disagree with the _framing_ of the question.
You are talking about "empty or all-NA" series, i.e., series without any
valid data (i.e., s.isnull().all() is true).
Instead, the true question is "some-NA" series, i.e., series
contaminated with invalid/missing data (i.e., s.isnull().any() is true).

If some of your data is missing (== NA/NaN/None is present), you can
contemplate what to do: ignore the missing records and work with the
available data or return NA.
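
To make the two options concrete, here is a minimal sketch in plain
Python, using float NaN to stand in for a missing record. The helper
name sum_with_policy and its skipna flag are hypothetical, chosen for
illustration only; this is not how pandas implements it.

```python
import math

def sum_with_policy(values, skipna):
    """Sum floats; either skip NaN entries or let them propagate."""
    if skipna:
        # Option 1: ignore the missing records, work with what is left.
        values = [v for v in values if not math.isnan(v)]
    # Option 2 (skipna=False): NaN propagates through ordinary addition.
    return sum(values)

data = [1.0, math.nan, 2.0]
assert sum_with_policy(data, skipna=True) == 3.0
assert math.isnan(sum_with_policy(data, skipna=False))
```

Either choice is defensible here, because the input really does contain
a missing value.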

However, if there is no missing data (NA/NaN/None), there is _no_
question of what is the right approach - you just use what you have,
mathematically.

The only situation where my framing is different from yours is when the
data set is empty (i.e., the list or series has 0 length), and my point
here is that, mathematically, there is _NO_ question what the right
answer is.

NB: I understand and appreciate that math is not your only
consideration, but, given that your target audience (customers) are
(applied) mathematicians, you might want to consider our opinion when
making design decisions that affect us.

So, what is sum([])?  It is 0 because addition is associative:
sum(list1 + list2) == sum(list1) + sum(list2)
Since list1 == list1 + [] for any list1, we must have
sum(list1) == sum(list1 + []) == sum(list1) + sum([])
thus sum([])==0.
Therefore pd.concat([s1,s2]).sum() should be the same as s1.sum() +
s2.sum() for any s1 and s2, and, indeed, in 0.20.3 (but not in 0.21):
--8<---------------cut here---------------start------------->8---
>>> pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()
True
--8<---------------cut here---------------end--------------->8---

`pd.Series([]).sum()` should be 0 - because math says so.
Returning anything else violates associativity of addition and is a bug.
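
The same argument can be checked with plain Python lists, where the
built-in sum already returns 0 for an empty list. This is only a
sketch; the helper name sum_respects_concat is hypothetical.

```python
def sum_respects_concat(list1, list2):
    """Check that sum(list1 + list2) == sum(list1) + sum(list2)."""
    return sum(list1 + list2) == sum(list1) + sum(list2)

# The empty list is the neutral element of list concatenation ...
assert [1, 2, 3] + [] == [1, 2, 3]
# ... so the identity above only holds if sum([]) is the neutral
# element of addition.
assert sum([]) == 0
assert sum_respects_concat([1, 2, 3], [])
assert sum_respects_concat([], [])
assert sum_respects_concat([1, 2], [3, 4, 5])
```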

Moreover, _all_ known languages/systems do return 0 on empty sums (with
a prominent exception of SQL -- where Postgres and SQLite say in their
docs that they implement the behavior required by the standard but,
since it is obviously wrong, they also offer non-standard functionality
which does the right thing).

Now, let us step back. The reason an empty set has to sum up to 0 is
that 0 is the neutral element for addition: 0+x=x for any x.
This means that for any other associative operation that has a neutral
element, the operation on an empty set is that neutral element, e.g.:

product([]) = 1 because 1*x=x for any x
max([]) = -inf because max(-inf,x) = x for any x
min([]) = +inf because min(+inf,x) = x for any x
(max and min -- only if you can handle infinities consistently
everywhere, otherwise raising an exception is fine).
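
A small sketch of the same idea in plain Python: folding a list with an
operation, starting from that operation's neutral element, gives the
neutral element back on empty input. The helper name fold is
hypothetical; functools.reduce, operator and math.inf are from the
standard library.

```python
import math
import operator
from functools import reduce

def fold(op, neutral, values):
    """Reduce values with op, starting from the neutral element of op."""
    return reduce(op, values, neutral)

# Empty input yields the neutral element of each operation.
assert fold(operator.add, 0, []) == 0
assert fold(operator.mul, 1, []) == 1
assert fold(max, -math.inf, []) == -math.inf
assert fold(min, math.inf, []) == math.inf

# Non-empty input is unaffected by the starting value.
assert fold(operator.mul, 1, [2, 3, 4]) == 24
assert fold(max, -math.inf, [2, 3, 4]) == 4

# Python's built-in max takes the "raise an exception" route instead.
try:
    max([])
except ValueError:
    empty_max_raises = True
else:
    empty_max_raises = False
assert empty_max_raises
```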

mean and std stand aside: these are _not_ basic arithmetic operations,
they are defined based on other operations, and thus:

mean([]) = NA (or, better yet, raises an exception)
std([])  = NA (or, better yet, raises an exception)
std([x]) = NA (or, better yet, raises an exception)
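
For comparison, Python's standard statistics module takes the
"raises an exception" route for all three cases. A short sketch (the
helper name raises_statistics_error is hypothetical; stdev here is the
sample standard deviation):

```python
import statistics

def raises_statistics_error(func, data):
    """Return True if func(data) raises statistics.StatisticsError."""
    try:
        func(data)
    except statistics.StatisticsError:
        return True
    return False

assert raises_statistics_error(statistics.mean, [])
assert raises_statistics_error(statistics.stdev, [])
assert raises_statistics_error(statistics.stdev, [1.0])

# With at least two points the sample standard deviation is defined.
assert statistics.stdev([1.0, 3.0]) > 0
```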

Again, while I do understand that math is not the only consideration, I
beg you to remember that your customer is an applied mathematician like
yours truly and we have certain expectations from the basic math
operations.
Please do not surprise us like this! ;-)
If you do, you will get an endless stream of bug reports that sum([])
must be 0 no matter how you handle missing data.

Thank you very much for your attention.

PS. ISTR a claim that the Series.sum method is somehow a different beast
from addition of scalars. Are its authors suggesting that the identity
Series([1,2,3]).sum() == 1+2+3 is a happy accident, not really
guaranteed by the contract of the Series class?

-- 
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1504
http://steingoldpsychology.com http://www.childpsy.net http://camera.org
http://memri.org https://ffii.org http://islamexposedonline.com
Selling grief is easier than buying happiness.

