[Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)

Wes McKinney wesmckinn at gmail.com
Sun Dec 10 16:25:53 EST 2017


> remember that your customer is an applied mathematician

Please, please do not use the term "customer" to apply to a user of
pandas. A customer is someone who buys things with money. We are not
receiving money from you and correspondingly do not have the kinds of
obligations that you are suggesting.

> given that your target audience (customers) are (applied) mathematicians

We do not take this as a given.

Thanks
Wes

On Sun, Dec 10, 2017 at 3:55 PM, Sam Steingold <sds at gnu.org> wrote:
> Hi,
>
>> * Joris Van den Bossche [2017-12-01 02:09:10 +0100]:
>>
>> In pandas 0.21.0 we changed the behaviour of the sum method for empty or
>> all-NaN Series (to consistently return NaN), see the what's new note
>> <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#sum-prod-of-all-nan-or-empty-series-dataframes-is-now-consistently-nan>.
>> This change led to some discussion on github about whether we made the
>> right choice.
>
> I am afraid I must disagree with the _framing_ of the question.
> You are talking about "empty or all-NA" series, i.e., series without any
> valid data (i.e., s.isnull().all() is true).
> Instead, the true question is "some-NA" series, i.e., series
> contaminated with invalid/missing data (i.e., s.isnull().any() is true).
>
> If some of your data is missing (== NA/NaN/None is present), you can
> contemplate what to do: ignore the missing records and work with the
> available data or return NA.
>
> However, if there is no missing data (NA/NaN/None), there is _no_
> question of what is the right approach - you just use what you have,
> mathematically.
>
> The only situation where my framing is different from yours is when the
> data set is empty (i.e., the list or series has 0 length), and my point
> here is that, mathematically, there is _NO_ question what the right
> answer is.
>
> NB: I understand and appreciate that math is not your only
> consideration, but, given that your target audience (customers) are
> (applied) mathematicians, you might want to consider our opinion when
> making design decisions that affect us.
>
> So, what is sum([])?  It is 0 because addition is associative:
> sum(list1 + list2) == sum(list1) + sum(list2)
> Since list1 == list1 + [] for any list1, we must have
> sum(list1) == sum(list1 + []) == sum(list1) + sum([])
> thus sum([])==0.
> Therefore pd.concat([s1,s2]).sum() should be the same as s1.sum() +
> s2.sum() for any s1 and s2, and, indeed, in 0.20.3 (but not in 0.21):
> --8<---------------cut here---------------start------------->8---
>>>> pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()
> True
> --8<---------------cut here---------------end--------------->8---
>
> `pd.Series([]).sum()` should be 0 - because math says so.
> Returning anything else violates associativity of addition and is a bug.
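
The concatenation argument above can be checked with plain Python lists. This is only a minimal sketch: it uses the built-in `sum` as a stand-in for `Series.sum`, and the helper name `splits_consistently` is made up for illustration.

```python
def splits_consistently(a, b):
    """Return True if summing the concatenation equals summing the parts."""
    return sum(a + b) == sum(a) + sum(b)


# The built-in sum of an empty list is 0, the additive identity.
assert sum([]) == 0

# The identity holds even when one or both parts are empty.
assert splits_consistently([1, 2, 3], [])
assert splits_consistently([], [])
assert splits_consistently([1, 2], [3, 4])
```

If the empty sum were anything other than 0 (NaN, for instance), the first `splits_consistently` check would fail, because `sum([1, 2, 3]) + nan` is not equal to `sum([1, 2, 3])`.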
>
> Moreover, _all_ known languages/systems do return 0 on empty sums (with
> a prominent exception of SQL -- where Postgres and SQLite say in their
> docs that they implement the behavior required by the standard but,
> since it is obviously wrong, they also offer non-standard functionality
> which does the right thing).
>
> Now, let us step back. The reason an empty set has to sum up to 0 is
> that 0 is the neutral element for addition: 0+x=x for any x.
> This means that for other associative group operations the operation on
> an empty set is the neutral element of that operation, e.g.:
>
> product([]) = 1 because 1*x=x for any x
> max([]) = -inf because max(-inf,x) = x for any x
> min([]) = +inf because min(+inf,x) = x for any x
> (max and min -- only if you can handle infinities consistently
> everywhere, otherwise raising an exception is fine).
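
One way to see the neutral-element argument is to write each reduction as a fold that starts from the operation's identity. The sketch below uses only the Python standard library; `fold` is a hypothetical helper name, not a pandas or NumPy function.

```python
import math
import operator
from functools import reduce


def fold(op, identity, items):
    """Reduce items with op, starting from the identity element of op."""
    return reduce(op, items, identity)


# On an empty input, each fold returns the identity it started from.
assert fold(operator.add, 0, []) == 0
assert fold(operator.mul, 1, []) == 1
assert fold(max, -math.inf, []) == -math.inf
assert fold(min, math.inf, []) == math.inf

# On non-empty input the identity has no effect on the result.
assert fold(operator.add, 0, [1, 2, 3]) == 6
assert fold(max, -math.inf, [3, 7, 2]) == 7
```

Python's own built-ins take the "raise an exception" route for max and min: `max([])` raises `ValueError` unless a `default` is supplied, as in `max([], default=-math.inf)`.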
>
> mean and std stand aside: these are _not_ basic arithmetic operations,
> they are defined based on other operations, and thus:
>
> mean([]) = NA (or, better yet, raises an exception)
> std([])  = NA (or, better yet, raises an exception)
> std([x]) = NA (or, better yet, raises an exception)
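
For comparison, Python's standard `statistics` module takes the "raises an exception" option for these cases. The sketch below only illustrates that behaviour; `raises_statistics_error` is a hypothetical helper, and nothing here describes what pandas itself does.

```python
import statistics


def raises_statistics_error(func, data):
    """Return True if func(data) raises statistics.StatisticsError."""
    try:
        func(data)
    except statistics.StatisticsError:
        return True
    return False


# The mean of no data points is undefined.
assert raises_statistics_error(statistics.mean, [])

# The sample standard deviation needs at least two data points.
assert raises_statistics_error(statistics.stdev, [])
assert raises_statistics_error(statistics.stdev, [1.0])

# With enough data, both are well defined.
assert statistics.mean([1, 2, 3]) == 2
assert not raises_statistics_error(statistics.stdev, [1.0, 2.0])
```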
>
> Again, while I do understand that math is not the only consideration, I
> beg you to remember that your customer is an applied mathematician like
> yours truly and we have certain expectations from the basic math
> operations.
> Please do not surprise us like this! ;-)
> If you do, you will get an endless stream of bug reports that sum([])
> must be 0 no matter how you handle missing data.
>
> Thank you very much for your attention.
>
> PS. ISTR a claim that the Series.sum method is somehow a different beast
> from addition of scalars. Are its authors suggesting that the identity
> Series([1,2,3]).sum() == 1+2+3 is a happy accident, not really
> guaranteed by the contract of the Series class?
>
> --
> Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1504
> http://steingoldpsychology.com http://www.childpsy.net http://camera.org
> http://memri.org https://ffii.org http://islamexposedonline.com
> Selling grief is easier than buying happiness.
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

