[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Sat Dec 9 07:10:44 EST 2017

On Fri, Dec 8, 2017 at 6:41 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

> On Fri, Dec 8, 2017 at 4:17 PM Jeff Reback <jreback at yahoo.com> wrote:
>
>> From Stephan Hoyer <shoyer at gmail.com>
>>
>> > Yes, in most cases. But this isn't what skipna=True does, which is
>> > explicitly an indication to skip NaNs.
>>
>> Here's where we differ. skipna=True does not mean, let's remove the
>> NaNs and then compute the operation; rather, it means, ignore the NaNs
>> in computing the operation. These are distinct, and the crux of NaN
>> propagation. This is simply a practical view of things.
>>
>
> I think "skipping" vs "ignore in the calculation" is too subtle a
> distinction to insist on users understanding from a docstring/argument name.
>
>> Sure, one could always mask the NaNs themselves and do anything, but again
>> I WILL belabor the point. Pandas is meant to be obvious and sensible.
>>
>
> If nothing else, this debate should make it very clear that there is no
> single "obvious and sensible" answer to how an empty or all-null sum
> should work. If it would help, I volunteer to survey my Twitter followers
> about which behavior they think is obvious ;).
>
> The best we can do is consider various use cases and clearly explain our
> reasoning/decision, with the recognition that it is not possible to satisfy
> everyone.
>
>> Finally, we have a very, very limited response of users / developers here
>> (in this thread). I could be completely wrong, but I suspect many users
>> have been *relatively* happy with pandas' choices over the years.
>>
>
> Rather I would say that most users probably don't actually care about this
> debate either way. This is edge case behavior that doesn't come up every day.
>

Agreed. Let's just emit a warning on all-NA or empty sums and *then* we'll
start hearing from people :) (that's a joke in case it wasn't clear). The
fact that we lived with differing behavior based on bottleneck for so long
is evidence for this not mattering too much.
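
For concreteness, here is a minimal sketch of the two candidate results from
the subject line (0 or NA). It assumes a pandas release in which sum()
accepts a min_count argument; that parameter is not something mentioned in
this message, and the variable names are purely illustrative.

```python
import numpy as np
import pandas as pd

# The two edge cases under discussion: an empty Series and an all-NA Series.
empty = pd.Series([], dtype="float64")
all_na = pd.Series([np.nan, np.nan])

# min_count=0: no valid values are required, so both sums come out as 0.0.
zero_results = (empty.sum(min_count=0), all_na.sum(min_count=0))

# min_count=1: at least one valid value is required, so both sums are NaN.
na_results = (empty.sum(min_count=1), all_na.sum(min_count=1))

# skipna=False does not skip anything, so the all-NA sum propagates NaN.
propagated = all_na.sum(skipna=False)

print(zero_results, na_results, propagated)
```

Either result can be defended; the sketch only shows that the choice can be
exposed to the user rather than fixed by the library.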

Thoughts, Jeff? I'm trying to gauge where you're at and what the points of
disagreement are, as you seem to be pretty strongly against option 1, and I
don't think this should go forward when we're this split on the issue.

Do you agree that there isn't an obviously correct solution? That any option
is valid, and it's a matter of picking good defaults, providing options, and
documenting things well? Statements like "In any actual real world
calculation this is a disaster and the worst possible scenario." make me
think you're strongly -1 on option 1.

Tom

>
> Cheers,
> Stephan
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
>