[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Pietro Battiston me at pietrobattiston.it
Tue Dec 12 19:23:06 EST 2017


On Sun, 10/12/2017 at 16:09 +0000, Jeff Reback via Pandas-dev
wrote:
> > I think "skipping" vs "ignore in the calculation" is too subtle of
> > a distinction to insist on users understanding from a
> > docstring/argument name.
> 
> I agree. When I see skip, I don't assume that we should simply remove
> them and recompute. I understand this is what numpy
> does, but it is NOT what pandas does, nor has ever done. Again this
> would just shock people.

Not only am I not shocked by this possibility, but after reading this
multiple times, I still fail to understand how "ignore in the
calculation" conceptually differs from "skip and then calculate".
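
To make "skip and then calculate" concrete, here is a minimal sketch in
plain Python. The helper name skip_then_sum is hypothetical (it is not a
pandas or numpy function), and treating None and NaN as "missing" is an
assumption of the sketch:

```python
import math


def skip_then_sum(values):
    """Drop missing values (None or NaN), then sum whatever is left."""
    kept = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    # Python's built-in sum of an empty list is 0 (the additive identity).
    return sum(kept)


nan = float("nan")
print(skip_then_sum([1.0, nan, 2.0]))  # 3.0
print(skip_then_sum([nan]))            # 0
print(skip_then_sum([]))               # 0
```

Under this reading, sum([NA]) and sum([]) necessarily agree, since both
reduce to summing an empty collection.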


> I am pushing back on this entire issue because it seems that lots of
> folks are just assuming that, since numpy does it and R does it, it is
> automatically correct. Well, pandas has never completely followed
> semantics just because someone else does it.

While I would not put numpy and R at the same level - most users using
pandas will sooner or later use numpy, while the same might not be true
for R - I agree with your general argument.
However, for me the point is not "we should do what they did". Rather, it
is "if we do something different, either they should be regretting
their decision, or the needs of the users are different... or we are
wrong". Now, from this discussion I understand that it is SQL
developers who are regretting a design decision, and it is not obvious
to me how user expectations should differ between R and pandas (that
is, why R users should dislike a "practical way to view things").

Two more quick points I would like to add:
- all else equal, it is better if the (default, at least) behavior can
be described in fewer words rather than more: this is where mathematical
purity is positively correlated with practicality
- I entirely agree with Stephan when he says that most users have
probably never encountered the edge case we are discussing... at
least if he means sum([NA]), which is indeed pretty rare (despite
having a pretty clear preference on what I would like the behavior to
be if I happened to face it, I admit I might never have taken the sum
of a variable with only missing values).
If we are talking about sum([]), however, this is a different story,
and I'm ready to bet that some previously written code of mine _was_
broken by 0.21.0.
For me this means two things: that on sum([NA]) we will hardly get
much user feedback "out of experience", and that on sum([]), assuming
we agree to revert to the pre-0.21.0 behavior, sooner is better than
later.
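
For reference, the two candidate answers ("0" or "NA") can be written
as one rule with a threshold on the number of non-missing values. This
is only an illustrative sketch in plain Python: the function na_sum and
its min_count parameter are assumptions made here, not a description of
what any pandas release does.

```python
import math


def is_missing(v):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def na_sum(values, min_count=0):
    """Sum the non-missing values; return NaN if fewer than min_count remain."""
    kept = [v for v in values if not is_missing(v)]
    if len(kept) < min_count:
        return math.nan
    return sum(kept)


nan = float("nan")
# The "0" answer: the empty sum is the additive identity.
print(na_sum([]), na_sum([nan]))                            # 0 0
# The "NA" answer: at least one valid value is required.
print(na_sum([], min_count=1), na_sum([nan], min_count=1))  # nan nan
```

With min_count=0 the rule is simply "skip missing values, then sum";
with min_count=1 one extra sentence is needed to describe the result.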

Pietro

