[Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)

Ryan Bressler ryan at theoremlp.com
Tue Dec 12 18:00:39 EST 2017


I posted some brief feedback on the issue tracker and Joris asked me to
weigh in here with our experience.

First off, some numbers. We maintain about 30k lines of scientific Python
with a team of ~6 (and growing) researchers and engineers. I've just
started to audit the code base for this issue, but a quick grep reveals
about 170 invocations of "sum", though some of those are numpy (more on
that in a second).

I recently tried to upgrade to pandas 0.21 and a large number of our unit
tests failed. For now we'll stay at 0.20, but this incident is also causing
us to discuss limiting the use of pandas in our code base.

We are in the financial industry, and a lot of these invocations sum
monetary amounts where pd.Series([]).sum() == 0 makes sense and may even be
a common occurrence, especially when aggregating via groupby or similar.
That is, questions like "how many total dollars of apples were sold on
Tuesday" are common and often have the answer 0.

However, the less domain-specific and perhaps more insidious way this
breaks our code is that we use a mix of pandas and numpy. We tend to use
pandas for dealing with mixed data types and for prototyping, and then use
pure numpy in areas where we care about speed or need to interface with
scikit-learn etc. This change means that pandas and numpy collections have
very similar interfaces but very different behavior.
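To make the mismatch concrete, here is a minimal sketch (the variable names
are just for illustration). The numpy results are fixed; what pandas returns
depends on the installed version, so those values are printed rather than
assumed.

```python
import numpy as np
import pandas as pd

empty = np.array([], dtype=float)
all_nan = np.array([np.nan, np.nan])

# numpy treats the empty sum as the additive identity.
numpy_empty_sum = empty.sum()              # 0.0
# A plain numpy sum propagates NaN; nansum skips NaN and gives 0.0.
numpy_all_nan_sum = all_nan.sum()          # nan
numpy_all_nan_nansum = np.nansum(all_nan)  # 0.0

# pandas skips NaN by default; the result for empty or all-NA input is
# version dependent, so print it instead of hard-coding an expectation.
print(pd.Series(empty).sum(), pd.Series(all_nan).sum())
```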

Further, we have this nasty behavior:

>>> np.sum(pd.Series([]))
nan

At first glance there isn't really a clean or consistent way for us to deal
with this. If it isn't reverted we're in for a lot of careful auditing and
special casing. For many sections of code it may be simplest to just
eliminate pandas use.
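As a rough sketch of what that special casing might look like, assuming 0
really is the right answer for an empty or all-NA sum (true for our
monetary totals, not in general): sum_na_as_zero is a hypothetical helper,
not something in pandas or numpy. It bypasses the pandas reduction and lets
np.nansum do the work.

```python
import numpy as np
import pandas as pd


def sum_na_as_zero(values):
    """Return the sum of values, with empty or all-NA input giving 0.0.

    Hypothetical helper for illustration only.
    """
    # Convert to a float ndarray so pandas' own sum is never invoked.
    arr = np.asarray(values, dtype=float)
    # np.nansum skips NaN and returns 0.0 when nothing is left to add.
    return float(np.nansum(arr))


empty_total = sum_na_as_zero(pd.Series([], dtype=float))
all_na_total = sum_na_as_zero(pd.Series([np.nan, np.nan]))
mixed_total = sum_na_as_zero(pd.Series([1.5, np.nan, 2.5]))
```

The catch is that this erases the distinction between "no data" and "a
total of zero", so every call site still has to be audited to decide
whether that is acceptable.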

We are quite strict about dependency management which will allow us to
avoid the problematic versions. However, having worked in academic
research previously, I'd encourage you all to minimize headaches for
downstream package maintainers / users by minimizing the number of releases
with this inconsistent behavior.

Thanks for reading and for all your hard work,
Ryan Bressler
Theorem LP