[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)

Chris Bartak cbartak at gmail.com
Mon Dec 4 17:14:04 EST 2017


Here's a brief 'dissenting opinion' in favor of option #2.  To be clear, I'm
not really trying to convince anyone, and I am OK with reverting to option #1,
but here's the rationale.

I came to pandas from more of a SQL/BI/Excel/etc. background rather than a
scientific computing one.  I think two things (biases) came along with
this:
  1)  Most of what I did with pandas involved externally generated
data, which was generally 'messy'.
  2)  My core abstraction / unit of thought was *entire columns*.  A column
was not a collection of scalar values, or an ndarray wrapper, etc.; it
was generally the lowest-level thing I worked with.

From that point of view, option #2, though at some level inconsistent, is
actually convenient.

Missing data *within* a column is normal and generally expected from
whatever I'm parsing, so it's nice that aggregations just work.

An *entirely missing* column is exceptional - I'm happy that information
propagates through aggregations and lets me know something is likely wrong.
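
To make the distinction concrete, here is a minimal sketch of the two
behaviors as I understand them. The helpers sum_option1 and sum_option2 are
hypothetical names made up for illustration, not pandas API, and the column
names are invented. Both skip missing values; they differ only in what they
return when no valid values remain.

```python
import numpy as np
import pandas as pd


def sum_option1(col):
    """Hypothetical helper: skip NA; empty or all-NA sums to 0."""
    valid = col.dropna()
    # The sum of an empty float array is 0.0, so no special case is needed.
    return float(np.sum(valid.to_numpy(dtype=float)))


def sum_option2(col):
    """Hypothetical helper: skip NA; empty or all-NA sums to NA."""
    valid = col.dropna()
    if len(valid) == 0:
        # Nothing valid to add up: propagate "missing" to the caller.
        return np.nan
    return float(np.sum(valid.to_numpy(dtype=float)))


df = pd.DataFrame({
    "all_missing": [np.nan, np.nan],   # exceptional: nothing was parsed
    "partly_missing": [np.nan, 0.0],   # normal: some values are missing
})

for name in df.columns:
    print(name, sum_option1(df[name]), sum_option2(df[name]))
# all_missing 0.0 nan
# partly_missing 0.0 0.0
```

Under the option #2 sketch the entirely missing column stays visible as NaN
after aggregation, while the partly missing one still sums normally; under
the option #1 sketch both come back as 0.0.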

On Mon, Dec 4, 2017 at 12:27 PM, Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

>
>
> On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>
>> > We have been discussing this amongst the pandas core developers for
>> some time, and the general consensus is to adopt Option 1 (sum of
>> all-NA or empty is 0) as the behavior for sum with skipna=True.
>>
>> Actually, no there has not been general consensus among the core
>> developers.
>>
>
> I think that's the preference of the majority though.
>
>
>> Everyone loves to say that
>>
>> s.sum([NA]) == 0 makes a ton of sense, but then you have my simple
>> example
>> from original issue, which Nathaniel did quote and I'll repeat here (with
>> a small modification):
>>
>> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})
>>
>> In [3]: df
>> Out[3]:
>>     A    B
>> 0 NaN  NaN
>> 1 NaN  0.0
>>
>> In [4]: df.sum()
>> Out[4]:
>> A    NaN
>> B    0.0
>> dtype: float64
>>
>>
>> Option 1 is de-facto making [4] have A AND B == 0.0. This loses the fact
>> that you have 0
>> present in B. If you conflate these, you then have a situation where I do
>> not
>> know that I had a valid value in B.
>>
>> Option 2 (and 3) for that matter preserves [4]. This DOES NOT lose
>> information. No argument has been presented at all why this should not
>> hold.
>>
>> From [4] it follows that sum([NA]) must be NA.
>>
>
> Extending that slightly:
>
>
> In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C':
> [0, 0]})
>
> In [5]: df.sum()
> Out[5]:
> A    NaN
> B    0.0
> C    0.0
> dtype: float64
>
> This is why I don't think the "preserving information" argument is
> correct. Taking "Preserving information"
> to its logical conclusion would return NaN for "B", since that
> distinguishes between the sum of all
> valid and the sum with some NaNs.
>
>> I am indifferent whether sum([]) == 0 or NA. Though I would argue that NA
>> is more consistent with
>> the rest of pandas (IOW *every* other operation on an empty Series
>> returns NA).
>>
>> > * We should prepare a 0.21.1 release in short order with Option 1
>> implemented for sum() (always 0 for empty/all-null) and prod() (1,
>> respectively)
>>
>> I can certainly understand pandas reverting back to the de-facto state of
>> affairs prior
>> to 0.21.0, which would be option 3, but a radical change on a minor
>> release is
>> not warranted at all. Frankly, we only have (and are likely to get) even
>> a small
>> fraction of users opinions on this whole matter.
>>
>
> Yeah, agreed that bumping to 0.22 is for the best.
>
>
>> Jeff
>>
>>
>> On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney <wesmckinn at gmail.com>
>> wrote:
>>
>>> We have been discussing this amongst the pandas core developers for
>>> some time, and the general consensus is to adopt Option 1 (sum of
>>> all-NA or empty is 0) as the behavior for sum with skipna=True.
>>>
>>> In a groupby setting, and with categorical group keys, the issue
>>> becomes a bit more nuanced -- if you group by a categorical, and one
>>> of the categories is not observed at all in the dataset, e.g:
>>>
>>> s.groupby(some_categorical).sum()
>>>
>>> This change will necessarily yield a Series containing no nulls -- so
>>> if there is a category containing no data, then the sum for that
>>> category is 0.
>>>
>>> For the sake of algebraic completeness, I believe we should introduce
>>> a new aggregation method that performs Option 2 (equivalent to what
>>> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
>>> yields NA.
>>>
>>> So the TL;DR is:
>>>
>>> * We should prepare a 0.21.1 release in short order with Option 1
>>> implemented for sum() (always 0 for empty/all-null) and prod() (1,
>>> respectively)
>>> * Add a new method for Option 2, either in 0.21.1 or in a later release
>>>
>>> We should probably alert the long GitHub thread that this discussion
>>> is taking place before we cut the release. Since GitHub comments can
>>> be permanently deleted at any time, I think it's better for
>>> discussions about significant issues like this to take place on the
>>> permanent public record.
>>>
>>> Thanks
>>> Wes
>>>
>>> On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston <ml at pietrobattiston.it>
>>> wrote:
>>> > Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto:
>>> >> [...]
>>> >
>>> > I think Nathaniel just expressed my thoughts better than I was/would be
>>> > able to!
>>> >
>>> > Pietro
>>> > _______________________________________________
>>> > Pandas-dev mailing list
>>> > Pandas-dev at python.org
>>> > https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>
>>
>>
>>
>
>
>