From ml at pietrobattiston.it  Sat Dec 2 04:02:34 2017
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Sat, 02 Dec 2017 10:02:34 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID: <1512205354.2389.90.camel@pietrobattiston.it>

Hi Joris,

On Fri, 01/12/2017 at 02.09 +0100, Joris Van den Bossche wrote:
> [...] We see three different options for the default behaviour of sum
> for those two cases of empty and all-NA series:
>
> Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
>
>   Behaviour of pandas < 0.21 + bottleneck installed
>   Consistent with NumPy, R, MATLAB, etc. (given you use the variant
>   that is NA aware: nansum for numpy, na.rm=TRUE for R, ...)
>
> Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
>
>   The behaviour that is introduced in 0.21.0
>   Consistent with SQL (although often (rightly or not) complained
>   about)
>
> Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
>
>   Behaviour of pandas < 0.21 (without bottleneck installed)
>   A practicable compromise (having SUM([NA]) keep the information of
>   NA, while SUM([]) = 0 does not introduce NAs when there were none
>   in the data)
>   But somewhat inconsistent and unique to pandas ...

I'm 100% sure I want sum([]) to return 0, not NA. It's not just more
elegant, it also makes much more sense in my daily workflow (e.g. if I
randomly split into samples, group by, sum, and then sum the results
for each group across samples, I want the same result as if I had just
grouped and summed, without splitting).

I'm also 99% sure I want sum([NA]) to return the same thing that
sum([0, NA]) returns, following the same arguments, and possibly more.

I would probably prefer to have sum([0, NA]) (and sum([NA])) both
return NA (I was initially surprised by this deviation from numpy),
but since you don't mention this, it's apparently not open for
discussion (and I understand, given that it would break a lot of code).

So assuming my interpretation is correct, I am enormously in favour of
your option 1.

While we're talking about this: I guess the same applies to
pd.Series([]).prod(), which should (in my view) return 1?

Pietro

From tom.augspurger88 at gmail.com  Sat Dec 2 08:59:36 2017
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Sat, 2 Dec 2017 07:59:36 -0600
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To: <1512205354.2389.90.camel@pietrobattiston.it>
References: <1512205354.2389.90.camel@pietrobattiston.it>
Message-ID:

On Sat, Dec 2, 2017 at 3:02 AM, Pietro Battiston wrote:

> [...]
> While we're talking about this: I guess the same applies to
> pd.Series([]).prod(), which should (in my view) return 1?

Yes, I think there's agreement that if we go with option 1, we would
also want to make prod behave similarly, with 1 as the unit instead
of 0.

Tom

> Pietro
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jon.mease at gmail.com  Sat Dec 2 09:26:02 2017
From: jon.mease at gmail.com (Jon Mease)
Date: Sat, 2 Dec 2017 09:26:02 -0500
Subject: [Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

Hi all,

I'd just like to chime in to say that I find option 3 to be the most
intuitive. Also, I just checked, and option 3 is the behavior of
MATLAB/Octave with the default sum function (not nansum). Below is a
console session demonstrating this behavior from
https://octave-online.net/

octave:7> sum([])
ans = 0
octave:8> sum([nan])
ans = NaN
octave:9> sum([nan, 0])
ans = NaN

Also, to echo Pietro's comment on prod, I find the behavior analogous
to option 3 to be intuitive as well. This is also the behavior of
MATLAB/Octave (see below):

octave:10> prod([])
ans = 1
octave:11> prod([nan])
ans = NaN
octave:12> prod([nan, 0])
ans = NaN

If this behavior were the default, then option 1 behavior (and
MATLAB/Octave nansum behavior) could be attained by setting
skipna=True. I would personally prefer for skipna to default to False
so that ignoring null values is a conscious choice.

Thanks for the consideration,

-Jon

On Thu, Nov 30, 2017 at 8:09 PM, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> *[Note for those reading it on the pydata mailing list, please answer
> to pandas-dev at python.org to keep discussion centralised there]*
>
> Hi list,
>
> In pandas 0.21.0 we changed the behaviour of the sum method for empty
> or all-NaN Series (to consistently return NaN), see the what's new
> note. This change led to some discussion on github about whether this
> was the right choice.
> But the reach of github is of course limited, and therefore we wanted
> to solicit some more feedback on the mailing list. Below is an
> overview of the background of the issue and the different options.
>
> Please keep in mind that we are not really interested in theoretical
> reasons why one or the other option is better or more correct. Each of
> the options has its advantages and disadvantages in practice. But it
> would be very interesting to hear the consequences in actual example
> analysis pipelines.
>
> Best,
> Joris
>
> Background
>
> Before pandas 0.21.0, the behaviour of the sum of an all-NA Series
> depended on whether the optional bottleneck dependency was installed.
> This inconsistency had been in place since the bottleneck 1.0.0
> release (February 2015), and you can read more background on it in the
> github issue #9422. With bottleneck, the sum of all-NA was zero;
> without bottleneck, the sum was NaN.
>
> In [2]: pd.__version__
> Out[2]: '0.20.3'
>
> In [3]: pd.options.compute.use_bottleneck = True
>
> In [4]: Series([np.nan]).sum()
> Out[4]: 0.0
>
> In [5]: pd.options.compute.use_bottleneck = False
>
> In [6]: Series([np.nan]).sum()
> Out[6]: nan
>
> The sum of an empty series was always 0, with or without bottleneck.
>
> In [7]: Series([]).sum()
> Out[7]: 0
>
> For pandas 0.21, we wanted to fix this inconsistency. The return value
> should not depend on whether an optional dependency is installed.
> After a lengthy discussion, we opted for the original pandas behaviour
> of returning NaN. As a result, the sum of an empty Series was also
> changed to return NaN (see the what's new notice here):
>
> In [2]: pd.__version__
> Out[2]: '0.21.0'
>
> In [3]: pd.Series([np.nan]).sum()
> Out[3]: nan
>
> In [4]: pd.Series([]).sum()
> Out[4]: nan
>
> However, after the 0.21.0 release more feedback was received about
> cases where this choice is not desirable, and due to this feedback,
> we are reconsidering the decision.
>
> Options
>
> We see three different options for the default behaviour of sum for
> those two cases of empty and all-NA series:
>
> 1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0
>
>    - Behaviour of pandas < 0.21 + bottleneck installed
>    - Consistent with NumPy, R, MATLAB, etc. (given you use the variant
>      that is NA aware: nansum for numpy, na.rm=TRUE for R, ...)
>
> 2. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA
>
>    - The behaviour that is introduced in 0.21.0
>    - Consistent with SQL (although often (rightly or not) complained
>      about)
>
> 3. Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA
>
>    - Behaviour of pandas < 0.21 (without bottleneck installed)
>    - A practicable compromise (having SUM([NA]) keep the information
>      of NA, while SUM([]) = 0 does not introduce NAs when there were
>      none in the data)
>    - But somewhat inconsistent and unique to pandas ...
>
> We have to stress that each of those choices can be preferable
> depending on the use case and has its advantages and disadvantages.
> Some might be more mathematically sound, others might preserve more
> information about having missing data, and each can be more consistent
> with a certain ecosystem. It is clear that there is no "best" option
> for all cases.
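(Aside: to put the three options side by side on concrete data, here
is a small sketch. numpy's nansum stands in for option 1's NA-aware
semantics; the option 2 and 3 results in the comments are the
prescriptions listed above, not the output of any single pandas
version.)

import numpy as np
import pandas as pd

empty = pd.Series([], dtype=float)
all_na = pd.Series([np.nan])

# Option 1 (nansum-style): skipping NAs leaves nothing to add, so the
# sum is the additive identity in both cases.
print(np.nansum(empty.values))   # 0.0
print(np.nansum(all_na.values))  # 0.0

# Option 2 (pandas 0.21.0): both cases return NaN.
#   empty.sum()  -> nan
#   all_na.sum() -> nan

# Option 3 (pandas < 0.21 without bottleneck): mixed.
#   empty.sum()  -> 0
#   all_na.sum() -> nan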
> While we can only choose one of those options as the default
> behaviour, each choice can be accompanied by new features that make it
> easier for the user to opt for a different behaviour:
>
> - When choosing option 1 or 2, we can introduce a new method (e.g.
>   .total()) or a keyword to .sum() (e.g. min_count) to obtain the
>   other behaviour.
> - When choosing option 2, we could provide a pd.zeroifna(..) to be
>   able to convert NaN values from aggregation results into zeros if
>   desired (similar to COALESCE(expr, 0) in SQL).
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jon.mease at gmail.com  Sat Dec 2 09:30:39 2017
From: jon.mease at gmail.com (Jon Mease)
Date: Sat, 2 Dec 2017 09:30:39 -0500
Subject: [Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

Seems the plain text version of the Octave code I posted has some
artifacts in the archive. Here is a cleaner version.

octave:7> sum([])
ans = 0

octave:8> sum([nan])
ans = NaN

octave:9> sum([nan, 0])
ans = NaN

octave:10> prod([])
ans = 1

octave:11> prod([nan])
ans = NaN

octave:12> prod([nan, 0])
ans = NaN

On Sat, Dec 2, 2017 at 9:26 AM, Jon Mease wrote:

> [...]
From shoyer at gmail.com  Sat Dec 2 19:46:05 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 03 Dec 2017 00:46:05 +0000
Subject: [Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

We have no plans to change the default value of skipna. All of these
proposals concern the behavior of skipna=True.

skipna=False is what corresponds to sum in NumPy/R/MATLAB, and pandas
is already fully consistent there. I don't see consistency between
pandas with skipna=True and the non-NaN-skipping sum in these other
languages as relevant or desirable.

On Sat, Dec 2, 2017 at 6:26 AM Jon Mease wrote:

> [...]
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev
From njs at pobox.com  Sat Dec 2 20:32:38 2017
From: njs at pobox.com (Nathaniel Smith)
Date: Sat, 2 Dec 2017 17:32:38 -0800
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

On Thu, Nov 30, 2017 at 5:09 PM, Joris Van den Bossche wrote:
> Options
>
> We see three different options for the default behaviour of sum for
> those two cases of empty and all-NA series:
>
> [...]
>
> We have to stress that each of those choices can be preferable
> depending on the use case and has its advantages and disadvantages.
> Some might be more mathematically sound, others might preserve more
> information about having missing data, and each can be more consistent
> with a certain ecosystem. It is clear that there is no "best" option
> for all cases.

I understand you want to try to avoid bikeshedding here, but it's hard
to discuss without any rationales at all :-).

I am baffled by the idea that sum([]) would return NaN. I'm sure there
are some benefits, I just can't think of any. (OK, SQL does it, but SQL
contains all kinds of indefensible things...)

I am baffled by the idea that sum([]) and sum([NaN], skipna=True) would
return different values. I'm sure there are some benefits, I just can't
think of any. Can someone who does understand the trade-offs explain?

The email says that sum([NaN], skipna=True) returning NaN is
"preserving information", and briefly skimming github issue #9422 I see
some arguments that "NA should propagate", but I don't understand why
it's crucial that NaN should propagate when you have [NaN] but that it
shouldn't propagate for [NaN, 1].

I was particularly confused by this comment from Jeff, which I will
paraphrase in a rude way to make my point (click the link for the
original text):

https://github.com/pandas-dev/pandas/issues/9422#issuecomment-169508202

> see the point is pandas [is intentionally designed not to propagate
> nans in sum unless you specifically propagate them], so we basically
> use nansum-like behavior for everything. The issue is if you ONLY have
> nans then numpy says it should be 0, because nansum 'skips' nans.
> But in pandas that is completely misleading and lossy, because nans by
> definition propagate (unless you specifically don't propagate them).

So... pandas chooses not to propagate NaNs by default, and this is
misleading and lossy because NaNs should propagate by default? I
actually agree with this, but this is an argument that skipna=False
should be the default, not that there should be a special case where
NaN propagation gets flipped on and off depending on the values in an
array.
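(Aside: a sketch spelling out the special case in code, using numpy's
two sums as reference semantics -- their behaviour does not depend on
the mix of values. Under pandas 0.21, sum([NaN], skipna=True) follows
the first block while sum([NaN, 1], skipna=True) follows the second.)

import numpy as np

# Propagating sum (the analogue of skipna=False): NaN always wins.
print(np.sum([np.nan]))          # nan
print(np.sum([np.nan, 1.0]))     # nan

# NaN-skipping sum (the analogue of skipna=True): NaN never wins,
# with no special case for all-NaN or empty input.
print(np.nansum([np.nan]))       # 0.0
print(np.nansum([np.nan, 1.0]))  # 1.0
print(np.nansum([]))             # 0.0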
I guess one argument for option 3 could be that skipna=True was a
mistake -- it makes it easier to get *some* result but also increases
the chance of silently getting garbage -- but now we're stuck with it,
and at least option 3 lets us inch towards the skipna=False behavior?

-n

--
Nathaniel J. Smith -- https://vorpus.org

From jon.mease at gmail.com  Sat Dec 2 20:53:34 2017
From: jon.mease at gmail.com (Jon Mease)
Date: Sat, 2 Dec 2017 20:53:34 -0500
Subject: [Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

Ok, thanks for the clarification. I missed the fact that these
proposals all assume skipna=True.

If we stick with option 2, should Series([]).sum(skipna=False) also
equal NaN? This seems to be the behavior of version 0.21, but it is no
longer consistent with the non-NaN-skipping version of sum in
NumPy/MATLAB (where it equals 0).

-Jon

On Sat, Dec 2, 2017 at 7:46 PM, Stephan Hoyer wrote:

> We have no plans to change the default value of skipna. All of these
> proposals concern the behavior of skipna=True.
>
> skipna=False is what corresponds to sum in NumPy/R/MATLAB, and pandas
> is already fully consistent there. I don't see consistency between
> pandas with skipna=True and the non-NaN-skipping sum in these other
> languages as relevant or desirable.
>
> [...]
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From mail at stevesimmons.com  Sun Dec 3 05:34:06 2017
From: mail at stevesimmons.com (Stephen Simmons)
Date: Sun, 3 Dec 2017 11:34:06 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID: <8eec99f0-8d1b-56dd-20ba-1b677cc0f4c7@stevesimmons.com>

Nathaniel Smith wrote:
> I am baffled by the idea that sum([]) would return NaN.

So am I. Here are two cases that leave me confused about what the
intention is.

Case #1 - Summing an empty integer series

Not only does the answer change from 0 to NaN, but the type changes
from int to float. That occurs whether skipna is True or False!

> pd.Series([], dtype=int).sum()
nan
> pd.Series([], dtype=int).sum(skipna=True)
nan
> pd.Series([], dtype=int).sum(skipna=False)
nan

This confused me, so I went back to the docstring and tried it with a
float Series:

> pd.Series.sum?
Signature: pd.Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty,
    the result will be NA
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to
    use everything, then use only numeric data. Not implemented for
    Series.

I would expect skipna being True to mean we don't want NaNs affecting
the sum. So why would we want NaN when the series is empty? In fact,
for an empty series, skipna gives the same NaN output for both
skipna=True and skipna=False:

> pd.Series([], dtype=float).sum(skipna=False)
nan
> pd.Series([], dtype=float).sum(skipna=True)
nan

This looks even more weird in this case:

> pd.Series([0, float('nan')], dtype=float).sum(skipna=True)
0.0    # NaN is skipped, sum is non-NaN. So far so good...

So what happens with different non-empty input?

> pd.Series([float('nan')], dtype=float).sum(skipna=True)
nan    # Skip all NaNs, get empty series to sum, so return NaN???

So if we want to avoid NaNs in our output, the skipna parameter doesn't
help. For every use of sum(), we now need to separately check two
special cases:
- empty input
- input with only NaNs

I can't see how this behaviour helps anyone!

Regards

Stephen

From ml at pietrobattiston.it  Sun Dec 3 11:17:44 2017
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Sun, 03 Dec 2017 17:17:44 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID: <1512317864.2389.102.camel@pietrobattiston.it>

On Sat, 02/12/2017 at 17.32 -0800, Nathaniel Smith wrote:
> [...]

I think Nathaniel just expressed my thoughts better than I was/would be
able to!
Pietro

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com  Mon Dec 4 11:05:19 2017
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 4 Dec 2017 11:05:19 -0500
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To: <1512317864.2389.102.camel@pietrobattiston.it>
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID:

We have been discussing this amongst the pandas core developers for
some time, and the general consensus is to adopt Option 1 (sum of
all-NA or empty is 0) as the behavior for sum with skipna=True.

In a groupby setting, and with categorical group keys, the issue
becomes a bit more nuanced -- if you group by a categorical, and one
of the categories is not observed at all in the dataset, e.g.:

s.groupby(some_categorical).sum()

this change will necessarily yield a Series containing no nulls -- so
if there is a category containing no data, then the sum for that
category is 0.

For the sake of algebraic completeness, I believe we should introduce
a new aggregation method that performs Option 2 (equivalent to what
pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
yields NA.

So the TL;DR is:

* We should prepare a 0.21.1 release in short order with Option 1
  implemented for sum() (always 0 for empty/all-null) and prod() (1,
  respectively)
* Add a new method for Option 2, either in 0.21.1 or in a later release

We should probably alert the long GitHub thread that this discussion
is taking place before we cut the release. Since GitHub comments can
be permanently deleted at any time, I think it's better for
discussions about significant issues like this to take place on the
permanent public record.

Thanks
Wes

On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston wrote:
> [...]

From ml at pietrobattiston.it  Mon Dec 4 12:08:08 2017
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Mon, 04 Dec 2017 18:08:08 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID: <1512407288.2389.114.camel@pietrobattiston.it>

On Mon, 04/12/2017 at 11.05 -0500, Wes McKinney wrote:
> [...]
> In a groupby setting, and with categorical group keys, the issue
> becomes a bit more nuanced -- if you group by a categorical, and one
> of the categories is not observed at all in the dataset, e.g.:
>
> s.groupby(some_categorical).sum()
>
> this change will necessarily yield a Series containing no nulls -- so
> if there is a category containing no data, then the sum for that
> category is 0.
>
> For the sake of algebraic completeness, I believe we should introduce
> a new aggregation method that performs Option 2 (equivalent to what
> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
> yields NA.

If I understand correctly, you have in mind a replacement for groupby
such that obj.REPLACEMENT(a_categorical).sum() will have NaN for
non-observed categories... assuming this is really a necessity,
wouldn't it be better satisfied by an argument to groupby() which
entirely drops unused categories, than by a new method?
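(Aside: a toy sketch of the categorical-grouper situation under
discussion. The names and data are made up, and the commented results
state what options 1 and 2 would prescribe, not what a specific pandas
release prints.)

import numpy as np
import pandas as pd

cat = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])
s = pd.Series([1.0, 2.0, np.nan])

# 'c' is declared but never observed, so it still gets a row in the
# grouped result; 'b' is observed but contains only NaN.
result = s.groupby(cat).sum()
# Option 1: a -> 3.0, b -> 0.0 (all-NA group), c -> 0.0 (empty group)
# Option 2: a -> 3.0, b -> NaN,                c -> NaN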
I understand it's not ideal to add an argument which only applies to
categorical groupers, but even leaving aside the groupby().sum() issue,
this is something which I think many users would appreciate.

Pietro

From wesmckinn at gmail.com  Mon Dec 4 12:12:42 2017
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 4 Dec 2017 12:12:42 -0500
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To: <1512407288.2389.114.camel@pietrobattiston.it>
References: <1512317864.2389.102.camel@pietrobattiston.it> <1512407288.2389.114.camel@pietrobattiston.it>
Message-ID:

> If I understand correctly, you have in mind a replacement for groupby
> such that obj.REPLACEMENT(a_categorical).sum() will have NaN for
> non-observed categories

No, I am proposing to add a new aggregation method (an alternative to
"sum"). So something like

s.groupby(...).total()

or

s.groupby(...).null_sum()

(names are hard)

- Wes

On Mon, Dec 4, 2017 at 12:08 PM, Pietro Battiston wrote:
> [...]

From jeffreback at gmail.com  Mon Dec 4 13:11:50 2017
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 4 Dec 2017 13:11:50 -0500
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID:

> We have been discussing this amongst the pandas core developers for
> some time, and the general consensus is to adopt Option 1 (sum of
> all-NA or empty is 0) as the behavior for sum with skipna=True.

Actually, no, there has not been general consensus among the core
developers.

Everyone loves to say that s.sum([NA]) == 0 makes a ton of sense, but
then you have my simple example from the original issue, which
Nathaniel did quote and I'll repeat here (with a small modification):

In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})

In [3]: df
Out[3]:
     A    B
0  NaN  NaN
1  NaN  0.0

In [4]: df.sum()
Out[4]:
A    NaN
B    0.0
dtype: float64

Option 1 is de facto making [4] have A AND B == 0.0. This loses the
fact that you have 0 present in B. If you conflate these, you then have
a situation where I do not know that I had a valid value in B.

Option 2 (and 3, for that matter) preserves [4]. This DOES NOT lose
information.
No argument has been presented at all why this should not hold.

From [4] it follows that sum([NA]) must be NA.

I am indifferent whether sum([]) == 0 or NA, though I would argue that
NA is more consistent with the rest of pandas (IOW, *every* other
operation on an empty Series returns NA).

> * We should prepare a 0.21.1 release in short order with Option 1
>   implemented for sum() (always 0 for empty/all-null) and prod() (1,
>   respectively)

I can certainly understand pandas reverting back to the de facto state
of affairs prior to 0.21.0, which would be option 3, but a radical
change on a minor release is not warranted at all. Frankly, we only
have (and are likely to get) even a small fraction of users' opinions
on this whole matter.

Jeff

On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney wrote:
> [...]

From tom.augspurger88 at gmail.com  Mon Dec 4 13:27:09 2017
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Mon, 4 Dec 2017 12:27:09 -0600
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID:

On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback wrote:

> > We have been discussing this amongst the pandas core developers for
> > some time, and the general consensus is to adopt Option 1 (sum of
> > all-NA or empty is 0) as the behavior for sum with skipna=True.
>
> Actually, no, there has not been general consensus among the core
> developers.
I think that's the preference of the majority, though.

> [...]
> From [4] it follows that sum([NA]) must be NA.

Extending that slightly:

In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]})

In [5]: df.sum()
Out[5]:
A    NaN
B    0.0
C    0.0
dtype: float64

This is why I don't think the "preserving information" argument is
correct. Taking "preserving information" to its logical conclusion
would return NaN for "B", since that distinguishes between the sum of
all valid values and the sum with some NaNs.

> I am indifferent whether sum([]) == 0 or NA. [...]
>
> I can certainly understand pandas reverting back to the de facto
> state of affairs prior to 0.21.0, which would be option 3, but a
> radical change on a minor release is not warranted at all. [...]

Yeah, agreed that bumping to 0.22 is for the best.

> Jeff
> [...]

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev
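(Aside: one way to keep the information Jeff and Tom are debating
without encoding it in the sum itself is to carry the non-NA count
alongside the aggregate. A small sketch building on Tom's example;
fillna(0) makes it option-agnostic, normalising whatever the default
sum returns.)

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0], 'C': [0, 0]})

counts = df.count()           # A: 0, B: 1, C: 2 -- flags the all-NA column
totals = df.sum().fillna(0)   # option-1-style: A: 0.0, B: 0.0, C: 0.0
# Recover option-2-style output on demand: NA wherever nothing was valid.
totals_or_na = totals.where(counts > 0)   # A: NaN, B: 0.0, C: 0.0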
From cbartak at gmail.com  Mon Dec 4 17:14:04 2017
From: cbartak at gmail.com (Chris Bartak)
Date: Mon, 4 Dec 2017 16:14:04 -0600
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID:

Here's a brief dissenting opinion in favor of option #2. To be clear,
I'm not really trying to convince anyone, and I am OK with reverting to
option #1, but here's the rationale.

I came to pandas from more of a SQL/BI/Excel/etc. background rather
than a scientific computing one. I think there are two things (biases)
that came along with this:

1) The majority of things done with pandas were from externally
generated data, generally 'messy'.
2) The core abstraction / unit of thought was *entire columns*. A
column is not a collection of scalar values, or an ndarray wrapper,
etc. -- it was generally the lowest-level thing I worked with.

From that point of view, option #2, though at some level inconsistent,
is actually convenient. Missing data *within* a column is normal and
generally expected from whatever I'm parsing, so it's nice that
aggregations just work. An *entirely missing* column is exceptional --
I'm happy that information propagates through aggregations and lets me
know something is likely wrong.

On Mon, Dec 4, 2017 at 12:27 PM, Tom Augspurger wrote:

> [...]
>>> > >>> > Pietro >>> > _______________________________________________ >>> > Pandas-dev mailing list >>> > Pandas-dev at python.org >>> > https://mail.python.org/mailman/listinfo/pandas-dev >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Dec 4 17:17:50 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 4 Dec 2017 14:17:50 -0800 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: References: <1512317864.2389.102.camel@pietrobattiston.it> <1512407288.2389.114.camel@pietrobattiston.it> Message-ID: On Mon, Dec 4, 2017 at 9:12 AM, Wes McKinney wrote: >> If I understand correctly, you have in mind a replacement for groupby > such that obj.REPLACEMENT(a_categorical).sum() will have NaN for non- > observed categories > > No, I am proposing to add a new aggregation method (an alternative to > "sum"). So something like > > s.groupby(...).total() > > or > > s.groupby(...).null_sum() > > (names are hard) Another spelling to consider would be something like sum(skipna="if_any_valid") -n -- Nathaniel J. Smith -- https://vorpus.org From wesmckinn at gmail.com Mon Dec 4 20:52:38 2017 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 4 Dec 2017 20:52:38 -0500 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: References: <1512317864.2389.102.camel@pietrobattiston.it> <1512407288.2389.114.camel@pietrobattiston.it> Message-ID: To Jeff's point re this example: In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]}) In [3]: df Out[3]: A B 0 NaN NaN 1 NaN 0.0 In [4]: df.sum() Out[4]: A NaN B 0.0 dtype: float64 By adding a function which behaves in this way, but with a different name, we keep the behavior available to the discerning user for whom this distinction is meaningful. For other users, for whom this is not meaningful, we give df.sum() the same meaning as df.sum().fillna(0). It's hard to predict which choice will cause the most or least harm to users. In either case, we cannot spare our users the expectation of some education about the behavior in the presence of missing (or no) data. My guess is that the all-NA -> 0 behavior does the least harm by default to the average user, because aggregates used in computations like weighted sums will not propagate NaNs. If we need to bump to 0.22.0 to resolve the matter and add the new function for Option 2 (in the event that we make Option 1 the behavior of sum, which is my preference), that seems OK. If there are users that are unsatisfied with the new behavior, we can at least defend ourselves with the example set by NumPy's np.nansum and R's sum with na.rm=T. Having the alternative method available for Option 2 IMHO should be sufficient to satisfy such demanding users. 
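For concreteness, a minimal sketch of the two behaviors side by side
(nothing here is released API; sum_or_na is just a hypothetical name for
the Option 2 aggregation):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0]})

# Option 1 (proposed default): empty / all-NA collapses to the unit, 0;
# behaviorally the same as filling missing values first.
option1 = df.fillna(0).sum()    # A -> 0.0, B -> 0.0

# Option 2 (the separately named aggregation): keep NA when a column has
# not a single valid value. col.count() is the number of non-NA values.
def sum_or_na(col):
    return np.nan if col.count() == 0 else col.sum()

option2 = df.apply(sum_or_na)   # A -> NaN, B -> 0.0

Either answer then stays one method call away, which is the point of
adding the named alternative.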
- Wes On Mon, Dec 4, 2017 at 5:17 PM, Nathaniel Smith wrote: > On Mon, Dec 4, 2017 at 9:12 AM, Wes McKinney wrote: >>> If I understand correctly, you have in mind a replacement for groupby >> such that obj.REPLACEMENT(a_categorical).sum() will have NaN for non- >> observed categories >> >> No, I am proposing to add a new aggregation method (an alternative to >> "sum"). So something like >> >> s.groupby(...).total() >> >> or >> >> s.groupby(...).null_sum() >> >> (names are hard) > > Another spelling to consider would be something like sum(skipna="if_any_valid") > > -n > > -- > Nathaniel J. Smith -- https://vorpus.org From jorisvandenbossche at gmail.com Tue Dec 5 11:46:20 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 5 Dec 2017 17:46:20 +0100 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: I think that could be a possibility, to force people to explicitly specify the axis in select(), but it would still be only in the long term that people can then actually drop this specification if they want to select the columns. But maybe that's not too bad? Other possibility is another name, and then something like select_columns() (or select_labels() if we don't want to have it specifically for columns) is maybe an option? 2017-11-29 1:29 GMT+01:00 Jon Mease : > Perhaps for versions 0.21.1 and 0.22 a warning could be issued when > .select() is used without an explicit `axis` parameter. > > The warning would state that the current default is `axis=0` but that this > will change to `axis=1` in the next major release. If the user wants the > current default behavior then they could suppress the warning and > future-proof their code by passing `axis=0` explicitly. > > -Jon > > On Tue, Nov 28, 2017 at 6:28 PM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Would there be a way in keeping .select() but only deprecating the >> (default) `axis=0` ? Or would that only be more confusing? >> >> Because if we would find a name for such a method that defaults to the >> columns, we would come up with 'select' ... >> >> 2017-11-28 19:58 GMT+01:00 Stephan Hoyer : >> >>> On Tue, Nov 28, 2017 at 6:34 PM Paul Hobson wrote: >>> >>>> Thanks for the info. While .select on the default axis (index) is >>>> indeed very different than SQL, operating on the columns is very similar >>>> (jn my twisted brain at least). >>>> >>> >>> Agreed, but sadly .select() didn't default to axis=1. >>> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jon.mease at gmail.com Tue Dec 5 11:55:15 2017 From: jon.mease at gmail.com (Jon Mease) Date: Tue, 5 Dec 2017 11:55:15 -0500 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: I like the name `.select()`, so I was hoping this might be a path towards changing the default to `axis=1` for 1.0. Would that be too soon for a default value change? Does pandas have any policy or precedent for how long things should be deprecated before being changed? 
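To make the mechanics concrete, a rough sketch of such a deprecation shim
(purely illustrative, not the actual pandas implementation; the sentinel
lets us detect whether the caller passed axis explicitly):

import warnings
import pandas as pd

_no_default = object()   # sentinel: distinguishes "axis not passed"

def select(df, crit, axis=_no_default):
    if axis is _no_default:
        warnings.warn("select() without an explicit 'axis' currently "
                      "defaults to axis=0, but this default may change "
                      "to axis=1; pass axis=0 to keep today's behaviour.",
                      FutureWarning, stacklevel=2)
        axis = 0
    # keep the labels along 'axis' for which the criterion holds
    keep = [label for label in df.axes[axis] if crit(label)]
    return df.reindex(**{'index' if axis == 0 else 'columns': keep})

During the deprecation window, select(df, crit) warns but behaves as
today, while select(df, crit, axis=0) is silent and future-proof.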
-Jon On Tue, Dec 5, 2017 at 11:46 AM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > I think that could be a possibility, to force people to explicitly specify > the axis in select(), but it would still be only in the long term that > people can then actually drop this specification if they want to select the > columns. But maybe that's not too bad? > > Other possibility is another name, and then something like > select_columns() (or select_labels() if we don't want to have it > specifically for columns) is maybe an option? > > > 2017-11-29 1:29 GMT+01:00 Jon Mease : > >> Perhaps for versions 0.21.1 and 0.22 a warning could be issued when >> .select() is used without an explicit `axis` parameter. >> >> The warning would state that the current default is `axis=0` but that >> this will change to `axis=1` in the next major release. If the user wants >> the current default behavior then they could suppress the warning and >> future-proof their code by passing `axis=0` explicitly. >> >> -Jon >> >> On Tue, Nov 28, 2017 at 6:28 PM, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Would there be a way in keeping .select() but only deprecating the >>> (default) `axis=0` ? Or would that only be more confusing? >>> >>> Because if we would find a name for such a method that defaults to the >>> columns, we would come up with 'select' ... >>> >>> 2017-11-28 19:58 GMT+01:00 Stephan Hoyer : >>> >>>> On Tue, Nov 28, 2017 at 6:34 PM Paul Hobson wrote: >>>> >>>>> Thanks for the info. While .select on the default axis (index) is >>>>> indeed very different than SQL, operating on the columns is very similar >>>>> (jn my twisted brain at least). >>>>> >>>> >>>> Agreed, but sadly .select() didn't default to axis=1. >>>> >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Dec 7 10:53:12 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 7 Dec 2017 09:53:12 -0600 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: References: <1512317864.2389.102.camel@pietrobattiston.it> <1512407288.2389.114.camel@pietrobattiston.it> Message-ID: On Mon, Dec 4, 2017 at 7:52 PM, Wes McKinney wrote: > To Jeff's point re this example: > > In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]}) > > In [3]: df > Out[3]: > A B > 0 NaN NaN > 1 NaN 0.0 > > In [4]: df.sum() > Out[4]: > A NaN > B 0.0 > dtype: float64 > > By adding a function which behaves in this way, but with a different > name, we keep the behavior available to the discerning user for whom > this distinction is meaningful. For other users, for whom this is not > meaningful, we give df.sum() the same meaning as df.sum().fillna(0). > > It's hard to predict which choice will cause the most or least harm to > users. In either case, we cannot spare our users the expectation of > some education about the behavior in the presence of missing (or no) > data. My guess is that the all-NA -> 0 behavior does the least harm by > default to the average user, because aggregates used in computations > like weighted sums will not propagate NaNs. 
>
> If we need to bump to 0.22.0 to resolve the matter and add the new
> function for Option 2 (in the event that we make Option 1 the behavior
> of sum, which is my preference), that seems OK. If there are users
> that are unsatisfied with the new behavior, we can at least defend
> ourselves with the example set by NumPy's np.nansum and R's sum with
> na.rm=T. Having the alternative method available for Option 2 IMHO
> should be sufficient to satisfy such demanding users.
>
> - Wes
>
> On Mon, Dec 4, 2017 at 5:17 PM, Nathaniel Smith wrote:
> > On Mon, Dec 4, 2017 at 9:12 AM, Wes McKinney wrote:
> >>> If I understand correctly, you have in mind a replacement for groupby
> >> such that obj.REPLACEMENT(a_categorical).sum() will have NaN for non-
> >> observed categories
> >>
> >> No, I am proposing to add a new aggregation method (an alternative to
> >> "sum"). So something like
> >>
> >> s.groupby(...).total()
> >>
> >> or
> >>
> >> s.groupby(...).null_sum()
> >>
> >> (names are hard)
> >
> > Another spelling to consider would be something like
> > sum(skipna="if_any_valid")
> >
> > -n
> >
> > --
> > Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

In an effort to get things rolling on this, here's an attempt to
summarize.

The majority (not unanimous) preference is for Option 1: Empty / all-NA
sum to 0. SUM([]) = SUM([NA]) = 0. IIUC, Jeff prefers option 2 or 3. Jon
and Chris prefer option 2. Nathaniel prefers option 1.

This means we have two things to sort out before we can make a release:

1. Design and implement option 1 (including the alternative for returning
NA)
2. Decide on the next release's version.

I've opened https://github.com/pandas-dev/pandas/issues/18678 for the
first item, if anyone wants to weigh in there. For the second item, see
https://github.com/pandas-dev/pandas/issues/18244#issuecomment-350000655

Thanks,
Tom

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ml at pietrobattiston.it  Thu Dec  7 15:21:59 2017
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 07 Dec 2017 21:21:59 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To: 
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID: <1512678119.17022.19.camel@pietrobattiston.it>

Il giorno lun, 04/12/2017 alle 13.11 -0500, Jeff Reback ha scritto:
> 
> We have been discussing this amongst the pandas core developers for
> some time, and the general consensus is to adopt Option 1 (sum of
> all-NA or empty is 0) as the behavior for sum with skipna=True.
> 
> Actually, no there has not been general consensus among the core
> developers.
> 
> Everyone loves to say that
> 
> s.sum([NA]) == 0 makes a ton of sense, but then you have my simple
> example
> from original issue, which Nathaniel did quote and I'll repeat here
> (with a small modification):
> 
> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})
> 
> In [3]: df
> Out[3]:
>     A    B
> 0 NaN  NaN
> 1 NaN  0.0
> 
> In [4]: df.sum()
> Out[4]:
> A    NaN
> B    0.0
> dtype: float64
> 
> 
> Option 1 is de-facto making [4] have A AND B == 0.0. This loses the
> fact that you have 0
> present in B. If you conflate these, you then have a situation where
> I do not
> know that I had a valid value in B.
> 
> Option 2 (and 3) for that matter preserves [4]. This DOES NOT lose
> information. No argument has been presented at all why this should
> not hold.

Actually, not losing information (about NaNs - since after all, you
typically know/assume you do have some data) was my main argument for
having skipna=False by default. But I perfectly understand that this is
not open for discussion.

And since it is not open for discussion, this means that when one does
sum([...]), it actually means sum([...], skipna=True).

Now, the bare minimum of consistency I expect from having an option
"skipna" set to True is... to skip NA. In option 3, with "skipna=True",
the presence of NA is _not_ irrelevant for the result.

This does not affect option 2, but unless I'm wrong, consensus for
sum([])=0 is unanimous. Just in case I'm wrong, think about the nightmare
it would be to implement pandas sum() in dask with sum([])=NA. (Not
because we necessarily care about dask internals, or consistency with
dask... it's just an example of the annoying consequences.)

Pietro

From jreback at yahoo.com  Fri Dec  8 07:19:59 2017
From: jreback at yahoo.com (Jeff Reback)
Date: Fri, 8 Dec 2017 12:19:59 +0000 (UTC)
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To: 
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID: <1020301426.981032.1512735599876@mail.yahoo.com>

Using Tom's example

In [1]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]})

In [2]: df
Out[2]:
    A    B  C
0 NaN  NaN  0
1 NaN  0.0  0

In [3]: df.sum()
Out[3]:
A    NaN
B    0.0
C    0.0
dtype: float64

Pandas is all about propagating NaN's in a reliable and predictable way.
Folks do a series of calculations, preserving NaNs. For example:

In [5]: df.sum() + 1
Out[5]:
A    NaN
B    1.0
C    1.0
dtype: float64

makes it perfectly obvious that we have NaN preserving operations.

Option 1 is essentially:

In [4]: df.fillna(0).sum()
Out[4]:
A    0.0
B    0.0
C    0.0
dtype: float64

Using the same operation as [5], but showing all NaN sum to 0, we have the
situation where we are no longer NaN preserving. In any actual real world
calculation this is a disaster and the worst possible scenario.

In [6]: df.fillna(0).sum() + 1
Out[6]:
A    1.0
B    1.0
C    1.0
dtype: float64

Changing this behavior shakes the core tenets of pandas: suddenly we have
a special case where NaN propagation is not important anymore and, worse,
you may get wrong answers.

We have always consistently allowed reduction operations to return NaN
(with the exception of count, which is actually counting non-nans).

I would argue that the folks who want guaranteed zero for all-NaN can
simply fill first. The reverse operation is simply not possible, nor
desired in any actual real world scenario.

Pandas is not about strictly mathematical purity, rather about real world
utility.

As for a decent compromise, option 3 is almost certainly the best option,
where we revert the sum([]) == NA to be 0. This would put us back to
pre-0.21.0 pandas without bottleneck, likely the biggest installed
population. This option would cause the least friction, while maintaining
consistency and practicality.

Making a radical departure from the status quo (e.g. option 1) should have
considered debate and not be 'rushed' in as a quick 'fix' to a supposed
problem.
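For reference, option 3 in code (the pre-0.21.0, no-bottleneck semantics;
the commented results follow from the definitions in this thread, not
from a run of any particular release):

import numpy as np
import pandas as pd

pd.Series([]).sum()           # 0.0 -- empty sum is the unit
pd.Series([np.nan]).sum()     # NaN -- all-NA is preserved, nothing conflated
pd.Series([np.nan, 0]).sum()  # 0.0 -- NA is skipped once a valid value exists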
Jeff

On Monday, December 4, 2017, 1:27:33 PM EST, Tom Augspurger wrote:

On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback wrote:

> We have been discussing this amongst the pandas core developers for some time, and the general consensus is to adopt Option 1 (sum of all-NA or empty is 0) as the behavior for sum with skipna=True.

Actually, no there has not been general consensus among the core
developers.

I think that's the preference of the majority though.

Everyone loves to say that

s.sum([NA]) == 0 makes a ton of sense, but then you have my simple example
from original issue, which Nathaniel did quote and I'll repeat here (with
a small modification):

In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]})

In [3]: df
Out[3]:
    A    B
0 NaN  NaN
1 NaN  0.0

In [4]: df.sum()
Out[4]:
A    NaN
B    0.0
dtype: float64

Option 1 is de-facto making [4] have A AND B == 0.0. This loses the fact
that you have 0 present in B. If you conflate these, you then have a
situation where I do not know that I had a valid value in B.

Option 2 (and 3) for that matter preserves [4]. This DOES NOT lose
information. No argument has been presented at all why this should not
hold.

From [4] it follows that sum([NA]) must be NA.

Extending that slightly:

In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]})

In [5]: df.sum()
Out[5]:
A    NaN
B    0.0
C    0.0
dtype: float64

This is why I don't think the "preserving information" argument is
correct. Taking "Preserving information" to its logical conclusion would
return NaN for "B", since that distinguishes between the sum of all valid
and the sum with some NaNs.

I am indifferent whether sum([]) == 0 or NA. Though I would argue that NA
is more consistent with the rest of pandas (IOW *every* other operation on
an empty Series returns NA).

> * We should prepare a 0.21.1 release in short order with Option 1 implemented for sum() (always 0 for empty/all-null) and prod() (1, respectively)

I can certainly understand pandas reverting back to the de-facto state of
affairs prior to 0.21.0, which would be option 3, but a radical change on
a minor release is not warranted at all. Frankly, we only have (and are
likely to get) even a small fraction of users' opinions on this whole
matter.

Yeah, agreed that bumping to 0.22 is for the best.

Jeff

On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney wrote:

We have been discussing this amongst the pandas core developers for
some time, and the general consensus is to adopt Option 1 (sum of
all-NA or empty is 0) as the behavior for sum with skipna=True.

In a groupby setting, and with categorical group keys, the issue
becomes a bit more nuanced -- if you group by a categorical, and one
of the categories is not observed at all in the dataset, e.g.:

s.groupby(some_categorical).sum()

This change will necessarily yield a Series containing no nulls -- so
if there is a category containing no data, then the sum for that
category is 0.

For the sake of algebraic completeness, I believe we should introduce
a new aggregation method that performs Option 2 (equivalent to what
pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
yields NA.

So the TL;DR is:

* We should prepare a 0.21.1 release in short order with Option 1
implemented for sum() (always 0 for empty/all-null) and prod() (1,
respectively)
* Add a new method for Option 2, either in 0.21.1 or in a later release

We should probably alert the long GitHub thread that this discussion
is taking place before we cut the release.
Since GitHub comments can be permanently deleted at any time, I think it's
better for discussions about significant issues like this to take place
on the permanent public record.

Thanks
Wes

On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston wrote:
> Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto:
>> [...]
>
> I think Nathaniel just expressed my thoughts better than I was/would be
> able to!
>
> Pietro
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ml at pietrobattiston.it  Fri Dec  8 08:36:10 2017
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Fri, 08 Dec 2017 14:36:10 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
In-Reply-To: <1020301426.981032.1512735599876@mail.yahoo.com>
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
Message-ID: <1512740170.17022.33.camel@pietrobattiston.it>

Il giorno ven, 08/12/2017 alle 12.19 +0000, Jeff Reback via Pandas-dev
ha scritto:
> [...]
> Changing this behavior shakes the core tenets of pandas, suddenly we
> have a special case
> where NaN propagation is not important anymore and worse you may get
> wrong answers.

Having to do with tons of NaNs every day, I understand your concerns,
but I see only one solution to your problems, "skipna=False".

> I would argue that the folks who want guaranteed zero for all-NaN,
> can simply fill first. The reverse operation is simply
> not possible, nor desired in any actual real world scenario.

Isn't it skipna=False?!

> 
> Pandas is not about strictly mathematical purity, rather about real
> world utility.

Sure utility matters, but I think consistency helps avoid surprises,
which is very useful, in particular in the middle of a huge API.
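To make that concrete (skipna is existing API; the point below does not
depend on which option wins for the skipna=True default):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0]})

# skipna=False propagates: any NaN in a column makes its sum NaN, so no
# information about missing values is ever lost.
df.sum(skipna=False)   # A: NaN, B: NaN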
Pietro

From jorisvandenbossche at gmail.com  Fri Dec  8 09:54:10 2017
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 8 Dec 2017 15:54:10 +0100
Subject: [Pandas-dev] Proposal to change the default number of rows for DataFrame display (lower max_rows)
Message-ID: 

*[Note for those reading it on the pydata mailing list, please answer to
pandas-dev at python.org to keep discussion centralised there]*

Hi all,

I am reposting the mail of Clemens below, but with a slightly changed
focus, as I think the main discussion point is about the number of rows.

The proposal in https://github.com/pandas-dev/pandas/pull/17023 is to
lower the default number of rows shown when displaying a Series or
DataFrame from 60 to 20.
Thoughts on that?

Best,
Joris

2017-11-28 11:57 GMT+01:00 Clemens Brunner :

> Hello!
>
> We're currently discussing a change in how data frames are displayed by
> default in https://github.com/pandas-dev/pandas/pull/17023. There are two
> proposed changes:
>
> (1) Set pd.options.display.max_columns=0 (previously this was set to 20).
> (2) Set pd.options.display.max_rows=20 (previously this was set to 60).
>
> Change (1) means that the number of printed columns is adapted to fit
> within the width of the terminal. If there are too many columns, ellipsis
> will be shown to indicate collapsed columns in the middle of the data
> frame. This doesn't work if Python is run as a Jupyter kernel (e.g. in a
> Jupyter notebook or in IPython QtConsole), in which case the maximum
> columns remain 20.
>
> Example:
> ========
> import pandas as pd
> import numpy as np
> pd.DataFrame(np.random.rand(5, 10))
>
> Output before (in a terminal with 100 chars width):
> ---------------------------------------------------
>           0         1         2         3         4         5         6  \
> 0  0.643979  0.690414  0.018603  0.991478  0.707534  0.376765  0.670848
> 1  0.547836  0.810972  0.054448  0.415112  0.268120  0.904528  0.839258
> 2  0.582256  0.732149  0.284208  0.405197  0.213591  0.715367  0.150106
> 3  0.197348  0.317159  0.051669  0.738405  0.821046  0.179270  0.245793
> 4  0.483466  0.583330  0.999213  0.882883  0.315169  0.045712  0.897048
>
>           7         8         9
> 0  0.891467  0.494220  0.713369
> 1  0.601304  0.449880  0.266205
> 2  0.113262  0.360580  0.238833
> 3  0.798063  0.077769  0.471169
> 4  0.262779  0.530565  0.992084
>
> Output after:
> -------------
>           0         1         2         3  ...         6         7         8         9
> 0  0.673621  0.211505  0.943201  0.946548  ...  0.900453  0.612182  0.861933  0.710967
> 1  0.670855  0.834449  0.796273  0.785976  ...  0.609954  0.686663  0.684582  0.837505
> 2  0.544736  0.814827  0.352893  0.459556  ...  0.650993  0.735943  0.279110  0.840203
> 3  0.440125  0.554323  0.745462  0.940896  ...  0.544576  0.224175  0.852603  0.509837
> 4  0.225551  0.791834  0.476059  0.321857  ...  0.391165  0.423213  0.290683  0.954423
>
> [5 rows x 10 columns]
>
>
> Change (2) implies fewer rows are displayed before auto-hiding takes
> place. I find that 60 rows almost always causes the terminal to scroll
> (most terminals have between 25-40 rows), so reducing the value to 20
> increases the chance that a data frame can be observed on one terminal
> page. I'm not including a before/after output since it should be easy to
> imagine how this change affects the output.
>
> Both changes would make Pandas behave similarly to R's Tidyverse (which I
> really like), but this should not be the main reason why these changes are
> a good idea.
I mainly like them because these settings make (large) data > frames much nicer to look at. > > Note that these changes affect the default values. Of course, users are > free to change them back in their active Python session. > > Comments to both proposed changes are highly welcome (either here on the > mailing list or at https://github.com/pandas-dev/pandas/pull/17023. > > Clemens > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Fri Dec 8 10:11:24 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Fri, 8 Dec 2017 09:11:24 -0600 Subject: [Pandas-dev] [pydata] Proposal to change the default number of rows for DataFrame display (lower max_rows) In-Reply-To: References: Message-ID: On Fri, Dec 8, 2017 at 8:54 AM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > *[Note for those reading it on the pydata mailing list, please answer to > pandas-dev at python.org to keep discussion > centralised there]* > > Hi all, > > I am reposting the mail of Clemens below, but with slightly changed focus, > as I think the main discussion point is about the number of rows. > > The proposal in https://github.com/pandas-dev/pandas/pull/17023 is to > lower the default number of rows shown when displaying a Series or > DataFrame from 60 to 20. > Thoughts on that? > Personally, I always set the max rows to 10 or 20, so I'd be OK with it if the community is on board. Tom > Best, > Joris > > > 2017-11-28 11:57 GMT+01:00 Clemens Brunner : > >> Hello! >> >> We're currently discussing a change in how data frames are displayed by >> default in https://github.com/pandas-dev/pandas/pull/17023. There are >> two proposed changes: >> >> (1) Set pd.options.display.max_columns=0 (previously this was set to 20). >> (2) Set pd.options.display.max_rows=20 (previously this was set to 60). >> >> Change (1) means that the number of printed columns is adapted to fit >> within the width of the terminal. If there are too many columns, ellipsis >> will be shown to indicate collapsed columns in the middle of the data >> frame. This doesn't work if Python is run as a Jupyter kernel (e.g. in a >> Jupyter notebook or in IPython QtConsole), in which case the maximum >> columns remain 20. >> >> Example: >> ======== >> import pandas as pd >> import numpy as np >> pd.DataFrame(np.random.rand(5, 10)) >> >> Output before (in a terminal with 100 chars width): >> --------------------------------------------------- >> 0 1 2 3 4 5 6 \ >> 0 0.643979 0.690414 0.018603 0.991478 0.707534 0.376765 0.670848 >> 1 0.547836 0.810972 0.054448 0.415112 0.268120 0.904528 0.839258 >> 2 0.582256 0.732149 0.284208 0.405197 0.213591 0.715367 0.150106 >> 3 0.197348 0.317159 0.051669 0.738405 0.821046 0.179270 0.245793 >> 4 0.483466 0.583330 0.999213 0.882883 0.315169 0.045712 0.897048 >> >> 7 8 9 >> 0 0.891467 0.494220 0.713369 >> 1 0.601304 0.449880 0.266205 >> 2 0.113262 0.360580 0.238833 >> 3 0.798063 0.077769 0.471169 >> 4 0.262779 0.530565 0.992084 >> >> Output after: >> ------------- >> 0 1 2 3 ... 6 7 >> 8 9 >> 0 0.673621 0.211505 0.943201 0.946548 ... 0.900453 0.612182 >> 0.861933 0.710967 >> 1 0.670855 0.834449 0.796273 0.785976 ... 0.609954 0.686663 >> 0.684582 0.837505 >> 2 0.544736 0.814827 0.352893 0.459556 ... 0.650993 0.735943 >> 0.279110 0.840203 >> 3 0.440125 0.554323 0.745462 0.940896 ... 
0.544576 0.224175 >> 0.852603 0.509837 >> 4 0.225551 0.791834 0.476059 0.321857 ... 0.391165 0.423213 >> 0.290683 0.954423 >> >> [5 rows x 10 columns] >> >> >> Change (2) implies fewer rows are displayed before auto-hiding takes >> place. I find that 60 rows almost always causes the terminal to scroll >> (most terminals have between 25-40 rows), so reducing the value to 20 >> increases the chance that a data frame can be observed on one terminal >> page. I'm not including a before/after output since it should be easy to >> imagine how this change affects the output. >> >> Both changes would make Pandas behave similar to R's Tidyverse (which I >> really like), but this should not be the main reason why these changes are >> a good idea. I mainly like them because these settings make (large) data >> frames much nicer to look at. >> >> Note that these changes affect the default values. Of course, users are >> free to change them back in their active Python session. >> >> Comments to both proposed changes are highly welcome (either here on the >> mailing list or at https://github.com/pandas-dev/pandas/pull/17023. >> >> Clemens >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > -- > You received this message because you are subscribed to the Google Groups > "PyData" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to pydata+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/d/optout. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Fri Dec 8 10:38:01 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Fri, 8 Dec 2017 09:38:01 -0600 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: <1020301426.981032.1512735599876@mail.yahoo.com> References: <1512317864.2389.102.camel@pietrobattiston.it> <1020301426.981032.1512735599876@mail.yahoo.com> Message-ID: On Fri, Dec 8, 2017 at 6:19 AM, Jeff Reback wrote: > Using Tom's example > > In [1]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': > [0, 0]}) > > In [2]: df > Out[2]: > A B C > 0 NaN NaN 0 > 1 NaN 0.0 0 > > In [3]: df.sum() > Out[3]: > A NaN > B 0.0 > C 0.0 > dtype: float64 > > > Pandas is all about propagating NaN's in a reliable and predictable way. > Folks do a series of calculations, preserving NaNs. For examples > > In [5]: df.sum() + 1 > Out[5]: > A NaN > B 1.0 > C 1.0 > dtype: float64 > > makes it perfectly obvious that we have NaN preserving operations > We don't always though. Aggregations explicitly skip NaNs: In [3]: pd.Series([1, np.nan]).sum() Out[3]: 1.0 I don't think "how aggregations handle NA" need be consistent with "how binops handle NA". > Option 1 is essentially: > > In [4]: df.fillna(0).sum() > Out[4]: > A 0.0 > B 0.0 > C 0.0 > dtype: float64 > Using the same operation as [5], but showing all NaN sum to 0, we have have > the situation > where we are no longer NaN preserving. In any actual real world > calculation this is a disaster > and the worst possible scenario. > > In [6]: df.fillna(0).sum() + 1 > Out[6]: > A 1.0 > B 1.0 > C 1.0 > dtype: float64 > > > Changing this behavior shakes the core tenants of pandas, suddenly we have > a special case > where NaN propagation is not important anymore and worse you may get wrong > answers. 
> > We have always consistently allowed reduction operations to return NaN > (with the exception of count, which is actually > counting non-nans). > > I would argue that the folks who want guaranteed zero for all-NaN, can > simply fill first. The reverse operation is simply > not possible, nor desired in any actual real world scenario. > If we pursue option 1, we would add a keyword to make the reverse operation possible. I think the best analogy here is to `skipna`. The argument "people should fill first" applies equally well to people who say `skipna` should be False by default, because that propagates NaNs (not that anyone *is* arguing that). If we add a keyword to sum like `all_na_is_na` that's equivalent to `skipna`, then we have: >>> df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, 0]}) >>> df.sum(skipna=True, all_na_is_na=False) # the default A 0.0 B 0.0 C 0.0 dtype: float64 >>> df.sum(skipna=True, all_na_is_na=True) A NaN B 0.0 C 0.0 dtype: float64 >>> df.sum(skipna=False, all_na_is_na=True) A NaN B NaN C 0.0 dtype: float64 >>> df.sum(skipna=False, all_na_is_na=False) # ValueError? So we shouldn't be discussing which one is possible. Both will be, it's a matter of choosing the defaults. > Pandas is not about strictly mathematical purity, rather about real world > utility. > > As for a decent compromise, option 3 is almost certainly the best option, > where we revert the sum([]) == NA to be 0. This would put > us back to pre-0.21.0 pandas without bottleneck, likely the biggest > installed population. This option > would cause the least friction, while maintaining consistency and > practicality. > > Making a radical departure from status quo (e.g. option 1) should have > considered debate and not be 'rushed' in as a > quick 'fix' to a supposed problem. > I don't think we're rushing things. I'm not holding out hope for unanimous agreement, but at some point we will need to do a release. I have a slight preference for getting things done sooner, so that 0.21.0 is used by as few people as possible. But getting things right for the next release is the most important thing. Tom > Jeff > On Monday, December 4, 2017, 1:27:33 PM EST, Tom Augspurger < > tom.augspurger88 at gmail.com> wrote: > > > > > On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback wrote: > > > We have been discussing this amongst the pandas core developers for > some time, and the general consensus is to adopt Option 1 (sum of > all-NA or empty is 0) as the behavior for sum with skipna=True. > > Actually, no there has not been general consensus among the core > developers. > > > I think that's the preference of the majority though. > > > Everyone loves to say that > > s.sum([NA]) == 0 makes a ton of sense, but then you have my simple example > from original issue, which Nathaniel did quote and I'll repeat here (with > a small modification): > > In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]}) > > In [3]: df > Out[3]: > A B > 0 NaN NaN > 1 NaN 0.0 > > In [4]: df.sum() > Out[4]: > A NaN > B 0.0 > dtype: float64 > > > Option 1 is de-facto making [4] have A AND B == 0.0. This loses the fact > that you have 0 > present in B. If you conflate these, you then have a situation where I do > not > know that I had a valid value in B. > > Option 2 (and 3) for that matter preserves [4]. This DOES NOT lose > information. No argument has been presented at all why this should not > hold. > > From [4] it follows that sum([NA]) must be NA. 
> > > Extending that slightly: > > > In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': > [0, 0]}) > > In [5]: df.sum() > Out[5]: > A NaN > B 0.0 > C 0.0 > dtype: float64 > > This is why I don't think the "preserving information" argument is > correct. Taking "Preserving information" > to its logical conclusion would return NaN for "B", since that > distinguishes between the sum of all > valid and the the sum with some NaNs. > > I am indifferent whether sum([]) == 0 or NA. Though I would argue that NA > is more consistent with > the rest of pandas (IOW *every* other operation on an empty Series returns > NA). > > > * We should prepare a 0.21.1 release in short order with Option 1 > implemented for sum() (always 0 for empty/all-null) and prod() (1, > respectively) > > I can certainly understand pandas reverting back to the de-facto state of > affairs prior > to 0.21.0, which would be option 3, but a radical change on a minor > release is > not warranted at all. Frankly, we only have (and are likely to get) even a > small > fraction of users opinions on this whole matter. > > > Yeah, agreed that bumping to 0.22 is for the best. > > > Jeff > > > On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney wrote: > > We have been discussing this amongst the pandas core developers for > some time, and the general consensus is to adopt Option 1 (sum of > all-NA or empty is 0) as the behavior for sum with skipna=True. > > In a groupby setting, and with categorical group keys, the issue > becomes a bit more nuanced -- if you group by a categorical, and one > of the categories is not observed at all in the dataset, e.g: > > s.groupby(some_categorical).su m() > > This change will necessarily yield a Series containing no nulls -- so > if there is a category containing no data, then the sum for that > category is 0. > > For the sake of algebraic completeness, I believe we should introduce > a new aggregation method that performs Option 2 (equivalent to what > pandas 0.21.0 is currently doing for sum()), so that empty or all-NA > yields NA. > > So the TL;DR is: > > * We should prepare a 0.21.1 release in short order with Option 1 > implemented for sum() (always 0 for empty/all-null) and prod() (1, > respectively) > * Add a new method for Option 2, either in 0.21.1 or in a later release > > We should probably alert the long GitHub thread that this discussion > is taking place before we cut the release. Since GitHub comments can > be permanently deleted at any time, I think it's better for > discussions about significant issues like this to take place on the > permanent public record. > > Thanks > Wes > > On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston > wrote: > > Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto: > >> [...] > > > > I think Nathaniel just expressed my thoughts better than I was/would be > > able to! > > > > Pietro > > ______________________________ _________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailma n/listinfo/pandas-dev > > ______________________________ _________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailma n/listinfo/pandas-dev > > > > > ______________________________ _________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/ mailman/listinfo/pandas-dev > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom.augspurger88 at gmail.com Fri Dec 8 11:02:30 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Fri, 8 Dec 2017 10:02:30 -0600 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: References: <1512317864.2389.102.camel@pietrobattiston.it> <1020301426.981032.1512735599876@mail.yahoo.com> Message-ID: On Fri, Dec 8, 2017 at 9:38 AM, Tom Augspurger wrote: > > > On Fri, Dec 8, 2017 at 6:19 AM, Jeff Reback wrote: > >> Using Tom's example >> >> In [1]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': >> [0, 0]}) >> >> In [2]: df >> Out[2]: >> A B C >> 0 NaN NaN 0 >> 1 NaN 0.0 0 >> >> In [3]: df.sum() >> Out[3]: >> A NaN >> B 0.0 >> C 0.0 >> dtype: float64 >> >> >> Pandas is all about propagating NaN's in a reliable and predictable way. >> Folks do a series of calculations, preserving NaNs. For examples >> >> In [5]: df.sum() + 1 >> Out[5]: >> A NaN >> B 1.0 >> C 1.0 >> dtype: float64 >> >> makes it perfectly obvious that we have NaN preserving operations >> > > We don't always though. Aggregations explicitly skip NaNs: > > In [3]: pd.Series([1, np.nan]).sum() > Out[3]: 1.0 > > I don't think "how aggregations handle NA" need be consistent with "how > binops handle NA". > > >> Option 1 is essentially: >> >> In [4]: df.fillna(0).sum() >> Out[4]: >> A 0.0 >> B 0.0 >> C 0.0 >> dtype: float64 >> > Using the same operation as [5], but showing all NaN sum to 0, we have >> have the situation >> where we are no longer NaN preserving. In any actual real world >> calculation this is a disaster >> and the worst possible scenario. >> >> In [6]: df.fillna(0).sum() + 1 >> Out[6]: >> A 1.0 >> B 1.0 >> C 1.0 >> dtype: float64 >> >> >> Changing this behavior shakes the core tenants of pandas, suddenly we >> have a special case >> where NaN propagation is not important anymore and worse you may get >> wrong answers. >> >> We have always consistently allowed reduction operations to return NaN >> (with the exception of count, which is actually >> counting non-nans). >> >> I would argue that the folks who want guaranteed zero for all-NaN, can >> simply fill first. The reverse operation is simply >> not possible, nor desired in any actual real world scenario. >> > > If we pursue option 1, we would add a keyword to make the reverse > operation possible. > > I think the best analogy here is to `skipna`. The argument "people should > fill first" applies equally well to people who > say `skipna` should be False by default, because that propagates NaNs (not > that anyone *is* arguing that). If we add > a keyword to sum like `all_na_is_na` that's equivalent to `skipna`, then > we have: > > > >>> df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, > 0]}) > >>> df.sum(skipna=True, all_na_is_na=False) # the default > A 0.0 > B 0.0 > C 0.0 > dtype: float64 > > >>> df.sum(skipna=True, all_na_is_na=True) > A NaN > B 0.0 > C 0.0 > dtype: float64 > > >>> df.sum(skipna=False, all_na_is_na=True) > A NaN > B NaN > C 0.0 > dtype: float64 > > >>> df.sum(skipna=False, all_na_is_na=False) # ValueError? > > So we shouldn't be discussing which one is possible. Both will be, it's a > matter of choosing the defaults. > > > >> Pandas is not about strictly mathematical purity, rather about real world >> utility. >> >> As for a decent compromise, option 3 is almost certainly the best option, >> where we revert the sum([]) == NA to be 0. 
This would put >> us back to pre-0.21.0 pandas without bottleneck, likely the biggest >> installed population. This option >> would cause the least friction, while maintaining consistency and >> practicality. >> >> Making a radical departure from status quo (e.g. option 1) should have >> considered debate and not be 'rushed' in as a >> quick 'fix' to a supposed problem. >> > > I don't think we're rushing things. I'm not holding out hope for unanimous > agreement, but at some point we will > need to do a release. I have a slight preference for getting things done > sooner, so that 0.21.0 is used by as > few people as possible. But getting things right for the next release is > the most important thing. > In case email is too low bandwidth for this discussion (and how it affects the next releases naming and timing), I'm free to do a video chat any time today, and post a summary to the mailing list on what we cover. How about 17:30 UTC (1.5 hours from now?). I'm flexible, though that's 5:30 PM in Europe so the soon the better for them. Tom Tom > > >> Jeff >> On Monday, December 4, 2017, 1:27:33 PM EST, Tom Augspurger < >> tom.augspurger88 at gmail.com> wrote: >> >> >> >> >> On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback >> wrote: >> >> > We have been discussing this amongst the pandas core developers for >> some time, and the general consensus is to adopt Option 1 (sum of >> all-NA or empty is 0) as the behavior for sum with skipna=True. >> >> Actually, no there has not been general consensus among the core >> developers. >> >> >> I think that's the preference of the majority though. >> >> >> Everyone loves to say that >> >> s.sum([NA]) == 0 makes a ton of sense, but then you have my simple >> example >> from original issue, which Nathaniel did quote and I'll repeat here (with >> a small modification): >> >> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]}) >> >> In [3]: df >> Out[3]: >> A B >> 0 NaN NaN >> 1 NaN 0.0 >> >> In [4]: df.sum() >> Out[4]: >> A NaN >> B 0.0 >> dtype: float64 >> >> >> Option 1 is de-facto making [4] have A AND B == 0.0. This loses the fact >> that you have 0 >> present in B. If you conflate these, you then have a situation where I do >> not >> know that I had a valid value in B. >> >> Option 2 (and 3) for that matter preserves [4]. This DOES NOT lose >> information. No argument has been presented at all why this should not >> hold. >> >> From [4] it follows that sum([NA]) must be NA. >> >> >> Extending that slightly: >> >> >> In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': >> [0, 0]}) >> >> In [5]: df.sum() >> Out[5]: >> A NaN >> B 0.0 >> C 0.0 >> dtype: float64 >> >> This is why I don't think the "preserving information" argument is >> correct. Taking "Preserving information" >> to its logical conclusion would return NaN for "B", since that >> distinguishes between the sum of all >> valid and the the sum with some NaNs. >> >> I am indifferent whether sum([]) == 0 or NA. Though I would argue that NA >> is more consistent with >> the rest of pandas (IOW *every* other operation on an empty Series >> returns NA). >> >> > * We should prepare a 0.21.1 release in short order with Option 1 >> implemented for sum() (always 0 for empty/all-null) and prod() (1, >> respectively) >> >> I can certainly understand pandas reverting back to the de-facto state of >> affairs prior >> to 0.21.0, which would be option 3, but a radical change on a minor >> release is >> not warranted at all. 
Frankly, we only have (and are likely to get) even >> a small >> fraction of users opinions on this whole matter. >> >> >> Yeah, agreed that bumping to 0.22 is for the best. >> >> >> Jeff >> >> >> On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney >> wrote: >> >> We have been discussing this amongst the pandas core developers for >> some time, and the general consensus is to adopt Option 1 (sum of >> all-NA or empty is 0) as the behavior for sum with skipna=True. >> >> In a groupby setting, and with categorical group keys, the issue >> becomes a bit more nuanced -- if you group by a categorical, and one >> of the categories is not observed at all in the dataset, e.g: >> >> s.groupby(some_categorical).su m() >> >> This change will necessarily yield a Series containing no nulls -- so >> if there is a category containing no data, then the sum for that >> category is 0. >> >> For the sake of algebraic completeness, I believe we should introduce >> a new aggregation method that performs Option 2 (equivalent to what >> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA >> yields NA. >> >> So the TL;DR is: >> >> * We should prepare a 0.21.1 release in short order with Option 1 >> implemented for sum() (always 0 for empty/all-null) and prod() (1, >> respectively) >> * Add a new method for Option 2, either in 0.21.1 or in a later release >> >> We should probably alert the long GitHub thread that this discussion >> is taking place before we cut the release. Since GitHub comments can >> be permanently deleted at any time, I think it's better for >> discussions about significant issues like this to take place on the >> permanent public record. >> >> Thanks >> Wes >> >> On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston >> wrote: >> > Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto: >> >> [...] >> > >> > I think Nathaniel just expressed my thoughts better than I was/would be >> > able to! >> > >> > Pietro >> > ______________________________ _________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailma n/listinfo/pandas-dev >> >> ______________________________ _________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailma n/listinfo/pandas-dev >> >> >> >> >> ______________________________ _________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/ mailman/listinfo/pandas-dev >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Fri Dec 8 12:29:19 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Fri, 8 Dec 2017 11:29:19 -0600 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: References: <1512317864.2389.102.camel@pietrobattiston.it> <1020301426.981032.1512735599876@mail.yahoo.com> Message-ID: On Fri, Dec 8, 2017 at 10:02 AM, Tom Augspurger wrote: > On Fri, Dec 8, 2017 at 9:38 AM, Tom Augspurger > wrote: > >> >> >> On Fri, Dec 8, 2017 at 6:19 AM, Jeff Reback wrote: >> >>> Using Tom's example >>> >>> In [1]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': >>> [0, 0]}) >>> >>> In [2]: df >>> Out[2]: >>> A B C >>> 0 NaN NaN 0 >>> 1 NaN 0.0 0 >>> >>> In [3]: df.sum() >>> Out[3]: >>> A NaN >>> B 0.0 >>> C 0.0 >>> dtype: float64 >>> >>> >>> Pandas is all about propagating NaN's in a reliable and predictable way. >>> Folks do a series of calculations, preserving NaNs. 
For examples >>> >>> In [5]: df.sum() + 1 >>> Out[5]: >>> A NaN >>> B 1.0 >>> C 1.0 >>> dtype: float64 >>> >>> makes it perfectly obvious that we have NaN preserving operations >>> >> >> We don't always though. Aggregations explicitly skip NaNs: >> >> In [3]: pd.Series([1, np.nan]).sum() >> Out[3]: 1.0 >> >> I don't think "how aggregations handle NA" need be consistent with "how >> binops handle NA". >> >> >>> Option 1 is essentially: >>> >>> In [4]: df.fillna(0).sum() >>> Out[4]: >>> A 0.0 >>> B 0.0 >>> C 0.0 >>> dtype: float64 >>> >> Using the same operation as [5], but showing all NaN sum to 0, we have >>> have the situation >>> where we are no longer NaN preserving. In any actual real world >>> calculation this is a disaster >>> and the worst possible scenario. >>> >>> In [6]: df.fillna(0).sum() + 1 >>> Out[6]: >>> A 1.0 >>> B 1.0 >>> C 1.0 >>> dtype: float64 >>> >>> >>> Changing this behavior shakes the core tenants of pandas, suddenly we >>> have a special case >>> where NaN propagation is not important anymore and worse you may get >>> wrong answers. >>> >>> We have always consistently allowed reduction operations to return NaN >>> (with the exception of count, which is actually >>> counting non-nans). >>> >>> I would argue that the folks who want guaranteed zero for all-NaN, can >>> simply fill first. The reverse operation is simply >>> not possible, nor desired in any actual real world scenario. >>> >> >> If we pursue option 1, we would add a keyword to make the reverse >> operation possible. >> >> I think the best analogy here is to `skipna`. The argument "people should >> fill first" applies equally well to people who >> say `skipna` should be False by default, because that propagates NaNs >> (not that anyone *is* arguing that). If we add >> a keyword to sum like `all_na_is_na` that's equivalent to `skipna`, then >> we have: >> >> >> >>> df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': [0, >> 0]}) >> >>> df.sum(skipna=True, all_na_is_na=False) # the default >> A 0.0 >> B 0.0 >> C 0.0 >> dtype: float64 >> >> >>> df.sum(skipna=True, all_na_is_na=True) >> A NaN >> B 0.0 >> C 0.0 >> dtype: float64 >> >> >>> df.sum(skipna=False, all_na_is_na=True) >> A NaN >> B NaN >> C 0.0 >> dtype: float64 >> >> >>> df.sum(skipna=False, all_na_is_na=False) # ValueError? >> >> So we shouldn't be discussing which one is possible. Both will be, it's a >> matter of choosing the defaults. >> >> >> >>> Pandas is not about strictly mathematical purity, rather about real >>> world utility. >>> >>> As for a decent compromise, option 3 is almost certainly the best >>> option, where we revert the sum([]) == NA to be 0. This would put >>> us back to pre-0.21.0 pandas without bottleneck, likely the biggest >>> installed population. This option >>> would cause the least friction, while maintaining consistency and >>> practicality. >>> >>> Making a radical departure from status quo (e.g. option 1) should have >>> considered debate and not be 'rushed' in as a >>> quick 'fix' to a supposed problem. >>> >> >> I don't think we're rushing things. I'm not holding out hope for >> unanimous agreement, but at some point we will >> need to do a release. I have a slight preference for getting things done >> sooner, so that 0.21.0 is used by as >> few people as possible. But getting things right for the next release is >> the most important thing. 
>> > > In case email is too low bandwidth for this discussion (and how it affects > the next releases naming and timing), I'm > free to do a video chat any time today, and post a summary to the mailing > list on what we cover. How about > 17:30 UTC (1.5 hours from now?). I'm flexible, though that's 5:30 PM in > Europe so the soon the better for them. > > Tom > > Tom >> >> >>> Jeff >>> On Monday, December 4, 2017, 1:27:33 PM EST, Tom Augspurger < >>> tom.augspurger88 at gmail.com> wrote: >>> >>> >>> >>> >>> On Mon, Dec 4, 2017 at 12:11 PM, Jeff Reback >>> wrote: >>> >>> > We have been discussing this amongst the pandas core developers for >>> some time, and the general consensus is to adopt Option 1 (sum of >>> all-NA or empty is 0) as the behavior for sum with skipna=True. >>> >>> Actually, no there has not been general consensus among the core >>> developers. >>> >>> >>> I think that's the preference of the majority though. >>> >>> >>> Everyone loves to say that >>> >>> s.sum([NA]) == 0 makes a ton of sense, but then you have my simple >>> example >>> from original issue, which Nathaniel did quote and I'll repeat here >>> (with a small modification): >>> >>> In [2]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0]}) >>> >>> In [3]: df >>> Out[3]: >>> A B >>> 0 NaN NaN >>> 1 NaN 0.0 >>> >>> In [4]: df.sum() >>> Out[4]: >>> A NaN >>> B 0.0 >>> dtype: float64 >>> >>> >>> Option 1 is de-facto making [4] have A AND B == 0.0. This loses the fact >>> that you have 0 >>> present in B. If you conflate these, you then have a situation where I >>> do not >>> know that I had a valid value in B. >>> >>> Option 2 (and 3) for that matter preserves [4]. This DOES NOT lose >>> information. No argument has been presented at all why this should not >>> hold. >>> >>> From [4] it follows that sum([NA]) must be NA. >>> >>> >>> Extending that slightly: >>> >>> >>> In [4]: df = DataFrame({'A' : [np.nan, np.nan], 'B' : [np.nan, 0], 'C': >>> [0, 0]}) >>> >>> In [5]: df.sum() >>> Out[5]: >>> A NaN >>> B 0.0 >>> C 0.0 >>> dtype: float64 >>> >>> This is why I don't think the "preserving information" argument is >>> correct. Taking "Preserving information" >>> to its logical conclusion would return NaN for "B", since that >>> distinguishes between the sum of all >>> valid and the the sum with some NaNs. >>> >>> I am indifferent whether sum([]) == 0 or NA. Though I would argue that >>> NA is more consistent with >>> the rest of pandas (IOW *every* other operation on an empty Series >>> returns NA). >>> >>> > * We should prepare a 0.21.1 release in short order with Option 1 >>> implemented for sum() (always 0 for empty/all-null) and prod() (1, >>> respectively) >>> >>> I can certainly understand pandas reverting back to the de-facto state >>> of affairs prior >>> to 0.21.0, which would be option 3, but a radical change on a minor >>> release is >>> not warranted at all. Frankly, we only have (and are likely to get) even >>> a small >>> fraction of users opinions on this whole matter. >>> >>> >>> Yeah, agreed that bumping to 0.22 is for the best. >>> >>> >>> Jeff >>> >>> >>> On Mon, Dec 4, 2017 at 11:05 AM, Wes McKinney >>> wrote: >>> >>> We have been discussing this amongst the pandas core developers for >>> some time, and the general consensus is to adopt Option 1 (sum of >>> all-NA or empty is 0) as the behavior for sum with skipna=True. 
>>> In a groupby setting, and with categorical group keys, the issue
>>> becomes a bit more nuanced -- if you group by a categorical, and one
>>> of the categories is not observed at all in the dataset, e.g.:
>>>
>>> s.groupby(some_categorical).sum()
>>>
>>> this change will necessarily yield a Series containing no nulls -- so
>>> if there is a category containing no data, then the sum for that
>>> category is 0.
>>>
>>> For the sake of algebraic completeness, I believe we should introduce
>>> a new aggregation method that performs Option 2 (equivalent to what
>>> pandas 0.21.0 is currently doing for sum()), so that empty or all-NA
>>> yields NA.
>>>
>>> So the TL;DR is:
>>>
>>> * We should prepare a 0.21.1 release in short order with Option 1
>>>   implemented for sum() (always 0 for empty/all-null) and prod() (1,
>>>   respectively)
>>> * Add a new method for Option 2, either in 0.21.1 or in a later release
>>>
>>> We should probably alert the long GitHub thread that this discussion
>>> is taking place before we cut the release. Since GitHub comments can
>>> be permanently deleted at any time, I think it's better for
>>> discussions about significant issues like this to take place on the
>>> permanent public record.
>>>
>>> Thanks
>>> Wes
>>>
>>> On Sun, Dec 3, 2017 at 11:17 AM, Pietro Battiston wrote:
>>> > Il giorno sab, 02/12/2017 alle 17.32 -0800, Nathaniel Smith ha scritto:
>>> >> [...]
>>> >
>>> > I think Nathaniel just expressed my thoughts better than I was/would be
>>> > able to!
>>> >
>>> > Pietro

I'll be in https://appear.in/pandas for the next hour or so.

Tom

From wesmckinn at gmail.com  Fri Dec  8 14:16:16 2017
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Dec 2017 14:16:16 -0500
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To: <1020301426.981032.1512735599876@mail.yahoo.com>
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
Message-ID:

> As for a decent compromise, option 3 is almost certainly the best
> option, where we revert the sum([]) == NA to be 0. This would put us
> back to pre-0.21.0 pandas without bottleneck, likely the biggest
> installed population. This option would cause the least friction,
> while maintaining consistency and practicality.

If we want to view the sum([]) -> NA as a regression from 0.20.3, and we are not prepared to commit to option 1 for 0.22.0 (which it seems we are not), then I would suggest the conservative option is to revert to Option 3, which was the behavior in <= 0.20.3.

I believe that Option 1 is the better choice as the default behavior for sum (always 0), with a new method added that is Option 2 (always NA), but I don't see a need to force this issue in a minor release time frame.
Then we have more time to solicit feedback.

On Fri, Dec 8, 2017 at 12:29 PM, Tom Augspurger wrote:

> [...]

From shoyer at gmail.com  Fri Dec  8 14:24:43 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Fri, 08 Dec 2017 19:24:43 +0000
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
Message-ID:

On Fri, Dec 8, 2017 at 4:20 AM Jeff Reback via Pandas-dev <pandas-dev at python.org> wrote:

> Pandas is all about propagating NaN's in a reliable and predictable way.
> Folks do a series of calculations, preserving NaNs.

Yes, in most cases. But this isn't what skipna=True does, which is explicitly an indication to skip NaNs. As many of us have argued, it is quite surprising for sum([], skipna=True) and sum([NaN], skipna=True) to differ.

> I would argue that the folks who want guaranteed zero for all-NaN can
> simply fill first. The reverse operation is simply not possible, nor
> desired in any actual real world scenario.

I think this is a little strong. As Tom points out, we could add another keyword option to sum, but even without that there are plenty of one-liners to achieve the version of sum() where all-NaN/empty inputs result in NaN. For example:

    df.count() * df.mean()
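On the example frame from earlier in the thread, that one-liner gives the option-2 style result (a sketch, using the same df as in Tom's example):

>>> df = pd.DataFrame({'A': [np.nan, np.nan], 'B': [np.nan, 0], 'C': [0, 0]})
>>> df.count() * df.mean()
A    NaN
B    0.0
C    0.0
dtype: float64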
> As for a decent compromise, option 3 is almost certainly the best
> option, where we revert the sum([]) == NA to be 0.

Yes, we could choose this if we wanted to defer breaking changes until a later release. But I think it is strictly inferior to either option (1) or (2), both of which are consistent in their own way.

> Making a radical departure from status quo (e.g. option 1) should have
> considered debate and not be 'rushed' in as a quick 'fix' to a supposed
> problem.

We have been debating this for quite some time already (weeks, months?). Nearly everyone who cares has chimed in, including all active core developers. I think it is fair to say that most (but not all) of us think option 1 is the most sensible choice.

From shoyer at gmail.com  Fri Dec  8 14:28:29 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Fri, 08 Dec 2017 19:28:29 +0000
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
Message-ID:

On Mon, Dec 4, 2017 at 2:14 PM Chris Bartak wrote:

> I came to pandas from more of a SQL/BI/Excel/etc. background rather
> than a scientific computing one.

Thanks for sharing your perspective. As one minor point, I'll note that spreadsheets (at least Google Sheets, but probably Excel as well) do define a sum without any valid entries as 0.

From jreback at yahoo.com  Fri Dec  8 19:17:37 2017
From: jreback at yahoo.com (Jeff Reback)
Date: Sat, 9 Dec 2017 00:17:37 +0000 (UTC)
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
Message-ID: <958933280.1372724.1512778657693@mail.yahoo.com>

From Stephan Hoyer:

> As many of us have argued, it is quite surprising for sum([],
> skipna=True) and sum([NaN], skipna=True) to differ.

I agree wholeheartedly with this point. However, these should simply be NaN and not 0. Otherwise you have inconsistency with other reduction operations, e.g. .min(), .mean() and so on.

> Yes, in most cases. But this isn't what skipna=True does, which is
> explicitly an indication to skip NaNs.

Here's where we differ. skipna=True does not mean, let's remove the NaN's and then compute the operation; rather it means, ignore the NaN's in computing the operation. These are distinct, and the crux of NaN propagation. This is simply a practical view of things.

From Tom's response above:

> In [3]: pd.Series([1, np.nan]).sum()
> Out[3]: 1.0

This is of course exactly the purpose of pandas. Ignoring NaNs (skipna=True) is a very sensible default. Sure, one could always mask the NaN's themselves and do anything, but again I WILL belabor the point: pandas is meant to be obvious and sensible. Making all-NaN columns do something different from mostly-NaN columns would be a completely odd state of affairs. This would be special casing all-NaN. Why would we want to add special cases?

Finally, we have a very, very limited response of users / developers here (in this thread). I could be completely wrong, but I suspect many users have been *relatively* happy with pandas choices over the years. Sure, we sometimes make decisions that turn out to be wrong, and we do change them. In this case I am raising my hand for all of the happy users, many of whom may not have commented here.

Jeff

On Friday, December 8, 2017, 2:24:55 PM EST, Stephan Hoyer <shoyer at gmail.com> wrote:

> [...]

From shoyer at gmail.com  Fri Dec  8 19:41:58 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sat, 09 Dec 2017 00:41:58 +0000
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
	<958933280.1372724.1512778657693@mail.yahoo.com>
Message-ID:

On Fri, Dec 8, 2017 at 4:17 PM Jeff Reback <jreback at yahoo.com> wrote:

> From Stephan Hoyer:
>
> > Yes, in most cases. But this isn't what skipna=True does, which is
> > explicitly an indication to skip NaNs.
>
> Here's where we differ.
> skipna=True does not mean, let's remove the NaN's and then compute the
> operation; rather it means, ignore the NaN's in computing the operation.
> These are distinct, and the crux of NaN propagation. This is simply a
> practical view of things.

I think "skipping" vs "ignore in the calculation" is too subtle of a distinction to insist on users understanding from a docstring/argument name.

> Sure, one could always mask the NaN's themselves and do anything, but
> again I WILL belabor the point: pandas is meant to be obvious and
> sensible.

If nothing else, this debate should make it very clear that there is no single "obvious and sensible" answer to how an empty or all-null sum should work. If it would help, I volunteer to survey my Twitter followers about which behavior they think is obvious ;).

The best we can do is consider various use cases and clearly explain our reasoning/decision, with the recognition that it is not possible to satisfy everyone.

> Finally, we have a very, very limited response of users / developers
> here (in this thread). I could be completely wrong, but I suspect many
> users have been *relatively* happy with pandas choices over the years.

Rather, I would say that most users probably don't actually care about this debate either way. This is edge case behavior that doesn't come up every day.

Cheers,
Stephan

From tom.augspurger88 at gmail.com  Sat Dec  9 07:10:44 2017
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Sat, 9 Dec 2017 06:10:44 -0600
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
	<958933280.1372724.1512778657693@mail.yahoo.com>
Message-ID:

On Fri, Dec 8, 2017 at 6:41 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

> [...]
>
> Rather, I would say that most users probably don't actually care about
> this debate either way. This is edge case behavior that doesn't come up
> every day.

Agreed.
Let's just emit a warning on all-NA or empty sums and *then* we'll start hearing from people :) (that's a joke, in case it wasn't clear). The fact that we lived with differing behavior based on bottleneck for so long is evidence for this not mattering too much.

Thoughts, Jeff? I'm trying to gauge where you're at and what the points of disagreement are, as you seem to be pretty strongly against option 1 and I don't think this should go forward when we're this split on the issue.

Do you agree that there isn't an obviously correct solution? That any option is valid, and it's a matter of picking good defaults, providing options, and documenting things well? Statements like "In any actual real world calculation this is a disaster and the worst possible scenario." make me think you're strongly -1 on option 1.

Tom

From jreback at yahoo.com  Sun Dec 10 11:09:34 2017
From: jreback at yahoo.com (Jeff Reback)
Date: Sun, 10 Dec 2017 16:09:34 +0000 (UTC)
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To:
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
	<958933280.1372724.1512778657693@mail.yahoo.com>
Message-ID: <1924165589.1882315.1512922174822@mail.yahoo.com>

> I think "skipping" vs "ignore in the calculation" is too subtle of a
> distinction to insist on users understanding from a docstring/argument
> name.

I agree. When I see skip, I don't assume that we should simply remove them and recompute. I understand this is what numpy does, but it is NOT what pandas does, nor has ever done. Again, this would just shock people.

I am pushing back on this entire issue because it seems that lots of folks are just assuming that, since numpy does it and R does it, it is automatically correct. Well, pandas has never completely followed semantics just because someone else does it.

Sure, this is an edge case, it's only one function, but again, special casing this one function out of many does not make much sense.

In any event, whatever transpires needs to be properly coded and put into master, not rushed out into the world. I suspect changing to option 1 can cause quite a bit of pain on grouping / categoricals.
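Concretely, the kind of edge case I mean (a sketch -- the printed output is what option 1 would imply for an unobserved category, not what any released version necessarily prints):

import pandas as pd

cat = pd.Categorical(['a', 'a'], categories=['a', 'b'])
s = pd.Series([1.0, 2.0])

s.groupby(cat).sum()
# Under option 1, the unobserved category 'b' shows 0.0 rather than NaN,
# so you can no longer tell "no rows at all" from "rows summing to 0":
# a    3.0
# b    0.0
# dtype: float64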
Jeff

On Saturday, December 9, 2017, 7:11:07 AM EST, Tom Augspurger <tom.augspurger88 at gmail.com> wrote:

> [...]

From sds at gnu.org  Sun Dec 10 15:55:54 2017
From: sds at gnu.org (Sam Steingold)
Date: Sun, 10 Dec 2017 15:55:54 -0500
Subject: [Pandas-dev] Feedback request for return value of empty or
	all-NA sum (0 or NA?)
In-Reply-To: (Joris Van den Bossche's message of "Fri, 1 Dec 2017
	02:09:10 +0100")
References:
Message-ID:

Hi,

> * Joris Van den Bossche [2017-12-01 02:09:10 +0100]:
>
> In pandas 0.21.0 we changed the behaviour of the sum method for empty or
> all-NaN Series (to consistently return NaN), see the what's new note.
> This change lead to some discussion on github whether this was the right
> choice we made.

I am afraid I must disagree with the _framing_ of the question. You are talking about "empty or all-NA" series, i.e., series without any valid data (i.e., s.isnull().all() is true). Instead, the true question is "some-NA" series, i.e., series contaminated with invalid/missing data (i.e., s.isnull().any() is true).

If some of your data is missing (== NA/NaN/None is present), you can contemplate what to do: ignore the missing records and work with the available data, or return NA.

However, if there is no missing data (NA/NaN/None), there is _no_ question of what is the right approach - you just use what you have, mathematically.

The only situation where my framing is different from yours is when the data set is empty (i.e., the list or series has 0 length), and my point here is that, mathematically, there is _NO_ question what the right answer is.

NB: I understand and appreciate that math is not your only consideration, but, given that your target audience (customers) are (applied) mathematicians, you might want to consider our opinion when making design decisions that affect us.

So, what is sum([])?
It is 0 because addition is associative:

    sum(list1 + list2) == sum(list1) + sum(list2)

Since list1 == list1 + [] for any list1, we must have

    sum(list1) == sum(list1 + []) == sum(list1) + sum([])

thus sum([]) == 0.

Therefore pd.concat([s1,s2]).sum() should be the same as s1.sum() + s2.sum() for any s1 and s2, and, indeed, in 0.20.3 (but not in 0.21):

--8<---------------cut here---------------start------------->8---
>>> pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()
True
--8<---------------cut here---------------end--------------->8---

`pd.Series([]).sum()` should be 0 - because math says so. Returning anything else violates associativity of addition and is a bug.

Moreover, _all_ known languages/systems do return 0 on empty sums (with a prominent exception of SQL -- where Postgres and SQLite say in their docs that they implement the behavior required by the standard but, since it is obviously wrong, they also offer non-standard functionality which does the right thing).

Now, let us step back. The reason an empty set has to sum up to 0 is that 0 is the neutral element for addition: 0+x = x for any x. This means that for other associative group operations, the operation on an empty set is the neutral element of that operation, e.g.:

    product([]) = 1     because 1*x = x for any x
    max([]) = -inf      because max(-inf,x) = x for any x
    min([]) = +inf      because min(+inf,x) = x for any x

(max and min -- only if you can handle infinities consistently everywhere; otherwise raising an exception is fine).
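Plain Python behaves the same way where it has an opinion (sum([]) really is 0), and the reduce form makes the identity element explicit (a quick sanity check, nothing pandas-specific; `default` for max/min requires Python >= 3.4):

import operator
from functools import reduce

sum([])                          # 0 -- the additive identity
reduce(operator.mul, [], 1)      # 1 -- the empty product
max([], default=float('-inf'))   # -inf, if you opt in via `default`
min([], default=float('inf'))    # inf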
mean and std stand aside: these are _not_ basic arithmetic operations, they are defined based on other operations, and thus:

    mean([]) = NA  (or, better yet, raises an exception)
    std([])  = NA  (or, better yet, raises an exception)
    std([x]) = NA  (or, better yet, raises an exception)

Again, while I do understand that math is not the only consideration, I beg you to remember that your customer is an applied mathematician like yours truly, and we have certain expectations from the basic math operations. Please do not surprise us like this! ;-) If you do, you will get an endless stream of bug reports that sum([]) must be 0 no matter how you handle missing data.

Thank you very much for your attention.

PS. ISTR a claim that the Series.sum method is somehow a different beast from addition of scalars. Are its authors suggesting that the identity Series([1,2,3]).sum() == 1+2+3 is a happy accident, not really guaranteed by the contract of the Series class?

--
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1504
http://steingoldpsychology.com http://www.childpsy.net http://camera.org
http://memri.org https://ffii.org http://islamexposedonline.com
Selling grief is easier than buying happiness.

From shoyer at gmail.com  Sun Dec 10 16:19:49 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 10 Dec 2017 21:19:49 +0000
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To: <1924165589.1882315.1512922174822@mail.yahoo.com>
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
	<958933280.1372724.1512778657693@mail.yahoo.com>
	<1924165589.1882315.1512922174822@mail.yahoo.com>
Message-ID:

On Sun, Dec 10, 2017 at 8:09 AM Jeff Reback <jreback at yahoo.com> wrote:

> Sure, this is an edge case, it's only one function, but again, special
> casing this one function out of many does not make much sense.

What constitutes "special casing" is a matter of perspective:

- Your argument*: Every other aggregation returns NA for all-NA input. Thus returning 0 for the sum of all-NA input would be a special case.
- My argument: Every other aggregation defines the behavior of "skipna" by dropping NA elements and then applying the operation. Thus returning NA for the sum of all-NA input would be a special case.

Likewise, consistency with other analytics systems like NumPy, MATLAB and R is desirable, but I agree that it is not a decisive argument on its own. Consistency with SQL is also desirable, but it isn't possible to be consistent with both SQL and NumPy.

* please correct me if I paraphrased this incorrectly.

From wesmckinn at gmail.com  Sun Dec 10 16:25:53 2017
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 10 Dec 2017 16:25:53 -0500
Subject: [Pandas-dev] Feedback request for return value of empty or
	all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

> remember that your customer is an applied mathematician

Please, please do not use the term "customer" to apply to a user of pandas. A customer is someone who buys things with money. We are not receiving money from you and correspondingly do not have the kinds of obligations that you are suggesting.

> given that your target audience (customers) are (applied) mathematicians

We do not take this as a given.

Thanks
Wes

On Sun, Dec 10, 2017 at 3:55 PM, Sam Steingold <sds at gnu.org> wrote:

> [...]
From shoyer at gmail.com  Sun Dec 10 16:28:33 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 10 Dec 2017 21:28:33 +0000
Subject: [Pandas-dev] [pydata] Re: Feedback request for return value of
	empty or all-NA sum (0 or NA?)
In-Reply-To:
References:
Message-ID:

On Sun, Dec 10, 2017 at 12:55 PM Sam Steingold <sds at gnu.org> wrote:

> NB: I understand and appreciate that math is not your only
> consideration, but, given that your target audience (customers) are
> (applied) mathematicians, you might want to consider our opinion when
> making design decisions that affect us.

Sam, thank you for sharing your perspective. Yes, mathematicians are part of our audience for pandas. But to be clear, they are a relatively small portion.
A big part of the reason why pandas has become so popular is that it is now used by a wide range of users for data analytics, including many without formal mathematical training.

From ryan at theoremlp.com  Tue Dec 12 18:00:39 2017
From: ryan at theoremlp.com (Ryan Bressler)
Date: Tue, 12 Dec 2017 16:00:39 -0700
Subject: [Pandas-dev] Feedback request for return value of empty or
	all-NA sum (0 or NA?)
Message-ID:

I posted some brief feedback on the issue tracker and Joris asked me to weigh in here with our experience.

First off, some numbers. We maintain about ~30k lines of scientific Python with a team of ~6 (and growing) researchers and engineers. I've just started to audit the code base for this issue, but a quick grep reveals about 170 invocations of "sum", though some of those are numpy (more on that in a second). I recently tried to upgrade to pandas 0.21 and a large number of our unit tests failed. For now we'll stay at 0.20, but this incident is also causing us to discuss limiting the use of pandas in our code base.

We are in the financial industry, and a lot of these invocations sum monetary amounts where pd.Series([]).sum() == 0 makes sense and may even be a common occurrence, especially when aggregating via groupby or similar. I.e., questions like "how many total dollars of apples were sold on Tuesday" are common and often have answer 0.

However, the less domain specific and perhaps more insidious way this breaks our code is that we use a mix of pandas and numpy. We tend to use pandas for dealing with mixed data types and prototyping, and then use pure numpy in areas where we care about speed or need to interface with scikit-learn etc. This change means that pandas and numpy collections have a very similar interface but very different behavior. Further, we have this nasty behavior:

>>> np.sum(pd.Series([]))
nan
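For contrast, plain numpy in the same session does what our numpy-facing code assumes (the empty array's sum is 0.0):

>>> np.sum(np.array([]))
0.0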
At first glance there isn't really a clean or consistent way for us to deal with this. If it isn't reverted, we're in for a lot of careful auditing and special casing. For many sections of code it may be simplest to just eliminate pandas use.

We are quite strict about dependency management, which will allow us to avoid the problematic versions. However, having worked in academic research previously, I'd encourage you all to minimize headaches for downstream package maintainers / users by minimizing the number of releases with this inconsistent behavior.

Thanks for reading and for all your hard work:

Ryan Bressler
Theorem LP

From me at pietrobattiston.it  Tue Dec 12 19:23:06 2017
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 13 Dec 2017 01:23:06 +0100
Subject: [Pandas-dev] [pydata] Feedback request for return value of empty
	or all-NA sum (0 or NA?)
In-Reply-To: <1924165589.1882315.1512922174822@mail.yahoo.com>
References: <1512317864.2389.102.camel@pietrobattiston.it>
	<1020301426.981032.1512735599876@mail.yahoo.com>
	<958933280.1372724.1512778657693@mail.yahoo.com>
	<1924165589.1882315.1512922174822@mail.yahoo.com>
Message-ID: <1513124586.17022.97.camel@pietrobattiston.it>

Il giorno dom, 10/12/2017 alle 16.09 +0000, Jeff Reback via Pandas-dev ha scritto:

> > I think "skipping" vs "ignore in the calculation" is too subtle of a
> > distinction to insist on users understanding from a
> > docstring/argument name.
>
> I agree. When I see skip, I don't assume that we should simply remove
> them and recompute. I understand this is what numpy does, but it is
> NOT what pandas does, nor has ever done. Again, this would just shock
> people.

Not only am I not shocked by this possibility, but after reading it multiple times, I still fail to understand how "ignore in the calculation" conceptually differs from "skip and then calculate".

> I am pushing back on this entire issue because it seems that lots of
> folks are just assuming that, since numpy does it and R does it, it is
> automatically correct. Well, pandas has never completely followed
> semantics just because someone else does it.

While I would not put numpy and R at the same level - most users using pandas will sooner or later use numpy, while the same might not be true for R - I agree with your general argument. However, for me the point is not "we should do what they did". It rather is "if we do something different, either they should be regretting their decision, or the needs of the users are different... or we are wrong".

Now, from this discussion I understand that it is SQL developers who are regretting a design decision, and it is not obvious to me how user expectations should differ between R and pandas (that is, why R users should dislike a "practical way to view things").

Two more quick points I would like to add:

- all else equal, it is better if the (default, at least) behavior can be described with fewer words rather than more: and this is where mathematical purity is positively correlated with practicality
- I entirely agree with Stephan when he says that most users probably just never encountered the edge case we are discussing... at least if he means sum([NA]), which is indeed pretty rare (despite having a pretty clear preference on what I would like the behavior to be if I happened to face it, I admit I might have never taken the sum of a variable with only missing values). If we are talking about sum([]), however, this is a different story, and I'm ready to bet that some previously written code of mine _was_ broken by 0.21.0.

This for me means two things: that on sum([NA]) we will hardly get much user feedback "out of experience", and that on sum([]), assuming we agree to revert to the pre-0.21.0 behavior, sooner is better than later.

Pietro

From tom.augspurger88 at gmail.com  Tue Dec 12 23:24:47 2017
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Tue, 12 Dec 2017 22:24:47 -0600
Subject: [Pandas-dev] ANN: pandas v0.21.1 released
Message-ID:

Hi all,

I'm happy to announce pandas 0.21.1 has been released. This is a minor bug-fix release in the 0.21.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version. Highlights include:

- Temporarily restore matplotlib datetime plotting functionality. This should resolve issues for users who relied implicitly on pandas to plot datetimes with matplotlib. See here.
- Improvements to the Parquet IO functions introduced in 0.21.0. See here.
*How to get it:* Source tarballs and windows/mac/linux wheels are available on PyPI (thanks to Christoph Gohlke for the windows wheels, and to Matthew Brett for setting up the mac/linux wheels). Conda packages currently building for conda forge, and already available on the default channel. *Issues:* Please report any issues on our issue tracker: https://github.com/py data/pandas/issues *Thanks to all the contributors:* A total of 46 people contributed to this release. People with a ?+? by their names contributed a patch for the first time. - Aaron Critchley + - Alex Rychyk - Alexander Buchkovsky + - Alexander Michael Schade + - Chris Mazzullo - Cornelius Riemenschneider + - Dave Hirschfeld + - David Fischer + - David Stansby + - Dror Atariah + - Eric Kisslinger + - Hans + - Ingolf Becker + - Jan Werkmann + - Jeff Reback - Joris Van den Bossche - J?rg D?pfert + - Kevin Kuhl + - Krzysztof Chomski + - Leif Walsh - Licht Takeuchi - Manraj Singh + - Matt Braymer-Hayes + - Michael Waskom + - Mie~~~ + - Peter Hoffmann + - Robert Meyer + - Sam Cohan + - Sietse Brouwer + - Sven + - Tim Swast - Tom Augspurger - Wes Turner - William Ayd + - Yee Mey + - bolkedebruin + - cgohlke - derestle-htwg + - fjdiod + - gabrielclow + - gfyoung - ghasemnaddaf + - jbrockmendel - jschendel - miker985 + - topper-123 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Dec 19 19:08:24 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 20 Dec 2017 01:08:24 +0100 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: <1513124586.17022.97.camel@pietrobattiston.it> References: <1512317864.2389.102.camel@pietrobattiston.it> <1020301426.981032.1512735599876@mail.yahoo.com> <958933280.1372724.1512778657693@mail.yahoo.com> <1924165589.1882315.1512922174822@mail.yahoo.com> <1513124586.17022.97.camel@pietrobattiston.it> Message-ID: Hi all, Thanks all for the feedback, that's really appreciated! There clearly is still some disagreement, but hopefully we can decide shortly on how to move forward. In the mean time we are discussing on github how the API could look like to switch between both options (returning 0 vs NA for empty/all-NA series) on this issue: https://github.com/pandas-dev/pandas/issues/18678#issuecomment-352885513 We are now thinking of a "empty_is_na=True/False" keyword instead of adding a new method, but that's certainly not yet set in stone. Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Dec 21 16:34:22 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 21 Dec 2017 15:34:22 -0600 Subject: [Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?) In-Reply-To: References: <1512317864.2389.102.camel@pietrobattiston.it> <1020301426.981032.1512735599876@mail.yahoo.com> <958933280.1372724.1512778657693@mail.yahoo.com> <1924165589.1882315.1512922174822@mail.yahoo.com> <1513124586.17022.97.camel@pietrobattiston.it> Message-ID: A quick status update for those not following along on GitHub. Development is happening in https://github.com/pandas-dev/pandas/pull/18876 adding a `min_count` keyword to sum and prod. That PR is backwards compatible. A followup API-breaking PR will change the default min_count from 1 to 0. The next release will be 0.22.0 and will be identical to 0.21.1 + those changes. We hope to have that out before long. 
Tom

On Tue, Dec 19, 2017 at 6:08 PM, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> [...]

From tom.w.augspurger at gmail.com  Sun Dec 31 07:46:19 2017
From: tom.w.augspurger at gmail.com (Tom Augspurger)
Date: Sun, 31 Dec 2017 06:46:19 -0600
Subject: [Pandas-dev] ANN: Pandas v0.22.0 released
Message-ID:

Hi all,

I'm happy to announce pandas 0.22.0 has been released. This is a major release from 0.21.1 and includes a single, API-breaking change. We recommend that all users upgrade to this version after carefully reading the release note. The only changes are:

- The sum of an empty or all-*NA* Series is now 0
- The product of an empty or all-*NA* Series is now 1
- We've added a min_count parameter to .sum() and .prod() controlling the minimum number of valid values for the result to be valid. If fewer than min_count non-*NA* values are present, the result is *NA*. The default is 0. To return NaN, the 0.21 behavior, use min_count=1.

See the pandas 0.22.0 whatsnew overview for further explanation of all the places in the library this affects.

- Tom

---

*What is it:*

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

*How to get it:*

Source tarballs and windows/mac/linux wheels are available on PyPI (thanks to Christoph Gohlke for the Windows wheels, and to Matthew Brett for setting up the Mac / Linux wheels). Conda packages are available on the default and conda-forge channels.

*Issues:*

Please report any issues on our issue tracker: https://github.com/pydata/pandas/issues