[Numpy-discussion] What should be the result in some statistics corner cases?

Charles R Harris charlesr.harris at gmail.com
Mon Jul 15 17:34:22 EDT 2013


On Mon, Jul 15, 2013 at 2:44 PM, <josef.pktd at gmail.com> wrote:

> On Mon, Jul 15, 2013 at 4:24 PM,  <josef.pktd at gmail.com> wrote:
> > On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
> >> <charlesr.harris at gmail.com> wrote:
> >>> Let me try to summarize. To begin with, the environment of the nan
> >>> functions is rather special.
> >>>
> >>> 1) if the array is not of inexact type, they punt to the non-nan
> >>> versions.
> >>> 2) if the array is of inexact type, then out and dtype must be inexact if
> >>> specified
> >>>
> >>> The second assumption guarantees that NaN can be used in the return
> >>> values.
> >>
> >> The requirement on the 'out' dtype only exists because currently the
> >> nan functions like to return nan for things like empty arrays, right?
> >> If not for that, it could be relaxed? (it's a rather weird
> >> requirement, since the whole point of these functions is that they
> >> ignore nans, yet they don't always...)
> >>
> >>> sum and nansum
> >>>
> >>> These should be consistent so that empty sums are 0. This should cover the
> >>> empty array case, but will change the behaviour of nansum which currently
> >>> returns NaN if the array isn't empty but the slice is after NaN removal.
> >>
> >> I agree that returning 0 is the right behaviour, but we might need a
> >> FutureWarning period.
> >>
> >>> mean and nanmean
> >>>
> >>> In the case of empty arrays, an empty slice, this leads to 0/0. For Python
> >>> this is always a zero division error; for Numpy this raises a warning and
> >>> returns NaN for floats, 0 for integers.
> >>>
> >>> Currently mean returns NaN and raises a RuntimeWarning when 0/0 occurs. In
> >>> the special case where dtype=int, the NaN is cast to integer.
> >>>
> >>> Option1
> >>> 1) mean raise error on 0/0
> >>> 2) nanmean no warning, return NaN
> >>>
> >>> Option2
> >>> 1) mean raise warning, return NaN (current behavior)
> >>> 2) nanmean no warning, return NaN
> >>>
> >>> Option3
> >>> 1) mean raise warning, return NaN (current behavior)
> >>> 2) nanmean raise warning, return NaN
> >>
> >> I have mixed feelings about the whole np.seterr apparatus, but since
> >> it exists, shouldn't we use it for consistency? I.e., just do whatever
> >> numpy is set up to do with 0/0? (Which I think means, warn and return
> >> NaN by default, but this can be changed.)
> >>
> >>> var, std, nanvar, nanstd
> >>>
> >>> 1) if ddof > axis(axes) size, raise error, probably a program bug.
> >>> 2) If ddof=0, then whatever is the case for mean, nanmean
> >>>
> >>> For nanvar, nanstd it is possible that some slices are good, some bad, so
> >>>
> >>> option1
> >>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
> >>>
> >>> option2
> >>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
> >>
> >> I don't really have any intuition for these ddof cases. Just raising
> >> an error on negative effective dof is pretty defensible and might be
> >> the safest -- it's easy to turn an error into something sensible
> >> later if people come up with use cases...
> >
> > related: why does reduceat not have empty slices?
> >
> >>>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
> > array([ 6,  4, 11,  7,  7])
> >
> >
> > I'm in favor of returning nans instead of raising exceptions, except
> > if the return type is int and we cannot cast nan to int.
> >
> > If we get functions into numpy that know how to handle nans, then it
> > would be useful to get the nans, so we can work with them
> >
> > Some cases where this might come in handy are when we iterate over
> > slices of an array that define groups or category levels with possible
> > empty groups *)
> >
> >>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
> >>>> x = np.arange(9)
> >>>> [x[idx==ii].mean() for ii in range(4)]
> > [1.5, 5.0, nan, 7.5]
> >
> > instead of
> >>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
> > [1.5, 5.0, 7.5]
> >
> > same for var, I wouldn't have to check that the size is larger than
> > the ddof (whatever that is in the specific case)
> >
> > *) groups could be empty because they were defined for a larger
> > dataset or as a union of different datasets
>
> background:
>
> I wrote several robust anova versions a few weeks ago that were
> essentially list comprehensions as above. However, I didn't allow nans
> and didn't check for minimum size.
> Allowing for empty groups to return nan would mainly be a convenience,
> since I need to check the group size only once.
>
> ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
> for tests of correlation ddof=2   IIRC
> so we would need to check for the corresponding minimum size so that n - ddof > 0
>
> "negative effective dof" doesn't exist, that's np.maximum(n - ddof, 0)
> which is always non-negative but might result in a zero-division
> error. :)
>
> I don't think making anything conditional on ddof>0 is useful.
>
>
So how would you want it?

To summarize the problem areas:

1) What is the sum of an empty slice? NaN or 0?
2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?
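
For reference, a rough sketch of the current behavior described in the thread
(1.7-era NumPy, warning output omitted; later releases may well change some of
these, which is exactly what is up for discussion):

>>> import numpy as np
>>> np.sum([])                    # 0.0 -- the empty sum is the additive identity
>>> np.nansum([np.nan, np.nan])   # currently nan: non-empty array, empty slice after NaN removal
>>> np.mean([])                   # nan, plus a RuntimeWarning from the 0/0
>>> np.var([1.0], ddof=1)         # n - ddof == 0, also 0/0: nan plus a RuntimeWarning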

I'm leaning toward NaN and warn for 2--3 because, as Nathaniel notes, the
warning can be turned into an error by the user. The errstate context
manager would be good for that.
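
A minimal sketch of that errstate idea on a bare 0/0 (the toy array is only
illustrative): a user who prefers a hard failure over NaN-and-warn can opt in
through the floating-point error state.

>>> import numpy as np
>>> a = np.zeros(1)
>>> with np.errstate(invalid='warn'):     # the default: warn and carry on
...     a / a                             # RuntimeWarning, result is array([ nan])
>>> with np.errstate(invalid='raise'):    # opt in to an exception instead
...     a / a                             # raises FloatingPointError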

Chuck