[Numpy-discussion] sorting -inf, nan, inf

Tue Sep 19 21:45:36 EDT 2006

Charles R Harris wrote:
>
>
> On 9/19/06, *A. M. Archibald* <peridot.faceted at gmail.com 
> <mailto:peridot.faceted at gmail.com>> wrote:
>
>     On 19/09/06, Charles R Harris <charlesr.harris at gmail.com
>     <mailto:charlesr.harris at gmail.com>> wrote:
>     >
>     >
>     >
>     > For floats we could use something like:
>     >
>     > lessthan(a,b) := a < b || (a == nan && b != nan)
>
I believe this would have to be some sort of isnan macros since 
everything compares not equal to nan. I forget the correct spelling at 
the moment though,

>     >
>     > Which would put all the nans at one end and might not add too
>     much overhead.
>
>     You could put an any(isnan()) out front and run this slower version
>     only if there are any NaNs (also, you can't use == for NaNs, you have
>     to use C isNaN). But I'm starting to see the wisdom in simply throwing
>     an exception, since sorting is not well-defined with NaNs.
>
>
> Looks like mergesort can be modified to sort around the NaNs without 
> too much trouble if there is a good isnan function available: just 
> cause the pointers to skip over them. I see that most of the isnan 
> stuff seems to be in the ufunc source and isn't terribly simple. Could 
> be broken out into a separate include, I suppose.
>
> I still wonder if it is worth the trouble. As to raising an exception, 
> I seem to recall reading somewhere that exception code tends to be 
> expensive, I haven't done any benchmarks myself.

I'm still somewhat mystified by the desire to move the nans to one end 
of the sorted object. I see two scenarios:

   1. The user is not expecting to have nans in the data. In this case
      moving the nans to end is not helpful. The resulting array is
      still not sorted in the sense that a[i-1]<=a[i]<=a[i+1] does not
      hold and thus is likely to break code that relies on the array
      being sorted. The most prominent example of which is searchsorted.
      In this case you really want to raise an exception if possible
      since no good will come from letting the code continue to run. In
      this case the time in involved in throwing and catching an
      exception is irrelevant.
   2. The user *is* expecting to have nans in the data. This is
      presumably the case that the sorting-nans-to-the-end idea is aimed
      at. So far at least the suggested use has been to sort and then
      strip the nans. I suggest that a better approach is to test for
      and strip the nans before the sort. For example:

    # a is an array that may have some nans
    # you can do this more pithily, but I'm aiming to minimize isnan calls
    # note that this *sometimes* makes a copy.
    nanmask = isnan(a)
    if sometrue(nanmask):
        a = a[not nanmask]
    a.sort()
    #.....

    I presume that isnan is somewhat more expensive than the basic '<'
    operator. In the proposed sort to end version we need N*log(N) isnan
    calls versus just N in the above case. The sort to end case probably
    won't look any cleaner than the above either since you still need to
    count the nans to determine how many to strip.

Perhaps there's some use for the sort to end behaviour that I'm missing, 
but the raise an exception behaviour sure looks a lot more appealing to me.

Here's a strawman proposal:

    Sort the array. Then examine numpy.geterr()['invalid']. If it is not
    'ignore', then check examine sometrue(isnan(thearray)). If the
    latter is true then raise and error, issue a warning or call the
    error reporting functioni as appropriate. Note that we always sort
    the array to be consistent with the behaviour of the ufuncs that
    proceed even when they end up raising an exception.

-tim