[Numpy-discussion] sorting -inf, nan, inf

Charles R Harris charlesr.harris at gmail.com
Tue Sep 19 17:21:13 EDT 2006


On 9/19/06, Tim Hochberg <tim.hochberg at ieee.org> wrote:
>
> Charles R Harris wrote:
> >
> >
> > On 9/19/06, *Tim Hochberg* <tim.hochberg at ieee.org
> > <mailto:tim.hochberg at ieee.org>> wrote:
> >
> >     A. M. Archibald wrote:
> >     > On 19/09/06, Tim Hochberg <tim.hochberg at ieee.org
> >     <mailto:tim.hochberg at ieee.org>> wrote:
> >     >
> >     >> Keith Goodman wrote:
> >     >>
> >     >>> In what order would you like argsort to sort the values -inf,
> >     nan, inf?
> >     >>>
> >     >>>
> >     >> Ideally, -inf should sort first, inf should sort last and nan
> >     should
> >     >> raise an exception if present.
> >     >>
> >     >> -tim
> >     >>
> >     >
> >     > Mmm. Somebody who's working with NaNs has more or less already
> >     decided
> >     > they don't want to be pestered with exceptions for invalid data.
> >     Do you really think so? In my experience NaNs are nearly always
> >     just an
> >     indication of a mistake somewhere that didn't get trapped for one
> >     reason
> >     or another.
> >
> >     >  I'd
> >     > be happy if they wound up at either end, but I'm not sure it's
> worth
> >     > hacking up the sort algorithm when a simple isnan() can pull
> >     them out.
> >     >
> >     Moving them to the end seems to be the worst choice to me. Leaving
> >     them
> >     alone is fine with me. Or raising an exception would be fine. Or
> doing
> >     one or the other depending on the error mode settings would be even
> >     better if it is practical.
> >
> >     > What's happening now is that NaN<a, NaN==a, and NaN>a are all
> >     false,
> >     > which rather confuses the sorting algorithm. But as long as it
> >     doesn't
> >     > actually *break* (that is, leave some of the non-NaNs incorrectly
> >     > sorted) I don't care.
> >     >
> >     Is that true? Are all of numpy's sorting algorithms robust against
> >     nontransitive objects laying around? The answer to that appears to
> be
> >     no. Try running this a couple of times to see what I mean:
> >
> >     n = 10
> >     for i in range(10):
> >         for kind in 'quicksort', 'heapsort', 'mergesort':
> >                  a = rand(n)
> >                  b = a.copy()
> >                  a[n//2] = nan
> >                  a.sort(kind=kind)
> >                  b.sort(kind=kind)
> >                  assert alltrue(a[:n//2] == b[:n//2]), (kind, a, b)
> >
> >
> >     The values don't correctly cross the inserted NaN and the sort is
> >     incorrect.
> >
> >
> > Except for heapsort those are doing insertion sorts because n is so
> > small. Heapsort is trickier to understand because of the way the heap
> > is formed and values pulled off.
> I'm not sure where the breakpoint is, but I was seeing failures for all
> three sort types with N as high as 10000. I suspect that they're all
> broken in the presence of  NaNs.  I further suspect you'd need some
> punishingly slow n**2 algorithm to be robust in the presence of NaNs.


Quicksort and Mergesort are divide and conquer types. When the pieces get
below a certain length they finish up with insertion sort as it is more
efficient for small bits. In numpy the breakover points are 15 and 20
respectively. I believe microsoft did a project on this and ended up with a
number somewhere around 7, but I didn't see that when doing the tuning for
numarray way back when. Not that I did a *lot* of careful tuning work ;)

> I'll look into that. I bet searchsorted gets messed up by nan's. Do
> > you think it worthwhile to worry about it?
>
> No. But that's because it's my contention is that any sequence with NaNs
> in it is *not sorted* and thus isn't suitable input for searchsorted.


If this sort of thing can cause unexpected errors I wonder if it would be
worth it to have a global debugging flag that essentially causes  isnan to
be called before any function applications.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20060919/42f822d6/attachment.html>


More information about the NumPy-Discussion mailing list