[SciPy-User] bug in rankdata?

Fri Feb 15 13:22:09 EST 2013

On Fri, Feb 15, 2013 at 10:32 AM, Warren Weckesser <warren.weckesser at gmail.com> wrote:

> On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
> > The results I'm getting from rankdata seem completely wrong for large
> > datasets. I'll illustrate with a case where all data are equal, so
> > every rank should be len(data) / 2 + 0.5.
> >
> > In [220]: rankdata(np.ones((10000,), dtype=np.int))
> > Out[220]: array([ 5000.5,  5000.5,  5000.5, ...,  5000.5,  5000.5,  5000.5])
> >
> > In [221]: rankdata(np.ones((100000,), dtype=np.int))
> > Out[221]:
> > array([ 7050.82704,  7050.82704,  7050.82704, ...,  7050.82704,
> >         7050.82704,  7050.82704])
> >
> > In [222]: rankdata(np.ones((1000000,), dtype=np.int))
> > Out[222]:
> > array([ 1784.293664,  1784.293664,  1784.293664, ...,  1784.293664,
> >         1784.293664,  1784.293664])
> >
> > In [223]: scipy.__version__
> > Out[223]: '0.11.0'
> >
> > In [224]: numpy.__version__
> > Out[224]: '1.6.1'
> >
> >
> > The results are completely off for N>10000 or so. Am I doing something
> > wrong?
>
>
> Looks like a bug.  The code that accumulates the ranks of the tied
> values uses a 32-bit integer for the sum of the ranks, and that sum
> overflows for large inputs.  I'll see if I can get this fixed for the
> imminent release of 0.12.
>
> Warren
>
>

A pull request with the fix is here:
https://github.com/scipy/scipy/pull/436
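
For reference, the numbers reported above are consistent with the sum of
the ranks 1..N wrapping around in a signed 32-bit integer before it is
divided by N. The sketch below is not the SciPy code; it only simulates
that wraparound in plain Python, and the helper names (exact_mean_rank,
wrapped_mean_rank) are made up for illustration.

```python
INT32_MODULUS = 2**32
INT32_HALF = 2**31


def exact_mean_rank(n):
    """Average rank shared by n tied values: (1 + 2 + ... + n) / n."""
    return (n + 1) / 2


def wrapped_mean_rank(n):
    """Same average, but with the rank sum forced through a signed
    32-bit integer (assumed two's-complement wraparound)."""
    exact_sum = n * (n + 1) // 2  # Python ints do not overflow
    # Map the exact sum into the range [-2**31, 2**31 - 1].
    wrapped_sum = (exact_sum + INT32_HALF) % INT32_MODULUS - INT32_HALF
    return wrapped_sum / n


for n in (10_000, 100_000, 1_000_000):
    print(n, exact_mean_rank(n), wrapped_mean_rank(n))
# 10000 5000.5 5000.5
# 100000 50000.5 7050.82704
# 1000000 500000.5 1784.293664
```

For N = 10000 the sum (50005000) still fits in 32 bits, which is why only
the larger inputs go wrong.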

Warren
