[SciPy-User] bug in rankdata?

Warren Weckesser warren.weckesser at gmail.com
Fri Feb 15 10:32:09 EST 2013


On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
> The results I'm getting from rankdata seem completely wrong for large
> datasets. I'll illustrate with a case where all data are equal, so
> every rank should be len(data) / 2 + 0.5.
>
> In [220]: rankdata(np.ones((10000,), dtype=np.int))
> Out[220]: array([ 5000.5,  5000.5,  5000.5, ...,  5000.5,  5000.5,  5000.5])
>
> In [221]: rankdata(np.ones((100000,), dtype=np.int))
> Out[221]:
> array([ 7050.82704,  7050.82704,  7050.82704, ...,  7050.82704,
>         7050.82704,  7050.82704])
>
> In [222]: rankdata(np.ones((1000000,), dtype=np.int))
> Out[222]:
> array([ 1784.293664,  1784.293664,  1784.293664, ...,  1784.293664,
>         1784.293664,  1784.293664])
>
> In [223]: scipy.__version__
> Out[223]: '0.11.0'
>
> In [224]: numpy.__version__
> Out[224]: '1.6.1'
>
>
> The results are completely off for N>10000 or so. Am I doing something
> wrong?


Looks like a bug.  The code that accumulates the ranks of the tied
values uses a 32-bit integer for the sum of the ranks, and that sum
overflows for large inputs.  I'll see if I can get this fixed for the
imminent release of 0.12.
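The overflow hypothesis reproduces the reported numbers exactly.  For n
tied values the true sum of ranks is n*(n+1)/2; reducing that sum modulo
2**32 (what an unsigned 32-bit accumulator would keep) and dividing by n
gives precisely the averages shown above.  A quick sketch of the
arithmetic (this just illustrates the suspected wraparound, it is not
SciPy's actual implementation):

```python
def wrapped_mean_rank(n):
    # True sum of ranks 1..n when all n values are tied.
    total = n * (n + 1) // 2
    # Value retained by a 32-bit accumulator after wraparound.
    # (These particular results stay below 2**31, so signed vs.
    # unsigned overflow gives the same answer here.)
    wrapped = total % 2**32
    return wrapped / n

print(wrapped_mean_rank(10_000))     # 5000.5  (no overflow yet)
print(wrapped_mean_rank(100_000))    # 7050.82704
print(wrapped_mean_rank(1_000_000))  # 1784.293664
```

The wrapped averages match the Out[221] and Out[222] arrays digit for
digit, which is strong evidence for the 32-bit accumulator diagnosis.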

Warren


> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>


