[SciPy-User] bug in rankdata?
Warren Weckesser
warren.weckesser at gmail.com
Fri Feb 15 10:32:09 EST 2013
On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
> The results I'm getting from rankdata seem completely wrong for large
> datasets. I'll illustrate with a case where all data are equal, so
> every rank should be len(data) / 2 + 0.5.
>
> In [220]: rankdata(np.ones((10000,), dtype=np.int))
> Out[220]: array([ 5000.5, 5000.5, 5000.5, ..., 5000.5, 5000.5,
> 5000.5])
>
> In [221]: rankdata(np.ones((100000,), dtype=np.int))
> Out[221]:
> array([ 7050.82704, 7050.82704, 7050.82704, ..., 7050.82704,
> 7050.82704, 7050.82704])
>
> In [222]: rankdata(np.ones((1000000,), dtype=np.int))
> Out[222]:
> array([ 1784.293664, 1784.293664, 1784.293664, ..., 1784.293664,
> 1784.293664, 1784.293664])
>
> In [223]: scipy.__version__
> Out[223]: '0.11.0'
>
> In [224]: numpy.__version__
> Out[224]: '1.6.1'
>
>
> The results are completely off for N>10000 or so. Am I doing something
> wrong?
Looks like a bug. The code that accumulates the ranks of tied
values uses a 32-bit integer for the sum of the ranks, and this
overflows for large inputs. I'll see if I can get this fixed for
the imminent release of 0.12.
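
A quick back-of-the-envelope check (independent of SciPy's actual code)
shows the reported numbers are exactly what a 32-bit wraparound of the
rank sum would produce. The true average rank for n equal values is
(n + 1) / 2, but if the sum 1 + 2 + ... + n is accumulated in a 32-bit
integer it wraps modulo 2**32 before being divided by n:

```python
# For each problem size, compare the correct average rank with the
# average you get if the rank sum wraps around at 2**32.
for n in (10_000, 100_000, 1_000_000):
    rank_sum = n * (n + 1) // 2    # exact sum of ranks 1..n
    wrapped = rank_sum % 2**32     # value after 32-bit overflow
    print(n, (n + 1) / 2, wrapped / n)

# 10000 5000.5 5000.5            <- sum still fits in 32 bits
# 100000 50000.5 7050.82704      <- matches the buggy output above
# 1000000 500000.5 1784.293664   <- matches the buggy output above
```

The first size is correct only because 50005000 still fits in 32 bits;
for the larger sizes the wrapped averages reproduce the reported values
exactly, which is why the results look fine below N ~ 10000 or so.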
Warren