[SciPy-User] bug in rankdata?

Wes McKinney wesmckinn at gmail.com
Sat Feb 16 13:58:51 EST 2013


On Sat, Feb 16, 2013 at 1:22 PM, Warren Weckesser
<warren.weckesser at gmail.com> wrote:
> On 2/15/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>> Thanks very much! I discovered this bug because Mann-Whitney U was
>> giving me bizarre results, like a negative U statistic. My data are a
>> large number of integer counts, mostly zeros, which is the worst case
>> for ties.
>>
>> Until I can update scipy, I'll either write my own rankdata method,
>> which will be very slow, or use the R equivalent, which has more
>> features (but then I have to figure out rpy2, which will also be
>> slow).
>
>
> You could also try pandas (http://pandas.pydata.org/).  The DataFrame
> and Series classes have a 'rank' method
> (http://pandas.pydata.org/pandas-docs/stable/computation.html#data-ranking).
>
> Warren
>

I'll add that the pandas rank methods are roughly 20x faster than
scipy.stats.rankdata, so that's another reason to use them for large-ish
arrays.
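
For example, a quick sketch of what Warren is suggesting (this assumes a
reasonably recent pandas; Series.rank defaults to average ranks, the
same tie handling as rankdata):

    import numpy as np
    import pandas as pd

    x = np.ones(100000, dtype=int)   # all values tied, the worst case
    ranks = pd.Series(x).rank()      # average ranks, like rankdata
    print(ranks.values[:3])          # every rank comes out as 50000.5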

- Wes

>
>>
>> On Fri, Feb 15, 2013 at 10:22 AM, Warren Weckesser
>> <warren.weckesser at gmail.com> wrote:
>>>
>>>
>>> On Fri, Feb 15, 2013 at 10:32 AM, Warren Weckesser
>>> <warren.weckesser at gmail.com> wrote:
>>>>
>>>> On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>>>> > The results I'm getting from rankdata seem completely wrong for large
>>>> > datasets. I'll illustrate with a case where all data are equal, so
>>>> > every rank should be len(data) / 2 + 0.5.
>>>> >
>>>> > In [220]: rankdata(np.ones((10000,), dtype=np.int))
>>>> > Out[220]: array([ 5000.5,  5000.5,  5000.5, ...,  5000.5,  5000.5,
>>>> >         5000.5])
>>>> >
>>>> > In [221]: rankdata(np.ones((100000,), dtype=np.int))
>>>> > Out[221]:
>>>> > array([ 7050.82704,  7050.82704,  7050.82704, ...,  7050.82704,
>>>> >         7050.82704,  7050.82704])
>>>> >
>>>> > In [222]: rankdata(np.ones((1000000,), dtype=np.int))
>>>> > Out[222]:
>>>> > array([ 1784.293664,  1784.293664,  1784.293664, ...,  1784.293664,
>>>> >         1784.293664,  1784.293664])
>>>> >
>>>> > In [223]: scipy.__version__
>>>> > Out[223]: '0.11.0'
>>>> >
>>>> > In [224]: numpy.__version__
>>>> > Out[224]: '1.6.1'
>>>> >
>>>> >
>>>> > The results are completely off for N>10000 or so. Am I doing something
>>>> > wrong?
>>>>
>>>>
>>>> Looks like a bug.  The code that accumulates the ranks of the tied
>>>> values uses a 32-bit integer for the sum of the ranks, and that sum
>>>> is overflowing.  I'll see if I can get this fixed for the imminent
>>>> release of 0.12.
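>>>>
>>>> For example, with n = 100000 the sum of all the ranks is
>>>> n*(n + 1)/2 = 5000050000, which doesn't fit in 32 bits; wrapping
>>>> around at 2**32 leaves 705082704, and 705082704 / 100000 is exactly
>>>> the 7050.82704 in the output above.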
>>>>
>>>> Warren
>>>>
>>>
>>>
>>> A pull request with the fix is here:
>>> https://github.com/scipy/scipy/pull/436
>>>
>>>
>>> Warren
>>>