[Numpy-discussion] Coverting ranks to a Gaussian

Mon Jun 9 23:09:26 EDT 2008

On Mon, Jun 9, 2008 at 7:35 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Mon, Jun 9, 2008 at 21:06, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>> On Mon, Jun 9, 2008 at 18:34, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> Does anyone have a function that converts ranks into a Gaussian?
>>>>
>>>> I have an array x:
>>>>
>>>>>> import numpy as np
>>>>>> x = np.random.rand(5)
>>>>
>>>> I rank it:
>>>>
>>>>>> x = x.argsort().argsort()
>>>>>> x_ranked = x.argsort().argsort()
>>>>>> x_ranked
>>>>   array([3, 1, 4, 2, 0])
>>>
>>> There are subtleties in computing ranks when ties are involved. Take a
>>> look at the implementation of scipy.stats.rankdata().
>>
>> Good point. I had to deal with ties and missing data. I bet
>> scipy.stats.rankdata() is faster than my implementation.
>
> Actually, it's pretty slow. I think there are opportunities to make it
> faster, but I haven't explored them.
>
>>>> I would like to convert the ranks to a Gaussian without using scipy.
>>>
>>> No dice. You are going to have to use scipy.special.ndtri somewhere. A
>>> basic transformation (off the top of my head, I have no idea if this
>>> is statistically meaningful):
>>>
>>>  scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0))
>>>
>>> Barring tied first or last items, this should give equal weight to
>>> each of the tails outside of the range of your data.
>>
>> Nice. Thank you. It passes the never wrong chi-by-eye test:
>>
>> r = np.arange(1000)
>> g = special.ndtri((r + 1.0) / (len(r) + 1.0))
>> pylab.hist(g, 50)
>> pylab.show()
>
> BTW, what is your use case? If you are trying to compare your data to
> the normal distribution, you might want to go the other way: find the
> empirical quantiles of your data and compare them against the
> hypothetical quantiles of the data on the distribution. This *is* an
> established statistical technique; when you graph one versus the
> other, you get what is known as a Q-Q plot.

Some of the techniques I use are sensitive to noise. So to make things
more robust I sometimes transform my noisy, sort-of-normal data.
There's no theory to guide me, only empirical results.