[Numpy-discussion] Converting ranks to a Gaussian

Keith Goodman kwgoodman at gmail.com
Tue Jun 10 09:17:01 EDT 2008


On Tue, Jun 10, 2008 at 12:56 AM, Anne Archibald
<peridot.faceted at gmail.com> wrote:
> 2008/6/9 Keith Goodman <kwgoodman at gmail.com>:
>> Does anyone have a function that converts ranks into a Gaussian?
>>
>> I have an array x:
>>
>>>> import numpy as np
>>>> x = np.random.rand(5)
>>
>> I rank it:
>>
>>>> x = x.argsort().argsort()
>>>> x_ranked = x.argsort().argsort()
>>>> x_ranked
>>   array([3, 1, 4, 2, 0])
>>
>> I would like to convert the ranks to a Gaussian without using scipy.
>> So instead of the equal distance between ranks in array x, I would
>> like the distance between them to follow a Gaussian distribution.
>>
>> How far out in the tails of the Gaussian should 0 and N-1 (N=5 in the
>> example above) be? Ideally, or arbitrarily, the areas under the
>> Gaussian to the left of 0 (and the right of N-1) should be 1/N or
>> 1/2N. Something like that. Or a fixed value is good too.
>
> I'm actually not clear on what you need.
>
> If what you need is for rank i of N to be the 100*i/N th percentile in
> a Gaussian distribution, then you should indeed use scipy's functions
> to accomplish that; I'd use scipy.stats.norm.ppf().
>
> Of course, if your points were drawn from a Gaussian distribution,
> they wouldn't be exactly 1/N apart, there would be some distribution.
> Quite what the distribution of (say) the maximum or the median of N
> points drawn from a Gaussian is, I can't say, though people have
> looked at it. But if you want "typical" values, just generate N points
> from a Gaussian and sort them:
>
> V = np.random.randn(N)
> V = np.sort(V)
>
> return V[ranks]
>
> Of course they will be different every time, but the distribution will be right.
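
For reference, a minimal runnable sketch of the sort-a-random-sample
idea quoted above. The function name ranks_to_gaussian_sample and the
seed argument are made up for illustration; they are not from the
original message.

```python
import numpy as np

def ranks_to_gaussian_sample(ranks, seed=None):
    """Map 0-based ranks to the order statistics of one random Gaussian sample."""
    ranks = np.asarray(ranks)
    rng = np.random.default_rng(seed)
    # Draw N standard-normal values and sort them, so v[0] is the smallest.
    v = np.sort(rng.standard_normal(ranks.size))
    # The element with rank i receives the i-th smallest Gaussian value.
    return v[ranks]

ranks = np.array([3, 1, 4, 2, 0])
values = ranks_to_gaussian_sample(ranks, seed=0)
```

The output keeps the ordering of the ranks, but the actual values
change from call to call unless the seed is fixed.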

I guess I botched the description of my problem.

I have data that contains outliers and other noise. I am trying
various transformations of the data to preprocess it before plugging
it into my prediction algorithm. One such transformation is to rank
the data and then convert that rank to a Gaussian. The particular
details of the transformation don't matter. I just want something
smooth and normal-like.
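
One possible way to do this without scipy, sketched under a few
assumptions: rank i of N is mapped to the probability (i + 0.5)/N, so
the area to the left of rank 0 (and to the right of rank N-1) is 1/2N,
and the inverse normal CDF comes from statistics.NormalDist in the
Python standard library (Python 3.8 or later). The name rank_gauss is
hypothetical, and ties are broken arbitrarily by argsort.

```python
from statistics import NormalDist

import numpy as np

def rank_gauss(x):
    """Replace each value by the standard-normal quantile of its rank."""
    x = np.asarray(x)
    n = x.size
    # 0-based ranks, as in the double-argsort idiom used above.
    ranks = x.argsort().argsort()
    # Centre each rank in its 1/N-wide probability bin: tail area is 1/(2N).
    p = (ranks + 0.5) / n
    inv_cdf = NormalDist().inv_cdf
    # inv_cdf works on scalars, so apply it element by element.
    return np.array([inv_cdf(pi) for pi in p.tolist()])

x = np.array([0.7, 0.2, 0.9, 0.5, 0.1])  # ranks are [3, 1, 4, 2, 0]
z = rank_gauss(x)
```

The result is deterministic and symmetric about zero, and the extreme
values move further into the tails as N grows.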

> Anne
> P.S. why the "no scipy" restriction? it's a bit unreasonable. -A

I'd rather not pull in a scipy dependency for one function if there is
a numpy alternative. I think it is funny that you picked up on my
brief mention of scipy and called it unreasonable.


