numpy/scipy: correlation

Sun Nov 12 16:46:25 EST 2006

On 11/12/06, robert <no-spam at no-spam-no-spam.invalid> wrote:
> Robert Kern wrote:
> > robert wrote:

(...)
> One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
> More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not?  Thus lower coefs require naturally a coef-err to be useful in practice.
>

(...)

> I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.
>
(...)
>
> Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.
>
>

There is a transformation of the correlation coefficient that is
distributed as a t-statistic under the null. This was derived a long
way back, and is the usual, standard, way to test for significance of
(attach a p-value to) correlation coefficients.  I do not recall the
formula from top of my head, but I am sure any google search will find
it. And of course, any decent intro stats textbook will also provide
it. You can look in your nearby library for popular textbooks such as
Snedecor & Cochran, or Sokal & Rohlf, or the wonderful "Beyond Anova"
by Miller, which do have the expression.

Since this is such a common procedure all stats programs do provide
for tests of significance of correlation coefficients. A whole bunch
of them are free, and one is widely used: R (check
http://cran.r-project.org). This is an easy way for you to test the
output of your stats code in Python.

Now, since you can do a significance test, you can invert the
procedure and obtain a confidence interval (of whichever width you
want) for your correlation coefficients. This CI will (almost always)
be assymmetric. (And recall that CI do have a not-obvious
interpretation). CI of correlation coefficients are not as common as
p-values, but this is doable too.

The expression (and "improved versions" of it, I think Sokal & Rohlf
show several) is easy to compute. And I bet obtaining the density of a
t-distribution with k d.f. is also available already for Python. So
this part is definitely doable.

Best,

R.

(...)
> Now think of adding 'boring data':
>
> >>> X=[1.,2,3,4]
> >>> Y=[1.,2,3,5]
> >>> sd.correlation((X,Y))     # my old func
> (1.3, -0.5, 0.982707629824)   # m,o,coef
> >>> numpy.corrcoef((X,Y))
> array([[ 1.        ,  0.98270763],
>        [ 0.98270763,  1.        ]])
> >>> XX=[1.,1,1,1,1,2,3,4]
> >>> YY=[1.,1,1,1,1,2,3,5]
> >>> sd.correlation((XX,YY))
> (1.23684210526, -0.289473684211, 0.988433774639)
> >>>
>
> I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...
>
> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
> says:
>   In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.
http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y
> tells:
>
> probable error of r = 0.6745*(1-r**2)/sqrt(N)
>
> A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from
>
> 0.0115628571429 to 0.00548453410954 !
>
> And the absolute size of this error measure seems to be too low for just 4 points of data!
>
> The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
>
>
> Robert
>
>
>
> PS:
>
> my old func is simply hands-on based on
> n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
> Guess its already fast for large data?
>
>
> Note: numpy.corrcoef strikes on 2 points:
> >>> numpy.corrcoef(([1,2],[1,2]))
> array([[          -1.#IND,           -1.#IND],
>        [          -1.#IND,           -1.#IND]])
> >>> sd.correlation(([1,2],[1,2]))
> (1, 0, 1.0)
> >>>
> >>> numpy.corrcoef(([1,2,3],[1,2,3]))
> array([[ 1.,  1.],
>        [ 1.,  1.]])
> >>> sd.correlation(([1,2,3],[1,2,3]))
> (1, 0, 1.0)
>
>
> PPS:
>
> A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
> --
> http://mail.python.org/mailman/listinfo/python-list
>

-- 
Ramon Diaz-Uriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz