numpy/scipy: correlation

robert no-spam at no-spam-no-spam.invalid
Sun Nov 12 09:52:45 EST 2006


Robert Kern wrote:
> robert wrote:
>> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast: 
>> m, m-err, o, o-err, r-coef,r-coef-err ?
> 
> And of course, those three parameters are not particularly meaningful together.
> If your model is truly "y is a linear response given x with normal noise" then
> "y=m*x+o" is correct, and all of the information that you can get from the data
> will be found in the estimates of m and o and the covariance matrix of the
> estimates.
> 
> On the other hand, if your model is that "(x, y) is distributed as a bivariate
> normal distribution" then "y=m*x+o" is not a particularly good representation of
> the model. You should instead estimate the mean vector and covariance matrix of
> (x, y). Your correlation coefficient will be the off-diagonal term after
> dividing out the marginal standard deviations.
> 
> The difference between the two models is that the first places no restrictions
> on the distribution of x. The second does; both the x and y marginal
> distributions need to be normal. Under the first model, the correlation
> coefficient has no meaning.
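
(For reference, in numpy that second recipe is just the following - a minimal sketch, using the same example data as below:)

import numpy

X = [1., 2, 3, 4]
Y = [1., 2, 3, 5]
C = numpy.cov(X, Y)                         # 2x2 covariance matrix of (x, y)
r = C[0, 1] / numpy.sqrt(C[0, 0]*C[1, 1])   # off-diagonal / marginal stds
print(r)                                    # -> 0.98270762..., same as numpy.corrcoef((X, Y))[0, 1]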

I think the difference is small in practice - when you head for usable diagonals.
Looking at the bivariate coefficient first, before going on to any models, seems to be a more stable approach for the first step in data mining (before you proceed to a model or to class-learning...).

Basically the first need is to analyse lots of x,y data and check it for linear dependencies - no real model so far. For that I need a quality measure (coef**2) and a measure of how much I can rely on it (coef-err); the coef alone is not enough, since you get a perfect 1.0 with just 2 (or 3 - see below) points.
With big coefs and lots of well-spread data the coef is very good by itself - its error range err(N) falls off only approximately as ~ 1/sqrt(N).

One would expect the error range to drop simply with the number of points. Yet it depends in a more complex way on the magnitude of the coef and on the distribution as a whole.
More interesting are the real-world cases: for example, I see a low correlation on lots of points - maybe coef=0.05. Is it real - or not? Low coefs thus naturally require a coef-err to be useful in practice.

Now think of adding 'boring data':

>>> X=[1.,2,3,4]
>>> Y=[1.,2,3,5]
>>> sd.correlation((X,Y))     # my old func 
(1.3, -0.5, 0.982707629824)   # m,o,coef
>>> numpy.corrcoef((X,Y))
array([[ 1.        ,  0.98270763],
       [ 0.98270763,  1.        ]])
>>> XX=[1.,1,1,1,1,2,3,4]
>>> YY=[1.,1,1,1,1,2,3,5]
>>> sd.correlation((XX,YY))
(1.23684210526, -0.289473684211, 0.988433774639)
>>> 

I'd expect the little increase of r to be ok, but this 'boring data' should not make the error go down simply as ~1/sqrt(N) ...

I remember once seeing a formula somewhere for the error range of the corrcoef, but I cannot find it anymore.

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
says:
  In MATLAB, corr(X) calculates Pearson's correlation coefficient along with the p-value.

Does anybody know how this p-value is computed/motivated? Such a thing would be very helpful for numpy/scipy too.
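
As far as I understand, the usual recipe - and, I believe, what scipy.stats.pearsonr does - is to test r against the null hypothesis of zero correlation: under that null (and normal data), t = r*sqrt((N-2)/(1-r**2)) follows a Student t distribution with N-2 degrees of freedom. A minimal sketch, assuming scipy.stats is available:

import numpy
from scipy import stats

def pearson_r_p(x, y):
    # Pearson r plus a two-sided p-value for the null hypothesis r == 0
    x = numpy.asarray(x, float)
    y = numpy.asarray(y, float)
    n = len(x)
    r = numpy.corrcoef(x, y)[0, 1]
    t = r * numpy.sqrt((n - 2) / (1.0 - r*r))  # t-statistic, N-2 dof
    p = 2 * stats.t.sf(abs(t), n - 2)          # two-sided tail probability
    return r, p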

http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y
gives:

probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - roughly the N-only dependence I expected above. But then it is not sensitive to the above considerations about 'boring' data: with the example above it would report a decrease of this probable coef-err from

0.0115628571429 to 0.00548453410954 !

And the absolute size of this error measure seems too low for just 4 points of data!
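
For reference, a quick check of that formula on the data above (plain Python, nothing beyond math):

from math import sqrt

def probable_error(r, n):
    # classical 'probable error' of Pearson's r: 0.6745*(1 - r**2)/sqrt(N)
    return 0.6745 * (1.0 - r*r) / sqrt(n)

print(probable_error(0.982707629824, 4))   # -> 0.0115628571...
print(probable_error(0.988433774639, 8))   # -> 0.0054845341...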

The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...


Robert



PS:

my old func is simply hands-on, based on
n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
I guess it's already fast for large data?
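
For completeness, a minimal sketch of how such a sums-based fit could look (the (m, o, coef) return convention just mirrors the sd.correlation calls above; sd is my own module, not part of numpy/scipy):

import numpy

def correlation(xy):
    # least-squares line y = m*x + o plus Pearson's r, from running sums
    vx = numpy.asarray(xy[0], float)
    vy = numpy.asarray(xy[1], float)
    n, sum_x, sum_y = len(vx), vx.sum(), vy.sum()
    sum_xy, sum_xx, sum_yy = (vx*vy).sum(), (vx*vx).sum(), (vy*vy).sum()
    sxx = sum_xx - sum_x*sum_x/n      # n * var(x)
    syy = sum_yy - sum_y*sum_y/n      # n * var(y)
    sxy = sum_xy - sum_x*sum_y/n      # n * cov(x, y)
    m = sxy / sxx                     # slope
    o = (sum_y - m*sum_x) / n         # offset
    coef = sxy / numpy.sqrt(sxx*syy)  # Pearson's r
    return m, o, coef

print(correlation(([1., 2, 3, 4], [1., 2, 3, 5])))  # -> (1.3, -0.5, 0.98270762...)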


Note: numpy.corrcoef chokes on 2 points (the -1.#IND's below are Windows' rendering of NaN):
>>> numpy.corrcoef(([1,2],[1,2]))
array([[          -1.#IND,           -1.#IND],
       [          -1.#IND,           -1.#IND]])
>>> sd.correlation(([1,2],[1,2]))
(1, 0, 1.0)
>>> 
>>> numpy.corrcoef(([1,2,3],[1,2,3]))
array([[ 1.,  1.],
       [ 1.,  1.]])
>>> sd.correlation(([1,2,3],[1,2,3]))
(1, 0, 1.0)


PPS:

A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. I think many users currently suffer when trying to get started with the most recent, mutually incompatible scipy and numpy libs.


