numpy/scipy: correlation

sturlamolden sturlamolden at yahoo.no
Sun Nov 12 18:16:25 EST 2006


robert wrote:

> > t = r * sqrt( (n-2)/(1-r**2) )

> yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big)?
> that same "const. * (1-r**2)/sqrt(n)" which I found in that other document?

I gave you the formula. Solve it for r and you get the confidence interval.
You will need to use the inverse cumulative Student t distribution.
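For example, here is a minimal sketch of that inversion (assuming SciPy is
installed; scipy.stats.t.ppf is the inverse cumulative Student t). Solving
t = r * sqrt( (n-2)/(1-r**2) ) for r gives r = t / sqrt(n - 2 + t**2), and
plugging in the critical t value gives the smallest |r| that is significant
at a given level:

from numpy import sqrt
from scipy.stats import t as student_t

def critical_r(n, alpha=0.05):
    # two-sided critical t value with n-2 degrees of freedom
    tc = student_t.ppf(1.0 - alpha/2.0, n - 2)
    # invert t = r*sqrt((n-2)/(1-r**2)) for r
    return tc / sqrt(n - 2 + tc**2)

For n = 100 and alpha = 0.05 this comes out at roughly 0.20.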

Another quick-and-dirty solution is to use bootstrapping.

from numpy import mean, std, sum, sqrt, sort
from numpy.random import randint

def bootstrap_correlation(x, y):
    # draw 1000 bootstrap samples of indices, with replacement
    idx = randint(len(x), size=(1000, len(x)))
    bx = x[idx]  # resamples x with replacement
    by = y[idx]  # resamples y with replacement
    mx = mean(bx, 1)
    my = mean(by, 1)
    sx = std(bx, 1)  # numpy's std divides by n, so use n in the denominator below
    sy = std(by, 1)
    # Pearson r for each bootstrap sample, sorted
    r = sort(sum((bx - mx.repeat(len(x), 0).reshape(bx.shape)) *
                 (by - my.repeat(len(y), 0).reshape(by.shape)), 1) /
             (len(x)*sx*sy))
    # 95% bootstrap confidence interval (NB! biased)
    return (r[25], r[975])
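A hypothetical usage example (the data below is made up just to show the
call; any pair of equal-length 1-d arrays will do):

from numpy.random import randn

x = randn(1000)            # made-up sample
y = 0.5*x + randn(1000)    # correlated with x by construction

lo, hi = bootstrap_correlation(x, y)
print("95% bootstrap CI for r: (%.3f, %.3f)" % (lo, hi))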


> My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line.

The bootstrap is "non-parametric" in the sense that it is distribution-free.
