numpy/scipy: correlation

Sun Nov 12 18:41:39 EST 2006

While I am at it, lets add the bootstrap estimate of the standard error
as well.

from numpy import mean, std, sum, sqrt, sort, corrcoef, tanh, arctanh
from numpy.random import randint

def bootstrap_correlation(x,y):
   idx = randint(len(x),size=(1000,len(x)))
   bx = x[idx]
   by = y[idx]
   mx = mean(bx,1)
   my = mean(by,1)
   sx = std(bx,1)
   sy = std(by,1)
   r = sort(sum( (bx - mx.repeat(len(x),0).reshape(bx.shape)) *
       (by - my.repeat(len(y),0).reshape(by.shape)), 1)
        / ((len(x)-1)*sx*sy))
   #bootstrap confidence interval (NB! biased)
   conf_interval = (r[25],r[975])
   #bootstrap standard error using Fisher's z-transform  (NB! biased)
   std_err = tanh(std(arctanh(r))*(len(r)/(len(r)-1.0)))
   return (std_err, conf_interval)

Since we are not using bias corrected and accelerated bootstrap, the
standard error are more meaningful, although both estimates are biased.

I said that the boostrap is "distribution free". That is not quite the
case either. The assumed distribution is the provided sample
distribution, i.e. a sum of Dirac delta functions. Often this is not a
very sound assumption, and the bootstrap must then be improved e.g.
using the BCA procedure. That is particularly important for confidence
intervals, but not so much for standard errors. However for a
quick-and-dirty standard error estimate, we can just ignore the small
bias.

Now, if you wonder why I used the standard deviation to obtain the
standard error (without dividing by sqrt(n) ), the answer is that the
standard error is the standard deviation of an estimator. As we have
1000 samples from the correlation's sampling distribution in r, we get
the "standard error of the correlation" by taking the standard
deviation of r. The z-transform is required before we can compute mean
and standard deviations of correlations.