numpy/scipy: correlation

Robert Kern robert.kern at gmail.com
Sun Nov 12 18:10:18 EST 2006


robert wrote:

> I remember once I saw somewhere a formula for an error range of the corrcoef, but I cannot find it anymore.

There is no such thing as "a formula for an error range" in a vacuum like that.
Each formula has a model attached to it. If your data does not follow that
model, then any such estimate of the error of your estimate is meaningless.
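To make "a formula with a model attached" concrete, here is a minimal sketch of one common choice, the Fisher z-transformation interval. This is an illustration only, not necessarily the formula you saw: it assumes the points are independent draws from a bivariate normal distribution, and the function name fisher_z_interval and its parameters are made up for this example.

```python
import numpy as np
from scipy import stats


def fisher_z_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation coefficient.

    Hypothetical helper. Only meaningful if the n points are independent
    samples from a bivariate normal distribution.
    """
    if n <= 3:
        raise ValueError("need more than 3 points")
    z = np.arctanh(r)                 # Fisher transform of r
    se = 1.0 / np.sqrt(n - 3)         # approximate standard error of z
    half = stats.norm.ppf(0.5 + confidence / 2.0) * se
    # Transform the interval endpoints back to the correlation scale.
    return np.tanh(z - half), np.tanh(z + half)
```

If the data do not follow that model, the interval this returns tells you nothing.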

sturlamolden pointed out to me that I was wrong in thinking that the correlation
coefficient was meaningless in the linear regression case. However, the error
estimate that is used for bivariate normal correlation coefficients will almost
certainly not apply to "correlation coefficient qua square root of the fraction
of y's variance explained by the linear model". In that context, the
"correlation coefficient" is not an estimate of some parameter. It has no
uncertainty attached to it.

> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
> says:
>   In MATLAB, corr(X) calculates Pearson's correlation coefficient along with p-value.
> 
> Does anybody know how this p-value is computed/motivated? Such a thing would be very helpful for numpy/scipy too.

scipy.stats.pearsonr()

As with all such frequentist hypothesis testing nonsense, one takes the null
hypothesis (in this case, a bivariate normal distribution of points with 0
correlation), finds the distribution of the test statistic given the number of
points sampled, and then finds the probability of getting a test statistic "at
least as extreme" as the test statistic you actually got.
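The recipe above can be sketched by hand. Under the null hypothesis of uncorrelated bivariate normal data, the statistic t = r*sqrt((n-2)/(1-r**2)) follows a Student t distribution with n-2 degrees of freedom, and the two-sided p-value is the tail probability beyond |t|. The function name pearson_p_value is made up for this example; it is a sketch of the idea, not a copy of what scipy does internally.

```python
import numpy as np
from scipy import stats


def pearson_p_value(x, y):
    """Return (r, two-sided p-value) under the bivariate normal null.

    Hypothetical helper for illustration. Breaks down when |r| == 1,
    because the t statistic is then infinite.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    r = np.corrcoef(x, y)[0, 1]
    df = n - 2
    # Test statistic: Student t with n-2 degrees of freedom under the null.
    t = r * np.sqrt(df / (1.0 - r * r))
    # "At least as extreme" in either direction: double the upper tail.
    p = 2.0 * stats.t.sf(abs(t), df)
    return r, p
```

On data that is not perfectly correlated, this should agree with scipy.stats.pearsonr() up to rounding.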

> my old func is simply hands-on based on 
> n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
> I guess it's already fast for large data?

You really want to use a 2-pass algorithm to avoid numerical problems.
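A minimal sketch of what a 2-pass version could look like, reusing the vx and vy names from your snippet (the function name corr_two_pass is made up here). The first pass computes the means; the second accumulates sums of the centered values. The one-pass sums can lose most of their significant digits to cancellation when the means are large compared to the spread of the data.

```python
import numpy as np


def corr_two_pass(vx, vy):
    """Pearson correlation coefficient via a two-pass computation.

    Hypothetical helper for illustration.
    """
    vx = np.asarray(vx, dtype=float)
    vy = np.asarray(vy, dtype=float)
    # Pass 1: compute the means and center the data.
    dx = vx - vx.mean()
    dy = vy - vy.mean()
    # Pass 2: sums of products of the centered values.
    sxy = (dx * dy).sum()
    sxx = (dx * dx).sum()
    syy = (dy * dy).sum()
    return sxy / np.sqrt(sxx * syy)
```

This is still fully vectorized, so the cost of the extra pass over the data should be small.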

> Note: numpy.corrcoef strikes on 2 points:
> >>> numpy.corrcoef(([1,2],[1,2]))
> array([[          -1.#IND,           -1.#IND],
>        [          -1.#IND,           -1.#IND]])

This was fixed in SVN.

> PPS:
> 
> A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. I think many users currently suffer when trying to get started with incompatible most-recent versions of scipy and numpy.

Well, go ahead and poke the people that said they would have a Windows binary
ready. Or provide one yourself.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



