[SciPy-User] Strange behaviour from corrcoef when calculating correlation-matrix in SciPy/NumPy.
Raj
rajeev.raizada at gmail.com
Thu Mar 3 18:34:32 EST 2011
On Mar 3, 3:18 pm, eat <e.antero.ta... at gmail.com> wrote:
> So perhaps there does not exist any really simple and straightforward
> translation
> (of corrcoef) from matlab to numpy? Just as an example; how would you
> implement case %(3 properly with numpy?
> Regards,
> eat
Matlab also embodies some confusion
on this front: it turns out that Matlab has
two different functions for computing correlation!
One is corr(), which is in the Matlab Stats Toolbox.
This is the one that I have always used,
and it is better-behaved, in my opinion,
as I argue below.
The other Matlab function is corrcoef().
This is not a Stats Toolbox function; it's in the main code base.
I didn't even know that this function existed until this thread! :-)
In my view, the Matlab function corr() is the one to emulate.
It has the very desirable property that corr(m,m) and corr(m)
are the same.
Also, its behaviour when correlating two different matrices
is very reasonable:
http://www.mathworks.com/help/toolbox/stats/corr.html
RHO = corr(X,Y) returns a p1-by-p2 matrix containing the pairwise
correlation coefficient between each pair of columns in the n-by-p1
and n-by-p2 matrices X and Y.
>> m1 = [ 1 2; -1 3; 0 4]
m1 =
1 2
-1 3
0 4
>> corr(m1)
ans =
1.0000 -0.5000
-0.5000 1.0000
>> corr(m1,m1)
ans =
1.0000 -0.5000
-0.5000 1.0000
>> m2 = [ -1 1; 2 -1; -1 3]
m2 =
-1 1
2 -1
-1 3
>> corr(m1,m2)
ans =
-0.8660 0.5000
0 0.5000
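For comparison, here is a minimal NumPy sketch of that column-by-column behaviour. The function name corr and its signature are my own choice for illustration, not an existing NumPy or SciPy API. It assumes both inputs are 2-D with the same number of rows and that no column is constant.

```python
import numpy as np

def corr(x, y=None):
    """Pairwise Pearson correlation between the columns of x and y.

    Hypothetical helper mimicking Matlab's corr() / R's cor():
    returns a p1-by-p2 matrix for n-by-p1 and n-by-p2 inputs.
    """
    x = np.asarray(x, dtype=float)
    y = x if y is None else np.asarray(y, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    # Centre each column, then divide the cross-products by the column norms.
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    x_norm = np.sqrt((xc ** 2).sum(axis=0))
    y_norm = np.sqrt((yc ** 2).sum(axis=0))
    return (xc.T @ yc) / np.outer(x_norm, y_norm)

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

print(corr(m1))        # same as corr(m1, m1)
print(corr(m1, m2))    # 2-by-2 cross-correlation of the columns
```

With the m1 and m2 from the Matlab session, corr(m1) and corr(m1, m1) should agree, and corr(m1, m2) should match the Matlab output above to four decimal places.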
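In contrast, the Matlab corrcoef() does weird things when given two matrices
(judging by the output below, it treats each matrix as one long vector),
and is almost as bad as the SciPy/NumPy corrcoef() function in that regard.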
>> corrcoef(m1)
ans =
1.0000 -0.5000
-0.5000 1.0000
>> corrcoef(m1,m1)
ans =
1 1
1 1
>> corrcoef(m1,m2)
ans =
1.0000 0.2125
0.2125 1.0000
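The numbers above can be reproduced in NumPy, which suggests what is going on. This is only a sketch: the flattening explanation is inferred from the outputs matching, and the variable names are mine. It also shows what numpy.corrcoef itself does with two matrices, with and without rowvar=False.

```python
import numpy as np

m1 = np.array([[1, 2], [-1, 3], [0, 4]], dtype=float)
m2 = np.array([[-1, 1], [2, -1], [-1, 3]], dtype=float)

# Flatten each matrix column-major, like Matlab's m(:), and correlate
# the two resulting 6-element vectors.
flat = np.corrcoef(m1.ravel(order="F"), m2.ravel(order="F"))
print(flat.round(4))          # off-diagonal entries are 0.2125

# Default numpy.corrcoef: every ROW of m1 and m2 is a variable,
# so six variables with two observations each.
default = np.corrcoef(m1, m2)
print(default.shape)          # (6, 6)

# rowvar=False treats columns as variables; the result is 4-by-4, and its
# upper-right 2-by-2 block holds the m1-versus-m2 column correlations.
by_columns = np.corrcoef(m1, m2, rowvar=False)
cross = by_columns[:2, 2:]
print(cross.round(4))
```

So the corr()-style answer is recoverable from numpy.corrcoef, but only by passing rowvar=False and slicing out the off-diagonal block by hand.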
So, if anything in Matlab is to be taken as a role-model,
I would advocate for the Stats Toolbox function corr().
Another argument for this corr() behavior is that
the R function cor() behaves the same way.
I guess R is the gold standard for stats computing.
Here are the above operations in R:
> m1 <- matrix(c(1, -1, 0, 2, 3, 4),nrow=3)
> m1
[,1] [,2]
[1,] 1 2
[2,] -1 3
[3,] 0 4
> m2 <- matrix(c(-1, 2, -1, 1, -1, 3),nrow=3)
> m2
[,1] [,2]
[1,] -1 1
[2,] 2 -1
[3,] -1 3
> cor(m1)
[,1] [,2]
[1,] 1.0 -0.5
[2,] -0.5 1.0
> cor(m1,m1)
[,1] [,2]
[1,] 1.0 -0.5
[2,] -0.5 1.0
> cor(m1,m2)
[,1] [,2]
[1,] -0.8660254 0.5
[2,] 0.0000000 0.5
In summary, let's copy R's cor() and Matlab's corr(),
not Matlab's corrcoef().
Raj