[SciPy-User] Strange behaviour from corrcoef when calculating correlation-matrix in SciPy/NumPy.

Thu Mar 3 18:34:32 EST 2011

On Mar 3, 3:18 pm, eat <e.antero.ta... at gmail.com> wrote:
> So perhaps there does not exist any really simple and straightforward
> translation
> (of corrcoef) from matlab to numpy? Just as an example; how would you
> implement case %(3 properly  with numpy?
> Regards,
> eat

It turns out that Matlab is itself a source of confusion
on this front, because it has
two different functions for computing correlation!

One is corr(), which is in the Matlab Stats Toolbox.
This is the one that I have always used,
and it is better-behaved, in my opinion,
as I argue below.

The other Matlab function is corrcoef().
It is not part of the Stats Toolbox; it's in the main code base.
I didn't even know that this function existed until this thread!  :-)

In my view, the Matlab function corr() is the one to emulate.
It has the very desirable property that corr(m,m) and corr(m)
are the same.

Also, its behaviour when correlating two different matrices
is very reasonable:
http://www.mathworks.com/help/toolbox/stats/corr.html
RHO = corr(X,Y) returns a p1-by-p2 matrix containing the pairwise
correlation coefficient between each pair of columns in the n-by-p1
and n-by-p2 matrices X and Y.

>> m1 = [ 1 2; -1 3; 0 4]
m1 =
     1     2
    -1     3
     0     4

>> corr(m1)
ans =
    1.0000   -0.5000
   -0.5000    1.0000

>> corr(m1,m1)
ans =
    1.0000   -0.5000
   -0.5000    1.0000

>> m2 = [ -1 1; 2 -1; -1 3]
m2 =
    -1     1
     2    -1
    -1     3

>> corr(m1,m2)
ans =
   -0.8660    0.5000
         0    0.5000
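
To answer eat's question about doing this with numpy, here is a minimal
sketch of a corr()-like function. The name corr and its signature are my
own choice for illustration, not an existing NumPy or SciPy function; it
assumes plain Pearson correlation with columns as variables and rows as
observations, and it does no handling of NaNs or constant columns.

```python
import numpy as np

def corr(x, y=None):
    """Pearson correlation between each column of x and each column of y.

    Hypothetical helper mimicking Matlab's corr(X, Y): returns a
    p1-by-p2 matrix for n-by-p1 and n-by-p2 inputs. With y omitted,
    the columns of x are correlated with each other.
    """
    x = np.asarray(x, dtype=float)
    y = x if y is None else np.asarray(y, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    # Centre each column on its own mean
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Euclidean norm of each centred column
    xn = np.sqrt((xc ** 2).sum(axis=0))
    yn = np.sqrt((yc ** 2).sum(axis=0))
    # Cross-products of centred columns, scaled by the norms
    return (xc.T @ yc) / np.outer(xn, yn)

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

print(corr(m1))       # same as corr(m1, m1)
print(corr(m1, m2))   # 2-by-2 cross-correlation of columns
```

With the m1 and m2 above, this should agree with the Matlab corr()
outputs up to rounding, and corr(m1) and corr(m1, m1) come out the same.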

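In contrast, the Matlab corrcoef() does weird things when given
two matrices: judging by the outputs below, it treats each matrix
as one long vector (like m(:)) and returns a 2-by-2 matrix.
It is almost as bad as the SciPy corrcoef() function in that regard.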

>> corrcoef(m1)
ans =
    1.0000   -0.5000
   -0.5000    1.0000

>> corrcoef(m1,m1)
ans =
     1     1
     1     1

>> corrcoef(m1,m2)
ans =
    1.0000    0.2125
    0.2125    1.0000
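
As a sanity check on where those numbers come from, here is a small
numpy sketch. It assumes that the two-argument Matlab corrcoef()
flattens each matrix into a single vector before correlating; that is
my reading of the outputs above, and the variable names are only for
illustration.

```python
import numpy as np

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

# Flatten in column-major order, the same order Matlab's m(:) uses.
# (Any order works, as long as both matrices are flattened the same way.)
v1 = m1.ravel(order="F")
v2 = m2.ravel(order="F")

# With two 1-D inputs, np.corrcoef returns a 2-by-2 matrix
r_same = np.corrcoef(v1, v1)   # a vector against itself: all ones
r_diff = np.corrcoef(v1, v2)   # off-diagonal is a single scalar correlation

print(r_same)
print(np.round(r_diff, 4))
```

The off-diagonal entry of r_diff rounds to 0.2125, so the per-column
structure of the two matrices is thrown away entirely.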

So, if anything in Matlab is to be taken as a role-model,
I would advocate for the Stats Toolbox function corr().

Another argument for this corr() behaviour is that
the R function cor() behaves the same way,
and R is arguably the gold standard for stats computing.

Here are the above operations in R:

> m1 <- matrix(c(1, -1, 0, 2, 3, 4),nrow=3)
> m1
     [,1] [,2]
[1,]    1    2
[2,]   -1    3
[3,]    0    4

> m2 <- matrix(c(-1, 2, -1, 1, -1, 3),nrow=3)
> m2
     [,1] [,2]
[1,]   -1    1
[2,]    2   -1
[3,]   -1    3

> cor(m1)
     [,1] [,2]
[1,]  1.0 -0.5
[2,] -0.5  1.0

> cor(m1,m1)
     [,1] [,2]
[1,]  1.0 -0.5
[2,] -0.5  1.0

> cor(m1,m2)
           [,1] [,2]
[1,] -0.8660254  0.5
[2,]  0.0000000  0.5
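
For what it's worth, the same cross-correlation block can be pulled out
of the existing numpy corrcoef() by passing rowvar=False and slicing.
This is only a sketch: the slicing relies on corrcoef() stacking the
columns of the second argument after those of the first, and the names
full, cross and p1 are mine.

```python
import numpy as np

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

p1 = m1.shape[1]  # number of columns (variables) in m1

# rowvar=False: columns are variables, rows are observations.
# The result covers all p1 + p2 columns, here a 4-by-4 matrix.
full = np.corrcoef(m1, m2, rowvar=False)

within = full[:p1, :p1]   # columns of m1 against each other, like cor(m1)
cross = full[:p1, p1:]    # columns of m1 against columns of m2, like cor(m1, m2)

print(np.round(within, 4))
print(np.round(cross, 4))
```

This computes more than is needed (the full 4-by-4 matrix), so it is
wasteful for wide matrices, but it shows the R answer is already in there.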

In summary, let's copy R's cor() and Matlab's corr(),
not Matlab's corrcoef().

Raj