[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Nathaniel Smith njs at pobox.com
Thu Sep 26 18:42:02 EDT 2013


On 26 Sep 2013 21:59, "Faraz Mirzaei" <fmmirzaei at gmail.com> wrote:
>
> Thanks Josef and Nathaniel for your responses.
>
> In the application that I have, I don't use the correlation coefficient
matrix as a whole (so I don't care if it is PSD or not). I simply read the
off-diagonal elements for pair-wise correlation coefficients. I use the
pairwise correlation coefficient to test if the data from various sources
(i.e., rows of the matrix), agree with each other when present.
>
> Right now, I use, ma.corrcoef( x[ i, :] , x[ j, :]) and read the
off-diagonal element in a loop over i and j. It is just a bit uglier than
calling ma.corrcoef(x).
>
> At least for my application, truncation to -1 or +1 (or scaling such that
the largest value becomes 1, etc.) is completely wrong, since it would imply
that the two sources completely agree with each other (up to a
minus sign), which may not be the case. For example, consider the first and
last rows of the example I provided:
>
> >>> print x_ma
> [[ 7 -4  -1  --  -3  -2]
>  [ 6 -3   0  4   0   5]
>  [-4  --   7  5  --   --]
>  [--   5   --  0   1  4]]
>
> >>> np.ma.corrcoef(x_ma)[0,3]
> -1.6813456149534147
>
>
> On the other hand, if we supply only the first and last rows to the
function, we get:
>
> >>> np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
> masked_array(data =
>  [[1.0 -0.240192230708]
>  [-0.240192230708 1.0]],
>              mask =
>  [[False False]
>  [False False]],
>        fill_value = 1e+20)
>
> Interestingly, this is the same as what pandas returns as the [3,0]
element of the correlation coefficient matrix, and it is equal to the
pairwise deletion result:
>
> >>> np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note that this is NOT np.ma.corrcoef
> array([[ 1.        , -0.24019223],
>        [-0.24019223,  1.        ]])
>
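As a rough illustration of where the values [-4, -3, -2] and [5, 1, 4] come from, the sketch below combines the masks of rows 0 and 3 and keeps only the positions observed in both. The values stored under the mask are placeholders (an assumption, since the thread only shows the masked printout), and the names row0, row3, and both are chosen for this example only.

```python
import numpy as np

# Rows 0 and 3 of the example. The data under the mask is a placeholder (0),
# since the original masked values are not shown in the thread.
row0 = np.ma.array([7, -4, -1, 0, -3, -2], mask=[0, 0, 0, 1, 0, 0])
row3 = np.ma.array([0, 5, 0, 0, 1, 4], mask=[1, 0, 1, 0, 0, 0])

# Pairwise deletion: keep only positions where BOTH rows are observed.
both = ~(np.ma.getmaskarray(row0) | np.ma.getmaskarray(row3))
a = row0.data[both]
b = row3.data[both]

# Plain (unmasked) correlation on the shared observations.
r = np.corrcoef(a, b)[0, 1]
print(a, b, r)
```

Only three columns survive for this pair, which is why the coefficient is based on so few samples.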
>
> Also, I don't know why the ma.corrcoef results Josef has mentioned are
different than mine. In particular, Josef reports element [2, 0] of the
ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked,
probably due to too few samples available). Josef: are you sure that you
have entered the example values correctly into python? Along the same
lines, the results that Nathaniel has posted from R are different, since
the input is not a masked matrix I guess (please note that in the original
example, I had masked values less than or equal to -5).

Yes, sorry, this is just a cut-and-paste error. In fact, the result I
posted is what R gives for the array with values <= -5 replaced by NA, but I
left that line out of the email.

I think the only difference is that R and pandas give a correlation of 1.0
when there are only 1 or 2 data points, and ma.corrcoef returns masked in
this case. Not sure which makes more sense.

>
> In any case, I think the correlation coefficient between two rows of a
matrix should not depend on what other rows are supplied. In other
words, np.ma.corrcoef(x_ma)[0,3] should be equal to
np.ma.corrcoef(x_ma[0,:], x_ma[3,:])[0,1] (which apparently happens to be
what pandas reports).
>
> This change would require recomputing the means for every pairwise
coefficient calculation, but since we already compute O(n^2) cross
products, the overall big-O complexity won't change.
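A minimal sketch of the pairwise approach described above, assuming a 2-D masked array with one source per row. It is not the NumPy implementation or the eventual fix. The function name pairwise_corrcoef and the min_obs parameter are hypothetical, and the values stored under the mask in x_ma are placeholders.

```python
import numpy as np

def pairwise_corrcoef(x, min_obs=2):
    """Correlate every pair of rows using only columns unmasked in both rows.

    Hypothetical helper for illustration. Pairs with fewer than `min_obs`
    shared observations are left masked.
    """
    x = np.ma.asarray(x)
    data = np.asarray(x.data, dtype=float)
    valid = ~np.ma.getmaskarray(x)
    n = data.shape[0]
    out = np.ma.masked_all((n, n), dtype=float)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]
            if both.sum() < min_obs:
                continue  # too few shared observations: leave masked
            a = data[i, both]
            b = data[j, both]
            # The means are recomputed for this pair only.
            da = a - a.mean()
            db = b - b.mean()
            denom = np.sqrt((da @ da) * (db @ db))
            if denom == 0:
                continue  # a constant row has no defined correlation
            out[i, j] = out[j, i] = (da @ db) / denom
    return out

# The example from this thread; masked positions hold a placeholder 0.
x_ma = np.ma.array(
    [[7, -4, -1, 0, -3, -2],
     [6, -3, 0, 4, 0, 5],
     [-4, 0, 7, 5, 0, 0],
     [0, 5, 0, 0, 1, 4]],
    mask=[[0, 0, 0, 1, 0, 0],
          [0, 0, 0, 0, 0, 0],
          [0, 1, 0, 0, 1, 1],
          [1, 0, 1, 0, 0, 0]],
)
r = pairwise_corrcoef(x_ma)
print(r)
```

Because each entry is an ordinary correlation on the shared columns, every unmasked value stays within [-1, 1], and entry [0, 3] does not depend on which other rows are supplied. Whether two shared points should yield a value or a masked entry is a policy choice, exposed here through min_obs.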
>
> And please don't remove this functionality. I will volunteer to fix it
however we decide :) We can just clarify the behavior in the documentation.

In the long run I prefer R's behaviour of requiring the user to specify
before skipping anything, but I tend to agree that in the short term
pairwise deletion is what ma.corrcoef users expect and what we should do.
Maybe you could implement the fix and we could move the discussion to the
PR?

-n