[Numpy-discussion] corrcoef of masked array

Wed May 30 06:02:14 EDT 2007

On Friday 25 May 2007 19:18, Robert Kern wrote:
> Jesper Larsen wrote:
> > Hi numpy users,
> >
> > I have a masked array of dimension (nvariables, nobservations) that
> > contains missing values at arbitrary points. Is it safe to rely on
> > numpy.corrcoef to calculate the correlation coefficients of a masked
> > array (it seems to give reasonable results)?
>
> No, it isn't. There are several different options for estimating
> correlations in the face of missing data, none of which are clearly
> superior to the others. None of them are trivial. None of them are
> implemented in numpy.

Thanks. My previous post was sent a bit too early: reading the code for
corrcoef made it clear to me that it is not safe for use with masked
arrays.
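
A small sketch of the problem, assuming a recent NumPy in which
numpy.corrcoef converts its input to a plain ndarray and so drops the mask.
The data are made up for illustration: two perfectly correlated variables
with one bad reading that has been masked out.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical data: y = 2*x, except for one bad reading (999.0) in x
# that is masked out.
x = ma.array([1.0, 2.0, 999.0, 4.0, 5.0], mask=[0, 0, 1, 0, 0])
y = ma.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Plain corrcoef on the masked arrays, and on the raw underlying data.
r_plain = np.corrcoef(x, y)[0, 1]
r_raw = np.corrcoef(x.data, y.data)[0, 1]

# Correlation over the observations that are valid in both variables.
valid = ~(ma.getmaskarray(x) | ma.getmaskarray(y))
r_valid = np.corrcoef(x.data[valid], y.data[valid])[0, 1]

print(r_plain, r_raw, r_valid)
```

Under that assumption r_plain matches r_raw, because the masked 999.0 still
enters the calculation, while r_valid reflects only the valid observations.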

Here is my solution for calculating the correlation coefficients of masked
arrays. Each pair of variables is correlated using only the observations
that are valid in both. Comments are appreciated:

import numpy as np
import numpy.ma as ma

def macorrcoef(data1, data2):
  """
  Calculates correlation coefficients taking masked out values
  into account.

  It is assumed (but not checked) that data1.shape == data2.shape.
  """
  nv, no = data1.shape
  # Start fully masked; an entry is unmasked when a value is assigned to it.
  cc = ma.array(np.zeros((nv, nv)), mask=np.ones((nv, nv), dtype=bool))
  if no > 1:
    for i in range(nv):
      for j in range(nv):
        # Keep only the observations that are valid in both rows.
        m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:])
        d1 = ma.array(data1[i,:], copy=False, mask=m).compressed()
        d2 = ma.array(data2[j,:], copy=False, mask=m).compressed()
        # At least two paired observations are needed for a correlation.
        if d1.size > 1:
          c = np.corrcoef(d1, d2)
          cc[i,j] = c[0,1]

  return cc
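
For larger nv the double loop may get slow. Below is a rough, vectorised
sketch of the same pairwise idea using matrix products over the validity
masks. The name macorrcoef_vec and the example data are hypothetical, and
the single-pass sums it relies on are numerically less robust than calling
corrcoef per pair, so treat it as an illustration rather than a drop-in
replacement.

```python
import numpy as np
import numpy.ma as ma

def macorrcoef_vec(data1, data2):
    """Pairwise-complete correlations via matrix products (sketch only)."""
    v1 = (~ma.getmaskarray(data1)).astype(float)  # 1.0 where valid
    v2 = (~ma.getmaskarray(data2)).astype(float)
    # Masked entries are set to zero so they drop out of every sum.
    x = np.asarray(ma.filled(data1, 0.0), dtype=float)
    y = np.asarray(ma.filled(data2, 0.0), dtype=float)

    n = v1 @ v2.T          # number of paired observations for each (i, j)
    sx = x @ v2.T          # sum of x_i over observations valid in both
    sy = v1 @ y.T          # sum of y_j over observations valid in both
    sxx = (x * x) @ v2.T
    syy = v1 @ (y * y).T
    sxy = x @ y.T

    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sxy - sx * sy / n
        var_x = sxx - sx * sx / n
        var_y = syy - sy * sy / n
        r = cov / np.sqrt(var_x * var_y)

    # Mask pairs with fewer than two observations or an undefined result.
    return ma.masked_where((n < 2) | ~np.isfinite(r), r)

# Made-up example: 3 variables, 6 observations, two missing values.
data = ma.array(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
     [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]],
    mask=[[0, 0, 1, 0, 0, 0],
          [0, 0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0, 0]])
cc = macorrcoef_vec(data, data)
print(cc)
```

Each entry cc[i,j] should agree, up to rounding, with corrcoef applied to the
observations that are valid in both row i and row j.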

- Jesper