[SciPy-User] calculating covariances fast (and accurate)

Sturla Molden sturla at molden.no
Thu Jul 8 19:30:15 EDT 2010


I needed to calculate covariances with correction for rounding error. As 
SciPy exposes BLAS, I could use dger (rank-1 update) and dgemm (general 
matrix-matrix product). Here is the code:

import numpy as np
from scipy.linalg.fblas import dger, dgemm

def mean_cov(X):
    n, p = X.shape
    m = X.mean(axis=0)
    # covariance matrix with correction for rounding error:
    #   S = (cx'*cx - scx'*scx/n) / (n-1)
    # cf. Chan, Golub & LeVeque, The American Statistician 1983,
    # vol. 37, pp. 242-247.
    cx = X - m
    scx = cx.sum(axis=0)
    # dger: rank-1 update, scx_op = -(1/n) * scx * scx'
    scx_op = dger(-1.0/n, scx, scx)
    # dgemm: S = 1.0 * cx'*cx + 1.0 * scx_op, accumulated in place
    # (cx.T is Fortran-ordered, so fblas can use it without copying)
    S = dgemm(1.0, cx.T, cx.T, beta=1.0,
            c=scx_op, trans_a=0, trans_b=1, overwrite_c=1)
    S[:] *= 1.0/(n-1)
    return m, S.T
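
For reference, here is the corrected formula spelled out in plain NumPy 
(a sketch for sanity-checking the BLAS version; mean_cov_ref is just an 
illustrative name, and np.outer and np.dot play the roles of dger and 
dgemm):

def mean_cov_ref(X):
    # straightforward NumPy version of the corrected formula
    n, p = X.shape
    m = X.mean(axis=0)
    cx = X - m
    scx = cx.sum(axis=0)
    S = (np.dot(cx.T, cx) - np.outer(scx, scx)/n) / (n - 1.0)
    return m, S

np.allclose(mean_cov(X)[1], mean_cov_ref(X)[1]) should hold up to 
rounding.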

Let's time mean_cov a couple of times against NumPy:

if __name__ == '__main__':

    from time import clock

    n, p = 20000, 1000
    X = 2*np.random.randn(n, p) + 5

    # NumPy's cov
    t0 = clock()
    m, S = X.mean(axis=0), np.cov(X, rowvar=False, bias=0)
    t1 = clock()
    print t1 - t0

    # BLAS version
    t0 = clock()
    m, S = mean_cov(X)
    t1 = clock()
    print t1 - t0

L:\>meancov.py
7.39771102515
2.24604790004

L:\>meancov.py
16.1079984658
2.21100101726

That speaks for itself :-D

One important lesson from this: it does not help to write a function 
like cov(X) in C when we have access to optimized BLAS from Python. So 
let this serve as a warning against using C instead of Python.

If we don't want to correct for rounding error, the dger call goes away 
and the function simply becomes:

def mean_cov(X):
    n, p = X.shape
    m = X.mean(axis=0)
    cx = X - m
    # single dgemm call: S = cx'*cx / (n-1)
    S = dgemm(1.0/(n-1), cx.T, cx.T, trans_a=0, trans_b=1)
    return m, S.T
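
A quick sanity check against np.cov (an illustrative snippet, assuming 
the definitions above; bias=0 gives the same n-1 normalization):

X = 2*np.random.randn(1000, 50) + 5
m, S = mean_cov(X)
print np.allclose(m, X.mean(axis=0))                   # True
print np.allclose(S, np.cov(X, rowvar=False, bias=0))  # True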


Sturla


P.S. I should perhaps mention that I use SciPy and NumPy linked with 
MKL. Looking at Windows' Task Manager, it seems both np.cov and my code 
saturate all four cores.
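
If you want to check which BLAS your own NumPy is linked against, one 
quick way is to print the build configuration:

import numpy as np
np.__config__.show()  # lists the BLAS/LAPACK libraries NumPy was built with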
