[SciPy-User] calculating covariances fast (and accurate)
Sturla Molden
sturla at molden.no
Thu Jul 8 19:30:15 EDT 2010
I needed to calculate covariances with rounding error correction. As
SciPy expose BLAS, I could use dger and dgemm. Here is the code:
import numpy as np
import scipy as sp
import scipy.linalg
from scipy.linalg.fblas import dger, dgemm
def mean_cov(X):
n,p = X.shape
m = X.mean(axis=0)
# covariance matrix with correction for rounding error
# S = (cx'*cx - (scx'*scx/n))/(n-1)
# Am Stat 1983, vol 37: 242-247.
cx = X - m
scx = cx.sum(axis=0)
scx_op = dger(-1.0/n,scx,scx)
S = dgemm(1.0, cx.T, cx.T, beta=1.0,
c=scx_op, trans_a=0, trans_b=1, overwrite_c=1)
S[:] *= 1.0/(n-1)
return m,S.T
Let's time this couple of times against NumPy:
if __name__ == '__main__':
from time import clock
n,p = 20000,1000
X = 2*np.random.randn(n,p) + 5
t0 = clock()
m,S = X.mean(axis=0), np.cov(X, rowvar=False, bias=0)
t1 = clock()
print t1-t0
t0 = clock()
m,S = mean_cov(X)
t1 = clock()
print t1-t0
L:\>meancov.py
7.39771102515
2.24604790004
L:\>meancov.py
16.1079984658
2.21100101726
That speaks for itself :-D
One important lesson from this: it does not help to write a function
like cov(X) in C, when we have access to optimized BLAS from Python. So
let this serve as a warning against using C istead of Python.
If we don't want to correct rounding error, dger goes out and it just
becomes:
def mean_cov(X):
n,p = X.shape
m = X.mean(axis=0)
cx = X - m
S = dgemm(1./(n-1), cx.T, cx.T, trans_a=0, trans_b=1)
return m,S.T
Sturla
P.S. I should perhaps mention that I use SciPy and Numpy linked with
MKL. Looking at Windows' task manager, it seems both np.cov and my code
saturate all four cores.
More information about the SciPy-User
mailing list