[SciPy-User] PCA functions

Zachary Pincus zachary.pincus at yale.edu
Wed May 19 11:31:20 EDT 2010


Hi Mike,

Here's what I use. I don't think there's anything in scipy per se, but  
I might be wrong.

Empirically, I find that doing the PCA with eigh is faster than with  
svd, but this might be based on the dimensionality of my data vs. the  
number of data points I use. The functions take in an (m,n)-shaped  
matrix of m data points in n dimensions, and return an array of shape  
(k,n) consisting of k principal components in n dimensions (where  
k=min(m,n)), a (k,)-shaped array of the variances of the data along  
each principal component, and a (m,k)-shaped array of the projection  
of each data point into the subspace spanned by the principal  
components.

Zach



import numpy

def pca_svd(flat):
   u, s, vt = numpy.linalg.svd(flat, full_matrices = 0)
   pcs = vt
   v = numpy.transpose(vt)
   data_count = len(flat)
   variances = s**2 / data_count
   positions =  u * s
   return pcs, variances, positions

def pca_eig(flat):
   values, vectors = _symm_eig(flat)
   pcs = vectors.transpose()
   variances = values / len(flat)
   positions = numpy.dot(flat, vectors)
   return pcs, variances, positions

def _symm_eig(a):
   """Given input a, return the non-zero eigenvectors and eigenvalues  
of the symmetric matrix a'a.

   If a has more columns than rows, then that matrix will be rank- 
deficient,
   and the non-zero eigenvalues and eigenvectors of a'a can be more  
easily extracted
   from the matrix aa'. From the properties of the SVD:
     if a of shape (m,n) has SVD u*s*v', then:
       a'a = v*s's*v'
       aa' = u*ss'*u'
     let s_hat, an array of shape (m,n), be such that s * s_hat = I(m,m)
     and s_hat * s = I(n,n). Thus, we can solve for u or v in terms of  
the other:
       v = a'*u*s_hat'
       u = a*v*s_hat
   """
   m, n = a.shape
   if m >= n:
     # just return the eigenvalues and eigenvectors of a'a
     vecs, vals = _eigh(numpy.dot(a.transpose(), a))
     vecs = numpy.where(vecs < 0, 0, vecs)
     return vecs, vals
   else:
     # figure out the eigenvalues and vectors based on aa', which is  
smaller
     sst_diag, u = _eigh(numpy.dot(a, a.transpose()))
     # in case due to numerical instabilities we have sst_diag < 0  
anywhere,
     # peg them to zero
     sst_diag = numpy.where(sst_diag < 0, 0, sst_diag)
     # now get the inverse square root of the diagonal, which will  
form the
     # main diagonal of s_hat
     err = numpy.seterr(divide='ignore', invalid='ignore')
     s_hat_diag = 1/numpy.sqrt(sst_diag)
     numpy.seterr(**err)
     s_hat_diag = numpy.where(numpy.isfinite(s_hat_diag), s_hat_diag, 0)
     # s_hat_diag is a list of length m, a'u is (n,m), so we can just  
use
     # numpy's broadcasting instead of matrix multiplication, and only  
create
     # the upper mxm block of a'u, since that's all we'll use anyway...
     v = numpy.dot(a.transpose(), u[:,:m]) * s_hat_diag
     return sst_diag, v

def _eigh(m):
   values, vectors = numpy.linalg.eigh(m)
   order = numpy.flipud(values.argsort())
   return values[order], vectors[:,order]






On May 19, 2010, at 10:53 AM, Michael Hull wrote:

> Hi Everybody,
> I am doing some work using numpy/scipy and wanted to find the
> principle components for some data. I can write a fairly simple
> function to do this, but was wondering if there was already a function
> in scipy to do this that I hadn't found before re-inventing the wheel
>
> Many thanks,
>
>
> Mike Hull
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user




More information about the SciPy-User mailing list