[SciPy-User] PCA functions
Zachary Pincus
zachary.pincus at yale.edu
Wed May 19 11:31:20 EDT 2010
Hi Mike,
Here's what I use. I don't think there's anything in scipy per se, but
I might be wrong.
Empirically, I find that doing the PCA with eigh is faster than with
svd, but this might be based on the dimensionality of my data vs. the
number of data points I use. The functions take in an (m,n)-shaped
matrix of m data points in n dimensions, and return an array of shape
(k,n) consisting of k principal components in n dimensions (where
k=min(m,n)), a (k,)-shaped array of the variances of the data along
each principal component, and a (m,k)-shaped array of the projection
of each data point into the subspace spanned by the principal
components.
Zach
import numpy
def pca_svd(flat):
u, s, vt = numpy.linalg.svd(flat, full_matrices = 0)
pcs = vt
v = numpy.transpose(vt)
data_count = len(flat)
variances = s**2 / data_count
positions = u * s
return pcs, variances, positions
def pca_eig(flat):
values, vectors = _symm_eig(flat)
pcs = vectors.transpose()
variances = values / len(flat)
positions = numpy.dot(flat, vectors)
return pcs, variances, positions
def _symm_eig(a):
"""Given input a, return the non-zero eigenvectors and eigenvalues
of the symmetric matrix a'a.
If a has more columns than rows, then that matrix will be rank-
deficient,
and the non-zero eigenvalues and eigenvectors of a'a can be more
easily extracted
from the matrix aa'. From the properties of the SVD:
if a of shape (m,n) has SVD u*s*v', then:
a'a = v*s's*v'
aa' = u*ss'*u'
let s_hat, an array of shape (m,n), be such that s * s_hat = I(m,m)
and s_hat * s = I(n,n). Thus, we can solve for u or v in terms of
the other:
v = a'*u*s_hat'
u = a*v*s_hat
"""
m, n = a.shape
if m >= n:
# just return the eigenvalues and eigenvectors of a'a
vecs, vals = _eigh(numpy.dot(a.transpose(), a))
vecs = numpy.where(vecs < 0, 0, vecs)
return vecs, vals
else:
# figure out the eigenvalues and vectors based on aa', which is
smaller
sst_diag, u = _eigh(numpy.dot(a, a.transpose()))
# in case due to numerical instabilities we have sst_diag < 0
anywhere,
# peg them to zero
sst_diag = numpy.where(sst_diag < 0, 0, sst_diag)
# now get the inverse square root of the diagonal, which will
form the
# main diagonal of s_hat
err = numpy.seterr(divide='ignore', invalid='ignore')
s_hat_diag = 1/numpy.sqrt(sst_diag)
numpy.seterr(**err)
s_hat_diag = numpy.where(numpy.isfinite(s_hat_diag), s_hat_diag, 0)
# s_hat_diag is a list of length m, a'u is (n,m), so we can just
use
# numpy's broadcasting instead of matrix multiplication, and only
create
# the upper mxm block of a'u, since that's all we'll use anyway...
v = numpy.dot(a.transpose(), u[:,:m]) * s_hat_diag
return sst_diag, v
def _eigh(m):
values, vectors = numpy.linalg.eigh(m)
order = numpy.flipud(values.argsort())
return values[order], vectors[:,order]
On May 19, 2010, at 10:53 AM, Michael Hull wrote:
> Hi Everybody,
> I am doing some work using numpy/scipy and wanted to find the
> principle components for some data. I can write a fairly simple
> function to do this, but was wondering if there was already a function
> in scipy to do this that I hadn't found before re-inventing the wheel
>
> Many thanks,
>
>
> Mike Hull
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
More information about the SciPy-User
mailing list