[SciPy-User] [SciPy-Dev] Entropy from empirical high-dimensional data

Matthieu Brucher matthieu.brucher at gmail.com
Thu May 26 04:29:26 EDT 2011


2011/5/26 Gael Varoquaux <gael.varoquaux at normalesup.org>

> On Wed, May 25, 2011 at 06:45:02PM -0400, josef.pktd at gmail.com wrote:
> > 30000 doesn't sound like a lot of observations for 100 dimensions,
> > and 2**100 bins is far too many, so binning sounds pretty much impossible.
>
> Yes, it's ridiculously bad. I have only started realising how bad it is,
> even though I initially had such an intuition.
>
> > Are you willing to impose some structure (a Gaussian copula might be
> > able to handle it, or blockwise independence)? But even then,
> > integration in 100 dimensions sounds tough.
>
> > gaussian_kde with Monte Carlo integration?
>
> That's definitely something that would be worth a look. It would be very
> slow, though, and this step is already inside a cross-validation loop.
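
For concreteness, a minimal sketch of that gaussian_kde + Monte Carlo
idea (a sketch of the estimator, not a recommendation: in 100 dimensions
the KDE density values can easily underflow, hence the clip below):

import numpy as np
from scipy.stats import gaussian_kde

def kde_entropy_mc(data, n_samples=10000):
    # Monte Carlo estimate of the differential entropy H = -E[log p(X)].
    # `data` has shape (n_dims, n_obs), as gaussian_kde expects: fit the
    # KDE, draw samples from it, and average -log p over the draws.
    kde = gaussian_kde(data)
    samples = kde.resample(n_samples)         # shape (n_dims, n_samples)
    density = kde.evaluate(samples)           # p(x) at each sample
    density = np.clip(density, 1e-300, None)  # guard against underflow
    return -np.mean(np.log(density))

The cost is one density evaluation per Monte Carlo sample against all
30000 kernel centres, which is what makes this slow inside a
cross-validation loop.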
>
> > Maybe a PCA or some other dimension reduction would help, if the data
> > is clustered in some dimensions.
>
> Unfortunately, the goal here is to do model selection on the number of
> components of the dimension reduction (in other words, latent factor
> analysis).
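
One way to phrase that model selection, sketched with scikit-learn's
FactorAnalysis (an API that postdates this thread, so take it as an
assumption): the held-out log-likelihood from score() can be
cross-validated directly over the candidate numbers of components.

import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score

def select_n_factors(X, candidates):
    # X has shape (n_obs, n_dims).  FactorAnalysis.score returns the
    # average log-likelihood of held-out data, so cross_val_score can
    # rank candidate latent dimensions without any density integration.
    scores = [cross_val_score(FactorAnalysis(n_components=n), X).mean()
              for n in candidates]
    return candidates[int(np.argmax(scores))], scores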


The student who did his PhD before my dimensionality reduction thesis
worked a little on this. As you don't have much data (30000 points is not
big for a 100-dimensional space), you can check the eigenvalues of the
covariance matrix: when adding a new eigenvector no longer explains the
data, the eigenvalues drop.
These eigenvalues are also linked to the entropy of your data, although a
linear reduction is not a good estimator of it IMHO (see the sketch below).
This eigenvalue analysis also works with Isomap, LLE, ...
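
A rough sketch of that eigenvalue check, together with the entropy link
for the Gaussian case (H = 0.5 * sum(log(2*pi*e*lambda_i)) when the
lambda_i are the covariance eigenvalues). For non-Gaussian data this
value is only an upper bound on the true differential entropy, which is
one way to read the caveat above:

import numpy as np

def covariance_spectrum(X):
    # Eigenvalues of the sample covariance of X (n_obs, n_dims), sorted
    # in decreasing order.  A sharp drop after the k-th value suggests
    # that k directions already explain the data.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return eigvals[::-1]

def gaussian_entropy(eigvals, eps=1e-12):
    # Entropy of a Gaussian with these covariance eigenvalues.  For any
    # other distribution with the same covariance the true entropy is
    # lower, since the Gaussian is the maximum-entropy distribution.
    lam = np.clip(eigvals, eps, None)   # guard against zero eigenvalues
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * lam))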

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher