[SciPy-User] [SciPy-Dev] Entropy from empirical high-dimensional data

Gael Varoquaux gael.varoquaux at normalesup.org
Thu May 26 01:26:48 EDT 2011


On Wed, May 25, 2011 at 06:45:02PM -0400, josef.pktd at gmail.com wrote:
> 30000 doesn't sound like a lot of observations for 100 dimensions;
> 2**100 bins is pretty large, so binning sounds pretty much impossible.

Yes, it's ridiculously bad. I have only just started realising how bad
it is, even though I initially had such an intuition.

> Are you willing to impose some structure? (A Gaussian copula might be
> able to handle it, or blockwise independence.) But even then,
> integration in 100 dimensions sounds tough.

> gaussian_kde with Monte Carlo integration?

That's definitely something that would be worth a look. It would be very
slow, though, and this step is already inside a cross-validation loop.
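
For concreteness, a minimal sketch of what that could look like with
scipy.stats.gaussian_kde (kde_entropy_mc and n_mc are illustrative
names, not something from this thread). Note that in 100 dimensions the
KDE itself suffers from the curse of dimensionality, which is part of
why this route is slow and unreliable:

from scipy.stats import gaussian_kde

def kde_entropy_mc(samples, n_mc=10000):
    # samples: array of shape (n_dims, n_obs), the layout gaussian_kde expects
    kde = gaussian_kde(samples)
    # Differential entropy H(X) = -E[log p(X)]; approximate the
    # expectation by averaging over points drawn from the fitted KDE.
    mc_points = kde.resample(n_mc)
    return -kde.logpdf(mc_points).mean()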

> Maybe a PCA or some other dimension reduction helps, if the data is
> clustered in some dimensions.

Unfortunately, the goal here is to do model selection on the number of
components of the dimension reduction (in other words, latent factor
analysis).

> (It's not quite clear whether you have a discrete sample space like in
> the reference of Nathaniel, or a continuous space in R^100)

Continuous :(.

As I mentioned in another mail, after sleeping on this problem I
realized that I should be able to work only from the entropies of the
marginal distributions, which makes the problem tractable.
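
A minimal sketch of that marginal approach (names are illustrative):
the joint entropy satisfies H(X) <= sum_i H(X_i), with equality when
the components are independent, and each one-dimensional marginal
entropy is cheap to estimate, e.g. again with a KDE:

from scipy.stats import gaussian_kde

def marginal_entropy_sum(data, n_mc=10000):
    # data: array of shape (n_obs, n_dims)
    # The sum of 1-D marginal entropies upper-bounds the joint entropy,
    # and is exact when the dimensions are independent.
    total = 0.0
    for col in data.T:  # one KDE per dimension
        kde = gaussian_kde(col)
        mc = kde.resample(n_mc)
        total += -kde.logpdf(mc).mean()
    return total

(Recent SciPy also provides scipy.stats.differential_entropy for direct
one-dimensional estimates, though it postdates this thread.)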

Thanks for all the answers; they really helped me develop an
understanding of the problem.

Gaël


