[SciPy-User] [SciPy-Dev] Entropy from empirical high-dimensional data

josef.pktd at gmail.com josef.pktd at gmail.com
Sat May 28 13:58:38 EDT 2011


On Sat, May 28, 2011 at 5:11 AM, Gael Varoquaux
<gael.varoquaux at normalesup.org> wrote:
> On Fri, May 27, 2011 at 05:21:33PM -0400, David Huard wrote:
>> There is a paper by Fernando Pérez-Cruz on the estimation of the
>> Kullback-Leibler divergence directly from samples using KD trees and
>> nearest neighbours. Maybe you could draw inspiration from it to
>> compute the conditional entropy. The author seems to imply that the
>> method works for a sample with over 700 dimensions.
>
>> I'm including an implementation I wrote just a few days ago that
>> hasn't seen a lot of testing yet (BSD, caveat emptor and all that). I
>> checked I could reproduce the example in the paper, but that's it.
>
> This is awesome. It makes perfect sense to implement the problem this
> way, and it is great to see a paper explaining it clearly. The value
> of the code is huge. Thanks heaps for sharing. I was noting yesterday
> that when trying out ideas on data, each new idea requires a full day
> of work, because I need to implement a lot of basic building blocks,
> and each time I do this quite poorly.
>
> I think that your code, once it has tests, should go in a standard
> package. It is of general interest. scipy.stats seems like a good
> candidate. Alternatively, if it is deemed not suitable there, we could
> integrate it into the metrics sub-package of scikits.learn.

Thanks David, I didn't realize that nearest neighbor methods could be this easy.
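
For reference, here is a minimal sketch of the kind of one-nearest-neighbor
estimator the Pérez-Cruz paper describes (this is not David's implementation;
the function name and interface are just for illustration), using
scipy.spatial.cKDTree:

import numpy as np
from scipy.spatial import cKDTree

def kl_knn(x, y):
    # 1-NN estimate of D(P || Q) from samples x ~ P, shape (n, d),
    # and y ~ Q, shape (m, d)
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    n, d = x.shape
    m = y.shape[0]
    # distance from each x_i to its nearest neighbor in x, excluding itself
    r = cKDTree(x).query(x, k=2)[0][:, 1]
    # distance from each x_i to its nearest neighbor in y
    s = cKDTree(y).query(x, k=1)[0]
    return d * np.mean(np.log(s / r)) + np.log(m / (n - 1.0))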

It will certainly also find a home in scikits.statsmodels. A few weeks
ago I started on mutual information for independence tests with
continuous distributions, based on kernel density estimation. Skipper
has written entropy measures for the discrete case.
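
A rough sketch of that kind of KDE-based mutual information estimate
(a plug-in version of I(X;Y) = E[log p(x,y) - log p(x) - log p(y)]; this
is not the actual statsmodels code, and the function name is made up):

import numpy as np
from scipy import stats

def mutual_info_kde(x, y):
    # plug-in MI estimate for two 1-d samples: fit Gaussian KDEs to the
    # joint and the marginals, average the log density ratio over the data
    xy = np.vstack([x, y])
    kde_joint = stats.gaussian_kde(xy)
    kde_x = stats.gaussian_kde(x)
    kde_y = stats.gaussian_kde(y)
    return np.mean(np.log(kde_joint(xy)) - np.log(kde_x(x)) - np.log(kde_y(y)))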

I have been trying to figure out how well this one-nearest-neighbor
version works; my guess was that it should converge more slowly. But I
have a hard time coming up with a benchmark to test this. I didn't find
much for this in R, and kl.norm for the multivariate normal gives
different results than both the kdtree version and the version based on
gaussian_kde.
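
For a benchmark, the KL divergence between two multivariate normals has
a closed form, D = 0.5 * (tr(S1^-1 S0) + (mu1 - mu0)' S1^-1 (mu1 - mu0)
- d + log(det S1 / det S0)), which, if I understand correctly, is what
kl.norm computes. A minimal sketch (the function name is made up):

import numpy as np

def kl_mvnormal(mu0, sig0, mu1, sig1):
    # closed-form D( N(mu0, sig0) || N(mu1, sig1) )
    mu0, mu1 = np.asarray(mu0, dtype=float), np.asarray(mu1, dtype=float)
    sig0, sig1 = np.atleast_2d(sig0), np.atleast_2d(sig1)
    d = len(mu0)
    sig1_inv = np.linalg.inv(sig1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(np.dot(sig1_inv, sig0))
                  + np.dot(diff, np.dot(sig1_inv, diff))
                  - d
                  + np.log(np.linalg.det(sig1) / np.linalg.det(sig0)))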

For the univariate case I can calculate the KL divergence with
integrate.quad; the test case is two normal distributions with different
variances. The kdtree version has a very good mean (sample size 100),
better than the gaussian_kde version, which seems to be biased upward.
However, the kdtree version has at least twice the standard deviation of
the gaussian_kde version.
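
The quad-based benchmark looks roughly like this (a sketch, names are
mine); for normals with equal means the analytic value
log(s1/s0) + s0**2/(2*s1**2) - 0.5 provides a check:

import numpy as np
from scipy import stats, integrate

def kl_quad_normal(scale0=1.0, scale1=2.0):
    # D( N(0, scale0**2) || N(0, scale1**2) ) by numerical integration
    # of p(x) * (log p(x) - log q(x)) over the real line
    func = lambda x: stats.norm.pdf(x, scale=scale0) * (
        stats.norm.logpdf(x, scale=scale0) - stats.norm.logpdf(x, scale=scale1))
    val, err = integrate.quad(func, -np.inf, np.inf)
    return val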

The kdtree version is much faster.

I was reading several Monte Carlo comparisons for mutual information
(joint versus product of marginals) and KNN was usually one of the
best methods. I assume that we can extend this version to other
divergence measures.


Does anyone know how to get a "certified" test case for the KL divergence?

Josef

>
> I have moved away from this problem for two reasons. The first is that
> I realized I could work only from the entropy of the marginals, which
> makes the problem heaps easier. The second is that, in any case, I am
> measuring the entropy of the noise as much as that of the signal, and I
> end up being dominated by the entropy of the noise. I could get rid of
> it if I had a way to measure it separately, but that requires more
> thinking and maybe more modelling.
>
> Anyhow, thanks heaps for sharing.
>
> Gaël
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>


