[SciPy-User] [SciPy-Dev] Entropy from empirical high-dimensional data

David Huard david.huard at gmail.com
Fri May 27 17:21:33 EDT 2011


There is a paper by Fernando Pérez-Cruz on the estimation of the
Kullback-Leibler divergence directly from samples using KD trees and
nearest neighbours. Maybe you could draw inspiration from it to
compute the conditional entropy. The author seems to imply that the
method works for a sample with over 700 dimensions.
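
For the conditional entropy itself, the same nearest-neighbour idea
should work: the Kozachenko-Leonenko estimator gives the differential
entropy of a sample directly, and H(X|Y) = H(X, Y) - H(Y). A rough,
untested sketch along those lines (this one is mine, not the estimator
from the paper):

import numpy as np
from scipy.spatial import cKDTree as KDTree
from scipy.special import gamma, psi

def nn_entropy(x, k=1):
    """Kozachenko-Leonenko k-nearest-neighbour estimate of the
    differential entropy (in nats) of the sample x (n, d)."""
    x = np.atleast_2d(x)
    n, d = x.shape

    # Distance from each point to its k-th nearest neighbour,
    # excluding the point itself (hence k + 1 below).
    r = KDTree(x).query(x, k=k + 1, p=2)[0][:, k]

    # Volume of the d-dimensional unit ball for the Euclidean norm.
    vd = np.pi ** (d / 2.) / gamma(d / 2. + 1.)

    return psi(n) - psi(k) + np.log(vd) + d * np.log(r).sum() / n

Something like nn_entropy(np.column_stack([x, y])) - nn_entropy(y)
would then give an estimate of H(X|Y).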

I'm also including an implementation of the KL divergence estimator
itself, written just a few days ago, that hasn't seen much testing yet
(BSD, caveat emptor and all that). I checked that I could reproduce
the example in the paper, but that's it.

http://eprints.pascal-network.org/archive/00004910/01/bare_conf3.pdf

HTH,

David


import numpy as np

def KLdivergence(x, y):
    """Compute the Kullback-Leibler divergence between two multivariate samples.

    Parameters
    ----------
    x : 2D array (n,d)
      Samples from distribution P, which typically represents the true
      distribution.
    y : 2D array (m,d)
      Samples from distribution Q, which typically represents the approximate
      distribution.

    Returns
    -------
    out : float
      The estimated Kullback-Leibler divergence D(P||Q).

    References
    ----------
    Pérez-Cruz, F. (2008). Kullback-Leibler divergence estimation of
    continuous distributions. IEEE International Symposium on
    Information Theory.
    """
    from scipy.spatial import cKDTree as KDTree

    # Check the dimensions are consistent
    x = np.atleast_2d(x)
    y = np.atleast_2d(y)

    n, d = x.shape
    m, dy = y.shape

    assert d == dy


    # Build a KD tree representation of the samples and find the nearest
    # neighbour of each point in x.
    xtree = KDTree(x)
    ytree = KDTree(y)

    # Get the first two nearest neighbours for x, since the closest one is the
    # sample itself.
    r = xtree.query(x, k=2, eps=.01, p=2)[0][:,1]
    s = ytree.query(x, k=1, eps=.01, p=2)[0]

    # There is a mistake in the paper: in Eq. 14, the first term on the
    # right-hand side is missing a negative sign.
    return -np.log(r/s).sum() * d / n + np.log(m / (n - 1.))
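
A quick way to exercise it: for two d-dimensional normal samples with
identity covariance and means mu_p and mu_q, the true divergence is
0.5 * ||mu_p - mu_q||**2, so the estimate should land somewhere near
that value.

if __name__ == '__main__':
    # Rough check against the analytic value for two normals with
    # identity covariance: D(P||Q) = 0.5 * ||mu_p - mu_q||**2.
    np.random.seed(0)
    mu = np.array([1.0, 0.0])
    x = np.random.randn(5000, 2)         # sample from P = N(0, I)
    y = np.random.randn(5000, 2) + mu    # sample from Q = N(mu, I)
    print(KLdivergence(x, y))            # should be roughly 0.5
    print(0.5 * np.dot(mu, mu))          # analytic value: 0.5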



On Thu, May 26, 2011 at 3:28 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Thu, May 26, 2011 at 00:26, Gael Varoquaux
> <gael.varoquaux at normalesup.org> wrote:
>> On Wed, May 25, 2011 at 06:45:02PM -0400, josef.pktd at gmail.com wrote:
>
>>> Maybe a PCA or some other dimension reduction helps, if the data is
>>> clustered in some dimensions.
>>
>> Unfortunately, the goal here is to do model selection on the number of
>> components of dimension reduction (iow latent factor analysis).
>>
>>> (It's not quite clear whether you have a discrete sample space like in
>>> the reference of Nathaniel, or a continuous space in R^100)
>>
>> Continuous :(.
>>
>> As I mentioned in another mail, after sleeping on this problem I
>> realized that I should be able to work only from the entropies of the
>> marginal distributions, which makes the problem tractable.
>
> That would put an upper bound on the entropy certainly, but it's blind
> to any remaining correlations between variables. It seems like that's
> exactly what you want to find out if you are doing model selection for
> a dimension reduction. I'm probably wrong, but I'd love to know why.
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>   -- Umberto Eco


