[SciPy-User] [SciPy-Dev] Entropy from empirical high-dimensional data

Robert Kern robert.kern at gmail.com
Wed May 25 18:27:26 EDT 2011


On Wed, May 25, 2011 at 16:40, Gael Varoquaux
<gael.varoquaux at normalesup.org> wrote:
> Hi list,
>
> I am looking at estimating entropy and conditional entropy from data
> for which I only have access to observations, not the underlying
> probability distributions.
>
> With low dimensional data, I would simply use an empirical estimate of
> the probabilities by converting each observation to its quantile, and
> then apply the standard formula for entropy (for instance using
> scipy.stats.entropy).
>
> However, I have high-dimensional data (~100 features, and 30000
> observations). Not only is it harder to convert observations to
> probabilities under the empirical law, but I am also worried about
> curse-of-dimensionality effects: density estimation in high
> dimensions is a difficult problem.
>
> Does anybody have advice, or Python code to point to, for this task?
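
A quick aside first: in low dimension, the plug-in estimate Gael
describes is only a few lines. Here is a minimal sketch using
histogram bins rather than quantiles (the bin count of 50 is an
arbitrary choice):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)        # toy 1-D sample

    # Empirical bin probabilities for the observations.
    counts, edges = np.histogram(x, bins=50)

    # Discrete plug-in entropy in nats; stats.entropy normalizes
    # the counts to probabilities internally.
    h_discrete = stats.entropy(counts)

    # Shift by log(bin width) to estimate the differential entropy.
    h_diff = h_discrete + np.log(edges[1] - edges[0])
    print(h_diff)  # near 0.5*np.log(2*np.pi*np.e) ~ 1.42 for a standard normal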

This is just from a quick Googling, but it looks like the main
approach is to partition the space into equal-density chunks using an
appropriate partitioning scheme. This one uses k-d trees:

Fast multidimensional entropy estimation by k-d partitioning
http://www.elec.qmul.ac.uk/digitalmusic/papers/2009/StowellPlumbley09entropy.pdf

This one uses a Voronoi tessellation:

A new class of entropy estimators for multi-dimensional densities
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.5037&rep=rep1&type=pdf
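
To give a flavour of the partitioning approach, here is a rough
sketch along the lines of the first (k-d) paper: median splits on
cycling axes with a fixed leaf size. The paper's actual stopping rule
is a uniformity test, so treat this as my simplification rather than
their algorithm; it also assumes continuous data with no ties.

    import numpy as np

    def kd_entropy(x, min_leaf=32):
        # Entropy (nats) via k-d partitioning: split at the median
        # along cycling axes until cells are small, then sum the
        # per-cell contributions (n_j/N) * log(N * V_j / n_j).
        n, d = x.shape

        def cell(pts, lo, hi, axis):
            m = len(pts)
            if m == 0:
                return 0.0
            if m <= min_leaf:
                return (m / n) * np.log(n * np.prod(hi - lo) / m)
            med = np.median(pts[:, axis])
            left = pts[:, axis] <= med
            hi_left, lo_right = hi.copy(), lo.copy()
            hi_left[axis] = lo_right[axis] = med
            nxt = (axis + 1) % d
            return (cell(pts[left], lo, hi_left, nxt)
                    + cell(pts[~left], lo_right, hi, nxt))

        return cell(x, x.min(axis=0), x.max(axis=0), 0)

The per-cell volumes are exactly what degenerate as the dimension
grows.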


They are both unsuitable for ndim=100, however. You may be able to do
something similar with a ball tree and maybe even get a paper out of
it. :-)
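
Nearest-neighbour estimators are the natural thing to build on a
ball tree. A minimal sketch of the Kozachenko-Leonenko estimator on
top of scikit-learn's BallTree (the estimator and library are my
choice here, not something from the papers above; it assumes
continuous data with no duplicated rows, since a zero neighbour
distance breaks the log):

    import numpy as np
    from scipy.special import digamma, gammaln
    from sklearn.neighbors import BallTree

    def knn_entropy(x, k=3):
        # Kozachenko-Leonenko estimate (nats):
        #   H ~ psi(n) - psi(k) + log(V_d) + (d/n) * sum_i log(r_i)
        # where r_i is the distance from point i to its k-th
        # neighbour and V_d is the volume of the d-dim unit ball.
        n, d = x.shape
        dist, _ = BallTree(x).query(x, k=k + 1)  # k+1: first hit is the point itself
        r = dist[:, -1]                          # distance to k-th real neighbour
        log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
        return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r))

Only k-nearest-neighbour queries are needed, so it stays tractable at
ndim=100; whether the statistical quality holds up there is the part
that might be worth the paper.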

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


