[SciPy-User] Probability Density Estimation

Fri Apr 29 15:01:35 EDT 2011

Dear all,

this is a bit overdue; thanks a lot to everyone who helped me
with my KDE trouble three weeks ago.

On Wed, Apr 06, 2011 at 11:47:26AM -0400, josef.pktd at gmail.com wrote:
> If you have or find some results and are willing to share, then I will
> be very interested.

The solution to my problem was to make sure that I use the same
bandwidth for all the KDE estimations relating to the same
variable, i.e. for P(X), P(X|Y=1) and P(X|Y=0).  Now the results
seem plausible, even though I did not get the results I was hoping
for :-)

I attach a monkey-patched and annotated version of gaussian_kde making 
it easier to choose the bandwidth.  The main reason I monkey-patched
rather than inherit is that I needed to annotate it to understand what
I was doing.  The comments are intended for pylit and sphinx, but do
not work as well as I would have hoped.

The second attachment is a derived class supporting entropy estimation.

I have not done any really serious testing, and may have made mistakes.

Any feedback would be welcome.

-- 
:-- Hans Georg
-------------- next part --------------
## $Id: kde.py 2476 2011-04-29 18:51:57Z georg $
## -*- coding: utf-8 -*-

"""
Define classes for (uni/multi)-variate kernel density estimation.

Currently, only Gaussian kernels are implemented.

:Module:    itml.kde
:Author:    Robert Kern
:Date:      $Date: 2011-04-29 20:51:57 +0200 (Fri, 29 Apr 2011) $
:Revision:  $Revision: 2476 $
:Copyright: 2004-2005 by Enthought, Inc.
            ? 2011: Hans Georg Schaathun <georg at schaathun.net>

:Original Date: 2004-08-09
:Modified: 2005-02-10 by Robert Kern.  Contributed to Scipy
           2005-10-07 by Robert Kern.  Some fixes to match the new scipy_core
           2011-04-13 by Hans Georg Schaathun.
"""

print "[itml.kde] $Id: kde.py 2476 2011-04-29 18:51:57Z georg $"

# #########################
# Kernel density estimation
# #########################
# 
# .. automodule:: itml.kde
# 
# Standard library imports::

import warnings

# Scipy imports::

from scipy import linalg, special
from numpy import atleast_2d, reshape, zeros, newaxis, dot, exp, pi, sqrt, \
     ravel, power, atleast_1d, squeeze, sum, transpose
import numpy as np
from numpy.random import randint, multivariate_normal

# Local imports::

#import stats
from scipy.stats import mvn
from . import *

__all__ = ['gaussian_kde', ]

# Auxiliary functions
# ===================
# 
# .. autofunction:: scotts_factor
#
# .. autofunction:: silverman_factor
#
# Both functions calculate a bandwidth scaling factor based on the sample
# size n and dimensionality d.
#
# ::

def scotts_factor(n,d):
  return power(n, -1./(d+4))

def silverman_factor(n,d):
  return power(n*(d+2.0)/4.0, -1./(d+4))

# The class
# =========
# 
# .. autoclass:: gaussian_kde

class gaussian_kde(object):
    """
    Representation of a kernel-density estimate using Gaussian kernels.

    Attributes
    ----------
    d : int
        number of dimensions
    n : int
        number of datapoints

    Methods
    -------
    kde.evaluate(points) : array
        evaluate the estimated pdf on a provided set of points
    kde(points) : array
        same as kde.evaluate(points)
    kde.setBandwidth(bandwidth) : array
	sets the bandwidth matrix, or if H is not specified,
	recalculates a bandwidth matrix from the covariance
    kde.setBandwidthFactor(factor) : array
	sets the scaling factor and recalculates the bandwidth 
	matrix from the covariance.
    kde.integrate_gaussian(mean, cov) : float
        multiply pdf with a specified Gaussian and integrate over the 
	whole domain
    kde.integrate_box_1d(low, high) : float
        integrate pdf (1D only) between two bounds
    kde.integrate_box(low_bounds, high_bounds) : float
        integrate pdf over a rectangular space between low_bounds and
	high_bounds
    kde.integrate_kde(other_kde) : float
        integrate two kernel density estimates multiplied together
    kde.resample(size=None) : array
        randomly sample a dataset from the estimated pdf.

    Internal Methods
    ----------------
    kde.covariance_factor() : float
        computes the coefficient that multiplies the data covariance matrix 
	to obtain the kernel covariance matrix.

    Parameters
    ----------
    dataset : (# of dims, # of data)-array
        datapoints to estimate from
    factor : (optional) function, scalar, or string
	optionally specify a scaling factor to use to calculate the
	bandwidth matrix.  It may be a function taking sample size
	and dimension as inputs, a scalar, or a string "scotts" or
	"silverman" for the standard scaling factors.  If a bandwidth
	matrix (below) is specified, then factor is ignored.
    bandwidth : optional, (# dims, # dims)-array
        optionally specify the bandwidth matrix to use with the kernel

    """

# The following dictionary is used to interpret bandwidth factors specified
# as prescriptive strings, and look up corresponding functions.
#   ::

    bwfactor = {
      "scott" : scotts_factor,
      "scotts" : scotts_factor,
      "silverman" : silverman_factor,
      "sc" : scotts_factor,
      "si" : silverman_factor,
    }

# The constructor will only store the data set and compute sample size,
# dimension, and covariance matrix.  All the important calculations are
# made later, upon evaluation or integration.
# 
#    ::

    def __init__(self, dataset, factor="scotts", bandwidth=None ):
        self.dataset = atleast_2d(dataset)
        self.d, self.n = self.dataset.shape
	if bandwidth != None: self.setBandwidth( bandwidth )
	else: self.setBandwidthFactor( factor )

# This is the most important method, evaluating the estimated function
# at given points.
# 
# .. automethod:: gaussian_kde.evaluate

    def evaluate(self, points):
        """
        Evaluate the estimated pdf on a set of points.

        Parameters
        ----------
        points : (# of dimensions, # of points)-array
            Alternatively, a (# of dimensions,) vector can be passed in and
            treated as a single point.

        Returns
        -------
        values : (# of points,)-array
            The values at each point.

        Raises
        ------
        ValueError if the dimensionality of the input points is different than
        the dimensionality of the KDE.
        """

# The given evaluation point(s) are converted to a 2-D array of the
# same type as the original sample points.  
#     
#     ::

        points = atleast_2d(points).astype(self.dataset.dtype)

# We check that the dimensions match.  If a single point was passed
# as a row vector, we can simply transpose it.  Any other mismatch
# leads to an exception.
# 
#     ::

        d, m = points.shape
        if d != self.d:
            if d == 1 and m == self.d:
                # points was passed in as a row vector
                points = reshape(points, (self.d, 1))
                m = 1
            else:
                msg = "points have dimension %s, dataset has dimension %s" % (d,
                    self.d)
                raise ValueError(msg)

# We initialise the return array with all zeros.
#     
#      ::

        result = zeros((m,), points.dtype)

# For the sake of computational efficiency, we distinguish
# the cases where the sample is smaller or larger than the
# set of evaluation points.  
# If there are more points than data, then we loop over data,
# otherwise we will loop over the points.
# The algorithms for the two cases are equivalent,
# but the second case is easier to explain so we leave this
# one uncommented.
#     
#      ::

        if m >= self.n:
            for i in range(self.n):
                diff = self.dataset[:,i,newaxis] - points
                tdiff = dot(self.inv_bandwidth, diff)
                energy = sum(diff*tdiff,axis=0)/2.0
                result += exp(-energy)
        else:
            for i in range(m):

# Now, we loop over points.
# Each iteration considers a single evaluation point :math:`x_i`,
# and we calculate the following formula which is the definition
# of the Gaussian KDE.
#        
# .. math::
#
#            \hat f(x_i) = \frac1{n}\sum_{j=1}^n s_H(X_j-x_i),
#
# where :math:`s_H` is the scaled Gaussian kernel defined,
# for a given bandwidth matrix :math:`H` as
#
# .. math::
#
#        s_H(\mathbf{x}) = \frac1{\sqrt{|2\pi H^2|}}
#        \exp\big( \frac12 \mathbf{x}^T\cdot (H^2)^{-1}\cdot\mathbf{x} \big).
#
# The bandwidth :math:`H` is calculated by the :meth:`setBandwidth`
# method, and is usually given as :math:`H^2 = h^2\cdot\Sigma^2` where
# :math:`\Sigma^2` is the covariance matrix and :math:`h` is a scalar
# scaling factor.
#        
# First we calculate the differences :math:`X_j-x_i` for
# each sample point :math:`X_j` and the evaluation point :math:`x_i`.
# Each column of diff is a difference :math:`X_j - x_i`.
#        
#      ::

                diff = self.dataset - points[:,i,newaxis]

# Now we want to calculate 
# :math:`(X_j-x_i)^T\dot\Sigma^{-1}\dot (X_j-x_i)/2` for each :math:`j`
# where :math:`\Sigma` is the covariance matrix.
# 
#      ::

                tdiff = dot(self.inv_bandwidth, diff)
                energy = sum(diff*tdiff,axis=0)/2.0

# Each entry in energy corresponds to one sample point :math:`X_j`.
# We apply the exponential function and sum over :math:`j` obtaining the
# result :math:`\hat f(x_i)` as result[i].
# 
#      ::

                result[i] = sum(exp(-energy),axis=0)

# Finally, we divide by the normalisation factor
# :math:`n\sqrt{|2\pi\Sigma|}`.
# 
#      ::

        result /= self._norm_factor

        return result

# Just for syntactic sugar, we make the object callable. 
# Calling the object is equivalent to calling :meth:`evaluate`.
# 
#  ::

    __call__ = evaluate

# Covariance and Bandwidth
# ========================
# 
#   ::

    def covariance_factor(self):
       """
       Calculate the scaling factor for the bandwidth matrix.
       This is intended for internal use only.
       """
       if callable(self._bwfactor): return self._bwfactor(self.n,self.d)
       else: return self._bwfactor

    def setBandwidthFactor(self,factor="scotts",recalculate=True):
       if self.bwfactor.has_key(factor): self._bwfactor = self.bwfactor[factor]
       elif callable(factor): self._bwfactor = factor
       else: self._bwfactor = float(factor)
       if recalculate: self.setBandwidth()

    def getBandwidth(self):
       return self.bandwidth

    def setBandwidth(self,bandwidth=None,covariance=None):
        """Computes the covariance matrix for each Gaussian kernel using
        covariance_factor
        """

# If the bandwidth is given, we use it.  Otherwise we calculate
# a recommended bandwidth from the covariance matrix.
#
#   ::

	if bandwidth != None:
	   self.covariance = None
	   self.bandwidth = bandwidth
	else:
          factor = self.covariance_factor()

# The covariance attribute is the covariance matrix :math:`\Sigma^2`,
# and the bandwidth attribute is the squared bandwidth matrix 
# :math:`\vec{H} = (h\Sigma)^2`.
#
#   ::

	  if covariance == None:
            self.covariance = atleast_2d(
	        np.cov(self.dataset, rowvar=1, bias=False) )
	  else: self.covariance = covariance 
          self.bandwidth = self.covariance * factor * factor

# We invert the squared bandwidth matrix here, to make sure it is done
# only once.
#
#   ::

	try:
          self.inv_bandwidth = linalg.inv(self.bandwidth)
	except Exception as e:
	   print_exc(e)
	   print "Bandwidth:", self.bandwidth
	   print "Covariance:", self.covariance
	   print "Factor:", factor
	   print "Dataset:", self.dataset
	   raise e

# The normalisation factor consists of three elements.
# Firstly, there is the factor :math:`\sqrt{2\pi}^d` which
# is found in the :math:`d`-variate Gaussian kernel.
# Secondly, there is the square root of the determinant of the bandwidth 
# matrix :math:`\sqrt{|h^2\Sigma^2|}`.  And finally we have the sample 
# size :math:`n` from the KDE itself.  Thus, the _norm_factor below
# is equal to
# 
# .. math::
#
#    2\pi^{d/2}\sqrt{|h^2\Sigma^2|} \cdot n =
#    \sqrt{|2\pi h^2\Sigma^2|} \cdot n.
#
# ::

        self._norm_factor = sqrt(linalg.det(2*pi*self.bandwidth)) * self.n

# Other methods
# =============
#
# No changes have been made to the following methods compared to
# :class:`scipy.stats.guassian_kde`.
# 
#   ::

    def resample(self, size=None):
        """Randomly sample a dataset from the estimated pdf.

        Parameters
        ----------
        size : int, optional
            The number of samples to draw.
            If not provided, then the size is the same as the underlying
            dataset.

        Returns
        -------
        dataset : (self.d, size)-array
            sampled dataset
        """

        if size is None:
            size = self.n

        norm = transpose(multivariate_normal(zeros((self.d,), float),
            self.bandwidth, size=size))
        indices = randint(0, self.n, size=size)
        means = self.dataset[:,indices]

        return means + norm

# Integration methods
# ===================
#
# No changes have been made to the integration methods compared to
# :class:`scipy.stats.guassian_kde`.
# 
#   ::

    def integrate_gaussian(self, mean, cov):
        """Multiply estimated density by a multivariate Gaussian and integrate
        over the wholespace.

        Parameters
        ----------
        mean : vector
            the mean of the Gaussian
        cov : matrix
            the covariance matrix of the Gaussian

        Returns
        -------
        result : scalar
            the value of the integral

        Raises
        ------
        ValueError if the mean or covariance of the input Gaussian differs from
        the KDE's dimensionality.
        """

        mean = atleast_1d(squeeze(mean))
        cov = atleast_2d(cov)

        if mean.shape != (self.d,):
            raise ValueError("mean does not have dimension %s" % self.d)
        if cov.shape != (self.d, self.d):
            raise ValueError("covariance does not have dimension %s" % self.d)

        # make mean a column vector
        mean = mean[:,newaxis]

        sum_cov = self.bandwidth + cov

        diff = self.dataset - mean
        tdiff = dot(linalg.inv(sum_cov), diff)

        energies = sum(diff*tdiff,axis=0)/2.0
        result = sum(exp(-energies),axis=0)/sqrt(linalg.det(2*pi*sum_cov))/self.n

        return result

    def integrate_box_1d(self, low, high):
        """Computes the integral of a 1D pdf between two bounds.

        Parameters
        ----------
        low : scalar
            lower bound of integration
        high : scalar
            upper bound of integration

        Returns
        -------
        value : scalar
            the result of the integral

        Raises
        ------
        ValueError if the KDE is over more than one dimension.
        """
        if self.d != 1:
            raise ValueError("integrate_box_1d() only handles 1D pdfs")

        stdev = ravel(sqrt(self.bandwidth))[0]

        normalized_low = ravel((low - self.dataset)/stdev)
        normalized_high = ravel((high - self.dataset)/stdev)

        value = np.mean(special.ndtr(normalized_high) -
                     special.ndtr(normalized_low))
        return value

    def integrate_box(self, low_bounds, high_bounds, maxpts=None):
        """Computes the integral of a pdf over a rectangular interval.

        Parameters
        ----------
        low_bounds : vector
            lower bounds of integration
        high_bounds : vector
            upper bounds of integration
        maxpts=None : int
            maximum number of points to use for integration

        Returns
        -------
        value : scalar
            the result of the integral
        """
        if maxpts is not None:
            extra_kwds = {'maxpts': maxpts}
        else:
            extra_kwds = {}

        value, inform = mvn.mvnun(low_bounds, high_bounds, self.dataset,
            self.bandwidth, **extra_kwds)
        if inform:
            msg = ('an integral in mvn.mvnun requires more points than %s' %
                (self.d*1000))
            warnings.warn(msg)

        return value

    def integrate_kde(self, other):
        """Computes the integral of the product of this  kernel density estimate
        with another.

        Parameters
        ----------
        other : gaussian_kde instance
            the other kde

        Returns
        -------
        value : scalar
            the result of the integral

        Raises
        ------
        ValueError if the KDEs have different dimensionality.
        """

        if other.d != self.d:
            raise ValueError("KDEs are not the same dimensionality")

        # we want to iterate over the smallest number of points
        if other.n < self.n:
            small = other
            large = self
        else:
            small = self
            large = other

        sum_cov = small.covariance + large.covariance
        result = 0.0
        for i in range(small.n):
            mean = small.dataset[:,i,newaxis]
            diff = large.dataset - mean
            tdiff = dot(linalg.inv(sum_cov), diff)

            energies = sum(diff*tdiff,axis=0)/2.0
            result += sum(exp(-energies),axis=0)

        result /= sqrt(linalg.det(2*pi*sum_cov))*large.n*small.n

        return result
-------------- next part --------------
## $Id: kdetools.py 2290 2011-04-13 19:46:36Z georg $
## -*- coding: utf-8 -*-

"""
Define additional tools for kernel density estimation, including
methods to estimate the entropy.

:Module:    itml.kdetools
:Author:    Hans Georg Schaathun
:Date:      $Date: 2011-04-13 21:46:36 +0200 (Wed, 13 Apr 2011) $
:Revision:  $Revision: 2290 $
:Copyright: ? 2011: Hans Georg Schaathun <georg at schaathun.net>
"""

print "[itml.kdetools] $Id: kdetools.py 2290 2011-04-13 19:46:36Z georg $"

# #######################
# KDE Additions and Tools
# #######################
# 
# .. automodule:: itml.kdetools
# 
# ::

import numpy as np
from .kde import gaussian_kde as gkde

# The main class
# ==============
#
# ::

class xkde(gkde):
   """
   Representation of a kernel-density estimate using Gaussian kernels,
   including methods to estimate entropy.
   """

# The constructor is completely overridden because other methods have
# changed their signatures.
#
#   ::

   def __init__(self,dataset,exclude=None,factor="scotts",bandwidth=None,
	 verbosity=2):

      self._verbosity = verbosity

# Firstly, store the dataset and its shape as in the :class:`gkde`
# constructor.  We use self.nprime to count the number of points
# used in the bandwidth calculation.
# 
#   ::

      self.dataset = np.atleast_2d(dataset)
      self.d, self.n = self.dataset.shape
      self.nprime = self.n

# Then we set the bandwidth using the relevant input arguments.
#
#   ::

      self.setBandwidthFactor( factor, recalculate=False )
      if bandwidth != None: self.setBandwidth( bandwidth=bandwidth )
      else: self.setBandwidth( exclude )

# We override :meth:`covariance_factor` because we do not use the
# full sample in the bandwidth calculation.  Therefore we use self.nprime
# in lieu of self.n.
#
#   ::

   def covariance_factor(self):
       """
       Calculate the scaling factor for the bandwidth matrix.
       This is intended for internal use only.
       """
       if callable(self._bwfactor): return self._bwfactor(self.nprime,self.d)
       else: return self._bwfactor

# Overriding the :meth:`setBandwidth` method is essential.  The new 
# bandwidth calculation is the main feature of the derived class.
#
#   ::

   def setBandwidth(self,exclude=None,*a,**kw):
      """Computes the covariance matrix for each Gaussian kernel using
      covariance_factor
      """

# If exclude is not given, we revert to the method of the parent class.
# We reset self.nprime just in case we are recalculating the bandwidth
# after having used a different bandwidth.
#
#   ::

      if exclude == None:
	 self.nprime = self.n
	 return gkde.setBandwidth(self,*a,**kw)

# Calculate the variance for each feature, and build the boolean
# matrix B indicating entries which at most equal to exclude times
# the standard deviation.  Squaring the data is faster, and thus
# preferred over calculating the standard deviation.
#
#   ::

      sigma2 = np.var( self.dataset, axis=1 )
      mu = np.mean( self.dataset, axis=1 )
      D = (self.dataset - mu) ** 2
      S = np.repeat(sigma2[:,np.newaxis],self.n,axis=1)
      B = ( D <= exclude * S )

# Now, we get the indices of all the data points with no feature
# exceeding exclude times the standard deviation, and include
# these points only when we calculate the covariance.
#
#   ::

      idx = [ i for i in xrange(self.n) if B[:,i].all() ]
      if self._verbosity > 0:
	 print "[setBandwidth] Excluding:", self.n - len(idx), \
	       (np.min(self.dataset,axis=1),
	        np.max(self.dataset,axis=1) ), self.covariance_factor()
      self.nprime = len(idx)
      cov = np.atleast_2d( np.cov(self.dataset[:,idx], rowvar=1, bias=False) )

# Check for NaN entries.  This should not happen, but in case it does,
# we provide verbose debug information.
#
#   ::

      if np.isnan(cov).any():
	 print "[setBandwidth] Covariance matrix contains NaN entries."
	 print "Original Covariance Matrix:", sigma2
	 print "Dataset:", self.nprime
	 print "Covariance Matrix with Exclusions:", cov
	 raise Exception, "Covariance matrix contains NaN entries."

# Now, we can use the parent class, feeding it the covariance matrix
# we have calculated.
#
#   ::

      gkde.setBandwidth(self,covariance=cov)

# Entropy Estimation
# ------------------
#
# There are several different formul? for estimating entropy.
# Each of them is based on having an estimate of the PDF first, 
# for instance in the form of a kernel density estimate.
# 
# An intuitive estimate of the entropy 2 is to substitute the estimated density
# for the theoretical density in the formula for entropy. However, this
# is not quite the favoured method in the literature.
#
# A very simple estimate to calculate is that of Ahmad and Lin,
# which we implement below.
# Contrary to other methods, it requires no integration, and thus
# it avoids the challenge of finding the optimal way of numerical 
# integration for the given dimensionality and sample size.
# We implement it as follows.
#
# ::

   def ahmadlin(self,verbosity=0):
     """
     Estimate the entropy using the formula of Ahmad and Lin.
     """

# First we evaluate the estimated PDF at all the sample points.
#
#    ::

     y = self.evaluate(np.atleast_2d(self.dataset))
     if verbosity > 2:
       print "[ahmadlin]", len(y), np.var(self.dataset), np.max(y), np.min(y)

# Then we take the logarithm of all the evaluations and add up to
# get the estimated entropy r.
#
#   ::

     return - np.sum(np.log(y)) / self.n

   contEntropy = ahmadlin

# Functions
# =========
#
# ::

def contEntropy(X,*a,**kw):
   """
   Estimate the entropy of a probability distribution based on the given
   sample X.
   """
   return xkde(X,*a,**kw).contEntropy()