[SciPy-User] Probability Density Estimation
Hans Georg Schaathun
hg+scipy at schaathun.net
Fri Apr 29 15:01:35 EDT 2011
Dear all,
this is a bit overdue; thanks a lot to everyone who helped me
with my KDE trouble three weeks ago.
On Wed, Apr 06, 2011 at 11:47:26AM -0400, josef.pktd at gmail.com wrote:
> If you have or find some results and are willing to share, then I will
> be very interested.
The solution to my problem was to make sure that I use the same
bandwidth for all the KDE estimations relating to the same
variable, i.e. for P(X), P(X|Y=1) and P(X|Y=0). Now the results
seem plausible, even though I did not get the results I was hoping
for :-)
I attach a monkey-patched and annotated version of gaussian_kde making
it easier to choose the bandwidth. The main reason I monkey-patched
rather than inherit is that I needed to annotate it to understand what
I was doing. The comments are intended for pylit and sphinx, but do
not work as well as I would have hoped.
The second attachment is a derived class supporting entropy estimation.
I have not done any really serious testing, and may have made mistakes.
Any feedback would be welcome.
--
:-- Hans Georg
-------------- next part --------------
## $Id: kde.py 2476 2011-04-29 18:51:57Z georg $
## -*- coding: utf-8 -*-
"""
Define classes for (uni/multi)-variate kernel density estimation.
Currently, only Gaussian kernels are implemented.
:Module: itml.kde
:Author: Robert Kern
:Date: $Date: 2011-04-29 20:51:57 +0200 (Fri, 29 Apr 2011) $
:Revision: $Revision: 2476 $
:Copyright: 2004-2005 by Enthought, Inc.
? 2011: Hans Georg Schaathun <georg at schaathun.net>
:Original Date: 2004-08-09
:Modified: 2005-02-10 by Robert Kern. Contributed to Scipy
2005-10-07 by Robert Kern. Some fixes to match the new scipy_core
2011-04-13 by Hans Georg Schaathun.
"""
print "[itml.kde] $Id: kde.py 2476 2011-04-29 18:51:57Z georg $"
# #########################
# Kernel density estimation
# #########################
#
# .. automodule:: itml.kde
#
# Standard library imports::
import warnings
# Scipy imports::
from scipy import linalg, special
from numpy import atleast_2d, reshape, zeros, newaxis, dot, exp, pi, sqrt, \
ravel, power, atleast_1d, squeeze, sum, transpose
import numpy as np
from numpy.random import randint, multivariate_normal
# Local imports::
#import stats
from scipy.stats import mvn
from . import *
__all__ = ['gaussian_kde', ]
# Auxiliary functions
# ===================
#
# .. autofunction:: scotts_factor
#
# .. autofunction:: silverman_factor
#
# Both functions calculate a bandwidth scaling factor based on the sample
# size n and dimensionality d.
#
# ::
def scotts_factor(n,d):
return power(n, -1./(d+4))
def silverman_factor(n,d):
return power(n*(d+2.0)/4.0, -1./(d+4))
# The class
# =========
#
# .. autoclass:: gaussian_kde
class gaussian_kde(object):
"""
Representation of a kernel-density estimate using Gaussian kernels.
Attributes
----------
d : int
number of dimensions
n : int
number of datapoints
Methods
-------
kde.evaluate(points) : array
evaluate the estimated pdf on a provided set of points
kde(points) : array
same as kde.evaluate(points)
kde.setBandwidth(bandwidth) : array
sets the bandwidth matrix, or if H is not specified,
recalculates a bandwidth matrix from the covariance
kde.setBandwidthFactor(factor) : array
sets the scaling factor and recalculates the bandwidth
matrix from the covariance.
kde.integrate_gaussian(mean, cov) : float
multiply pdf with a specified Gaussian and integrate over the
whole domain
kde.integrate_box_1d(low, high) : float
integrate pdf (1D only) between two bounds
kde.integrate_box(low_bounds, high_bounds) : float
integrate pdf over a rectangular space between low_bounds and
high_bounds
kde.integrate_kde(other_kde) : float
integrate two kernel density estimates multiplied together
kde.resample(size=None) : array
randomly sample a dataset from the estimated pdf.
Internal Methods
----------------
kde.covariance_factor() : float
computes the coefficient that multiplies the data covariance matrix
to obtain the kernel covariance matrix.
Parameters
----------
dataset : (# of dims, # of data)-array
datapoints to estimate from
factor : (optional) function, scalar, or string
optionally specify a scaling factor to use to calculate the
bandwidth matrix. It may be a function taking sample size
and dimension as inputs, a scalar, or a string "scotts" or
"silverman" for the standard scaling factors. If a bandwidth
matrix (below) is specified, then factor is ignored.
bandwidth : optional, (# dims, # dims)-array
optionally specify the bandwidth matrix to use with the kernel
"""
# The following dictionary is used to interpret bandwidth factors specified
# as prescriptive strings, and look up corresponding functions.
# ::
bwfactor = {
"scott" : scotts_factor,
"scotts" : scotts_factor,
"silverman" : silverman_factor,
"sc" : scotts_factor,
"si" : silverman_factor,
}
# The constructor will only store the data set and compute sample size,
# dimension, and covariance matrix. All the important calculations are
# made later, upon evaluation or integration.
#
# ::
def __init__(self, dataset, factor="scotts", bandwidth=None ):
self.dataset = atleast_2d(dataset)
self.d, self.n = self.dataset.shape
if bandwidth != None: self.setBandwidth( bandwidth )
else: self.setBandwidthFactor( factor )
# This is the most important method, evaluating the estimated function
# at given points.
#
# .. automethod:: gaussian_kde.evaluate
def evaluate(self, points):
"""
Evaluate the estimated pdf on a set of points.
Parameters
----------
points : (# of dimensions, # of points)-array
Alternatively, a (# of dimensions,) vector can be passed in and
treated as a single point.
Returns
-------
values : (# of points,)-array
The values at each point.
Raises
------
ValueError if the dimensionality of the input points is different than
the dimensionality of the KDE.
"""
# The given evaluation point(s) are converted to a 2-D array of the
# same type as the original sample points.
#
# ::
points = atleast_2d(points).astype(self.dataset.dtype)
# We check that the dimensions match. If a single point was passed
# as a row vector, we can simply transpose it. Any other mismatch
# leads to an exception.
#
# ::
d, m = points.shape
if d != self.d:
if d == 1 and m == self.d:
# points was passed in as a row vector
points = reshape(points, (self.d, 1))
m = 1
else:
msg = "points have dimension %s, dataset has dimension %s" % (d,
self.d)
raise ValueError(msg)
# We initialise the return array with all zeros.
#
# ::
result = zeros((m,), points.dtype)
# For the sake of computational efficiency, we distinguish
# the cases where the sample is smaller or larger than the
# set of evaluation points.
# If there are more points than data, then we loop over data,
# otherwise we will loop over the points.
# The algorithms for the two cases are equivalent,
# but the second case is easier to explain so we leave this
# one uncommented.
#
# ::
if m >= self.n:
for i in range(self.n):
diff = self.dataset[:,i,newaxis] - points
tdiff = dot(self.inv_bandwidth, diff)
energy = sum(diff*tdiff,axis=0)/2.0
result += exp(-energy)
else:
for i in range(m):
# Now, we loop over points.
# Each iteration considers a single evaluation point :math:`x_i`,
# and we calculate the following formula which is the definition
# of the Gaussian KDE.
#
# .. math::
#
# \hat f(x_i) = \frac1{n}\sum_{j=1}^n s_H(X_j-x_i),
#
# where :math:`s_H` is the scaled Gaussian kernel defined,
# for a given bandwidth matrix :math:`H` as
#
# .. math::
#
# s_H(\mathbf{x}) = \frac1{\sqrt{|2\pi H^2|}}
# \exp\big( \frac12 \mathbf{x}^T\cdot (H^2)^{-1}\cdot\mathbf{x} \big).
#
# The bandwidth :math:`H` is calculated by the :meth:`setBandwidth`
# method, and is usually given as :math:`H^2 = h^2\cdot\Sigma^2` where
# :math:`\Sigma^2` is the covariance matrix and :math:`h` is a scalar
# scaling factor.
#
# First we calculate the differences :math:`X_j-x_i` for
# each sample point :math:`X_j` and the evaluation point :math:`x_i`.
# Each column of diff is a difference :math:`X_j - x_i`.
#
# ::
diff = self.dataset - points[:,i,newaxis]
# Now we want to calculate
# :math:`(X_j-x_i)^T\dot\Sigma^{-1}\dot (X_j-x_i)/2` for each :math:`j`
# where :math:`\Sigma` is the covariance matrix.
#
# ::
tdiff = dot(self.inv_bandwidth, diff)
energy = sum(diff*tdiff,axis=0)/2.0
# Each entry in energy corresponds to one sample point :math:`X_j`.
# We apply the exponential function and sum over :math:`j` obtaining the
# result :math:`\hat f(x_i)` as result[i].
#
# ::
result[i] = sum(exp(-energy),axis=0)
# Finally, we divide by the normalisation factor
# :math:`n\sqrt{|2\pi\Sigma|}`.
#
# ::
result /= self._norm_factor
return result
# Just for syntactic sugar, we make the object callable.
# Calling the object is equivalent to calling :meth:`evaluate`.
#
# ::
__call__ = evaluate
# Covariance and Bandwidth
# ========================
#
# ::
def covariance_factor(self):
"""
Calculate the scaling factor for the bandwidth matrix.
This is intended for internal use only.
"""
if callable(self._bwfactor): return self._bwfactor(self.n,self.d)
else: return self._bwfactor
def setBandwidthFactor(self,factor="scotts",recalculate=True):
if self.bwfactor.has_key(factor): self._bwfactor = self.bwfactor[factor]
elif callable(factor): self._bwfactor = factor
else: self._bwfactor = float(factor)
if recalculate: self.setBandwidth()
def getBandwidth(self):
return self.bandwidth
def setBandwidth(self,bandwidth=None,covariance=None):
"""Computes the covariance matrix for each Gaussian kernel using
covariance_factor
"""
# If the bandwidth is given, we use it. Otherwise we calculate
# a recommended bandwidth from the covariance matrix.
#
# ::
if bandwidth != None:
self.covariance = None
self.bandwidth = bandwidth
else:
factor = self.covariance_factor()
# The covariance attribute is the covariance matrix :math:`\Sigma^2`,
# and the bandwidth attribute is the squared bandwidth matrix
# :math:`\vec{H} = (h\Sigma)^2`.
#
# ::
if covariance == None:
self.covariance = atleast_2d(
np.cov(self.dataset, rowvar=1, bias=False) )
else: self.covariance = covariance
self.bandwidth = self.covariance * factor * factor
# We invert the squared bandwidth matrix here, to make sure it is done
# only once.
#
# ::
try:
self.inv_bandwidth = linalg.inv(self.bandwidth)
except Exception as e:
print_exc(e)
print "Bandwidth:", self.bandwidth
print "Covariance:", self.covariance
print "Factor:", factor
print "Dataset:", self.dataset
raise e
# The normalisation factor consists of three elements.
# Firstly, there is the factor :math:`\sqrt{2\pi}^d` which
# is found in the :math:`d`-variate Gaussian kernel.
# Secondly, there is the square root of the determinant of the bandwidth
# matrix :math:`\sqrt{|h^2\Sigma^2|}`. And finally we have the sample
# size :math:`n` from the KDE itself. Thus, the _norm_factor below
# is equal to
#
# .. math::
#
# 2\pi^{d/2}\sqrt{|h^2\Sigma^2|} \cdot n =
# \sqrt{|2\pi h^2\Sigma^2|} \cdot n.
#
# ::
self._norm_factor = sqrt(linalg.det(2*pi*self.bandwidth)) * self.n
# Other methods
# =============
#
# No changes have been made to the following methods compared to
# :class:`scipy.stats.guassian_kde`.
#
# ::
def resample(self, size=None):
"""Randomly sample a dataset from the estimated pdf.
Parameters
----------
size : int, optional
The number of samples to draw.
If not provided, then the size is the same as the underlying
dataset.
Returns
-------
dataset : (self.d, size)-array
sampled dataset
"""
if size is None:
size = self.n
norm = transpose(multivariate_normal(zeros((self.d,), float),
self.bandwidth, size=size))
indices = randint(0, self.n, size=size)
means = self.dataset[:,indices]
return means + norm
# Integration methods
# ===================
#
# No changes have been made to the integration methods compared to
# :class:`scipy.stats.guassian_kde`.
#
# ::
def integrate_gaussian(self, mean, cov):
"""Multiply estimated density by a multivariate Gaussian and integrate
over the wholespace.
Parameters
----------
mean : vector
the mean of the Gaussian
cov : matrix
the covariance matrix of the Gaussian
Returns
-------
result : scalar
the value of the integral
Raises
------
ValueError if the mean or covariance of the input Gaussian differs from
the KDE's dimensionality.
"""
mean = atleast_1d(squeeze(mean))
cov = atleast_2d(cov)
if mean.shape != (self.d,):
raise ValueError("mean does not have dimension %s" % self.d)
if cov.shape != (self.d, self.d):
raise ValueError("covariance does not have dimension %s" % self.d)
# make mean a column vector
mean = mean[:,newaxis]
sum_cov = self.bandwidth + cov
diff = self.dataset - mean
tdiff = dot(linalg.inv(sum_cov), diff)
energies = sum(diff*tdiff,axis=0)/2.0
result = sum(exp(-energies),axis=0)/sqrt(linalg.det(2*pi*sum_cov))/self.n
return result
def integrate_box_1d(self, low, high):
"""Computes the integral of a 1D pdf between two bounds.
Parameters
----------
low : scalar
lower bound of integration
high : scalar
upper bound of integration
Returns
-------
value : scalar
the result of the integral
Raises
------
ValueError if the KDE is over more than one dimension.
"""
if self.d != 1:
raise ValueError("integrate_box_1d() only handles 1D pdfs")
stdev = ravel(sqrt(self.bandwidth))[0]
normalized_low = ravel((low - self.dataset)/stdev)
normalized_high = ravel((high - self.dataset)/stdev)
value = np.mean(special.ndtr(normalized_high) -
special.ndtr(normalized_low))
return value
def integrate_box(self, low_bounds, high_bounds, maxpts=None):
"""Computes the integral of a pdf over a rectangular interval.
Parameters
----------
low_bounds : vector
lower bounds of integration
high_bounds : vector
upper bounds of integration
maxpts=None : int
maximum number of points to use for integration
Returns
-------
value : scalar
the result of the integral
"""
if maxpts is not None:
extra_kwds = {'maxpts': maxpts}
else:
extra_kwds = {}
value, inform = mvn.mvnun(low_bounds, high_bounds, self.dataset,
self.bandwidth, **extra_kwds)
if inform:
msg = ('an integral in mvn.mvnun requires more points than %s' %
(self.d*1000))
warnings.warn(msg)
return value
def integrate_kde(self, other):
"""Computes the integral of the product of this kernel density estimate
with another.
Parameters
----------
other : gaussian_kde instance
the other kde
Returns
-------
value : scalar
the result of the integral
Raises
------
ValueError if the KDEs have different dimensionality.
"""
if other.d != self.d:
raise ValueError("KDEs are not the same dimensionality")
# we want to iterate over the smallest number of points
if other.n < self.n:
small = other
large = self
else:
small = self
large = other
sum_cov = small.covariance + large.covariance
result = 0.0
for i in range(small.n):
mean = small.dataset[:,i,newaxis]
diff = large.dataset - mean
tdiff = dot(linalg.inv(sum_cov), diff)
energies = sum(diff*tdiff,axis=0)/2.0
result += sum(exp(-energies),axis=0)
result /= sqrt(linalg.det(2*pi*sum_cov))*large.n*small.n
return result
-------------- next part --------------
## $Id: kdetools.py 2290 2011-04-13 19:46:36Z georg $
## -*- coding: utf-8 -*-
"""
Define additional tools for kernel density estimation, including
methods to estimate the entropy.
:Module: itml.kdetools
:Author: Hans Georg Schaathun
:Date: $Date: 2011-04-13 21:46:36 +0200 (Wed, 13 Apr 2011) $
:Revision: $Revision: 2290 $
:Copyright: ? 2011: Hans Georg Schaathun <georg at schaathun.net>
"""
print "[itml.kdetools] $Id: kdetools.py 2290 2011-04-13 19:46:36Z georg $"
# #######################
# KDE Additions and Tools
# #######################
#
# .. automodule:: itml.kdetools
#
# ::
import numpy as np
from .kde import gaussian_kde as gkde
# The main class
# ==============
#
# ::
class xkde(gkde):
"""
Representation of a kernel-density estimate using Gaussian kernels,
including methods to estimate entropy.
"""
# The constructor is completely overridden because other methods have
# changed their signatures.
#
# ::
def __init__(self,dataset,exclude=None,factor="scotts",bandwidth=None,
verbosity=2):
self._verbosity = verbosity
# Firstly, store the dataset and its shape as in the :class:`gkde`
# constructor. We use self.nprime to count the number of points
# used in the bandwidth calculation.
#
# ::
self.dataset = np.atleast_2d(dataset)
self.d, self.n = self.dataset.shape
self.nprime = self.n
# Then we set the bandwidth using the relevant input arguments.
#
# ::
self.setBandwidthFactor( factor, recalculate=False )
if bandwidth != None: self.setBandwidth( bandwidth=bandwidth )
else: self.setBandwidth( exclude )
# We override :meth:`covariance_factor` because we do not use the
# full sample in the bandwidth calculation. Therefore we use self.nprime
# in lieu of self.n.
#
# ::
def covariance_factor(self):
"""
Calculate the scaling factor for the bandwidth matrix.
This is intended for internal use only.
"""
if callable(self._bwfactor): return self._bwfactor(self.nprime,self.d)
else: return self._bwfactor
# Overriding the :meth:`setBandwidth` method is essential. The new
# bandwidth calculation is the main feature of the derived class.
#
# ::
def setBandwidth(self,exclude=None,*a,**kw):
"""Computes the covariance matrix for each Gaussian kernel using
covariance_factor
"""
# If exclude is not given, we revert to the method of the parent class.
# We reset self.nprime just in case we are recalculating the bandwidth
# after having used a different bandwidth.
#
# ::
if exclude == None:
self.nprime = self.n
return gkde.setBandwidth(self,*a,**kw)
# Calculate the variance for each feature, and build the boolean
# matrix B indicating entries which at most equal to exclude times
# the standard deviation. Squaring the data is faster, and thus
# preferred over calculating the standard deviation.
#
# ::
sigma2 = np.var( self.dataset, axis=1 )
mu = np.mean( self.dataset, axis=1 )
D = (self.dataset - mu) ** 2
S = np.repeat(sigma2[:,np.newaxis],self.n,axis=1)
B = ( D <= exclude * S )
# Now, we get the indices of all the data points with no feature
# exceeding exclude times the standard deviation, and include
# these points only when we calculate the covariance.
#
# ::
idx = [ i for i in xrange(self.n) if B[:,i].all() ]
if self._verbosity > 0:
print "[setBandwidth] Excluding:", self.n - len(idx), \
(np.min(self.dataset,axis=1),
np.max(self.dataset,axis=1) ), self.covariance_factor()
self.nprime = len(idx)
cov = np.atleast_2d( np.cov(self.dataset[:,idx], rowvar=1, bias=False) )
# Check for NaN entries. This should not happen, but in case it does,
# we provide verbose debug information.
#
# ::
if np.isnan(cov).any():
print "[setBandwidth] Covariance matrix contains NaN entries."
print "Original Covariance Matrix:", sigma2
print "Dataset:", self.nprime
print "Covariance Matrix with Exclusions:", cov
raise Exception, "Covariance matrix contains NaN entries."
# Now, we can use the parent class, feeding it the covariance matrix
# we have calculated.
#
# ::
gkde.setBandwidth(self,covariance=cov)
# Entropy Estimation
# ------------------
#
# There are several different formul? for estimating entropy.
# Each of them is based on having an estimate of the PDF first,
# for instance in the form of a kernel density estimate.
#
# An intuitive estimate of the entropy 2 is to substitute the estimated density
# for the theoretical density in the formula for entropy. However, this
# is not quite the favoured method in the literature.
#
# A very simple estimate to calculate is that of Ahmad and Lin,
# which we implement below.
# Contrary to other methods, it requires no integration, and thus
# it avoids the challenge of finding the optimal way of numerical
# integration for the given dimensionality and sample size.
# We implement it as follows.
#
# ::
def ahmadlin(self,verbosity=0):
"""
Estimate the entropy using the formula of Ahmad and Lin.
"""
# First we evaluate the estimated PDF at all the sample points.
#
# ::
y = self.evaluate(np.atleast_2d(self.dataset))
if verbosity > 2:
print "[ahmadlin]", len(y), np.var(self.dataset), np.max(y), np.min(y)
# Then we take the logarithm of all the evaluations and add up to
# get the estimated entropy r.
#
# ::
return - np.sum(np.log(y)) / self.n
contEntropy = ahmadlin
# Functions
# =========
#
# ::
def contEntropy(X,*a,**kw):
"""
Estimate the entropy of a probability distribution based on the given
sample X.
"""
return xkde(X,*a,**kw).contEntropy()
More information about the SciPy-User
mailing list