[SciPy-User] stats.gaussian_kde sensitive to outliers

David Soukal david.soukal at gmail.com
Sun Sep 29 17:11:11 EDT 2013


Thanks Josef. Seems that you're right, the covariance is indeed much
bigger. I only checked the bandwidth factor before (they are the same).
I'll study the source code to get a clearer understanding of the algorithm.

Thanks!

print(kde.covariance, kde.covariance_factor())

(array([[ 653.77896866]]), 0.25118864315095801)


print(kde_o.covariance, kde_o.covariance_factor())

(array([[ 158142.83936064]]), 0.25113843554287807)





On Sun, Sep 29, 2013 at 2:07 PM, <josef.pktd at gmail.com> wrote:

> On Sun, Sep 29, 2013 at 4:55 PM, David Soukal <david.soukal at gmail.com>
> wrote:
> > Hello folks,
> >
> > I was studying kernel density estimation in "The Elements of Statistical
> > Learning" and playing with stats.gaussian_kde() to confirm my
> understanding.
> >
> > I found that, contrary to my expectation, the estimates seem very
> sensitive
> > to outliers.
> >
> > The code at the end of this email shows this visually. It generates
> n=1000
> > points from normal distribution (d) and adds an outlier to it (d_o). As
> you
> > can see in the graph the estimated density with the outlier looks very
> > different.
> >
> > Inspecting the estimated density at 0 clearly shows that a lot of mass
> has
> > shifted when the outlier is present.
> >
> > print(kde(0))
> > print(kde_o(0))
> >
> > [ 0.00389799]
> > [ 0.0009715]
> >
> >
> > I thought this could be because of the automatic bandwidth estimate but
> it
> > doesn't seem so.
> >
> > I was surprised at this behavior. I thought that an outlier should have
> > little or no impact on the density elsewhere since the Gaussian kernel is
> > exponentially decaying.
>
> quick answer without having time to look at anything.
>
> IIRC:
> scipy gaussian_kde uses the sample variance or covariance matrix to
> scale the bandwidth.
> If you have outliers, then the variance covariance matrix can be
> heavily distorted.
>
> IIRC I have seen it mentioned that R also uses, or has an option to
> use, robust variance estimate (based on interquartile range).
>
> try to set the bandwidth in the outlier case to the same value as in
> the non-outlier case, then there should be only a minor influence on
> the kde except for the neighborhood of the outlier.
>
> Josef
>
>
> >
> > Fitting this in R did produce a more stable estimate.
> >
> > Can you, please, help me understand what I'm missing? Is gaussian_kde()
> > doing something different than a standard smoothing where kernels are
> placed
> > at the sample points and added up?
> >
> > THANKS,
> > David
> >
> >
> > #
> > import matplotlib.pylab as plt
> > import numpy as np
> > import scipy.stats as stats
> >
> > # generate data with and without an outlier
> > n = 1000
> > d = np.random.normal(0, 100, n)
> >
> > outlier = 50000
> > d_o = np.insert(d, 0, outlier)
> >
> > # estimate the densities
> > kde = stats.gaussian_kde(d)
> > kde_o = stats.gaussian_kde(d_o)
> >
> > # plot
> > xs = np.linspace(-400, 400, 1000)
> >
> > fg, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (15, 5))
> > ax.hist(d, bins = 100, normed = True, alpha = 0.5, label = 'Histogram')
> > ax.plot(xs, kde(xs), linewidth = 2, alpha = 0.8, label = 'Gaussian KDE,
> no
> > outlier')
> > ax.plot(xs, kde_o(xs), linewidth = 2, alpha = 0.8, label = 'Gaussian KDE,
> > outlier')
> > ax.legend(loc = 'best')
> >
> >
> >
> > R code for reference.
> >
> > library(ggplot2)
> >
> > c1 <- rnorm(1000, 0, 100)
> > d1 <- data.frame(x = c1)
> >
> > c2 <- c(c1, 50000)
> > d2 <- data.frame(x = c2)
> >
> > kde1 <- density(d1$x, kernel = 'gaussian')
> > kde2 <- density(d2$x, kernel = 'gaussian')
> >
> > ggplot() +
> >   geom_density(data = d1, aes(x = x), color = 'red') +
> >   geom_density(data = d2, aes(x = x), color = 'green') +
> >   coord_cartesian(xlim=c(-400, 400))
> >
> >
> >
> > _______________________________________________
> > SciPy-User mailing list
> > SciPy-User at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-user
> >
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20130929/67b729a5/attachment.html>


More information about the SciPy-User mailing list