[SciPy-User] stats.gaussian_kde sensitive to outliers

David Soukal david.soukal at gmail.com
Sun Sep 29 16:55:17 EDT 2013


Hello folks,

I was studying kernel density estimation in "The Elements of Statistical
Learning" and playing with stats.gaussian_kde() to confirm my understanding.

I found that, contrary to my expectation, the estimates seem very sensitive
to outliers.

The code at the end of this email shows this visually. It generates n=1000
points from normal distribution (d) and adds an outlier to it (d_o). As you
can see in the graph the estimated density with the outlier looks very
different.

Inspecting the estimated density at 0 clearly shows that a lot of mass has
shifted when the outlier is present.

print(kde(0))
print(kde_o(0))

[ 0.00389799]
[ 0.0009715]


I thought this could be because of the automatic bandwidth estimate but it
doesn't seem so.

I was surprised at this behavior. I thought that an outlier should have
little or no impact on the density elsewhere since the Gaussian kernel is
exponentially decaying.

Fitting this in R did produce a more stable estimate.

Can you, please, help me understand what I'm missing? Is gaussian_kde()
doing something different than a standard smoothing where kernels are
placed at the sample points and added up?

THANKS,
David


#
import matplotlib.pylab as plt
import numpy as np
import scipy.stats as stats

# generate data with and without an outlier
n = 1000
d = np.random.normal(0, 100, n)

outlier = 50000
d_o = np.insert(d, 0, outlier)

# estimate the densities
kde = stats.gaussian_kde(d)
kde_o = stats.gaussian_kde(d_o)

# plot
xs = np.linspace(-400, 400, 1000)

fg, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (15, 5))
ax.hist(d, bins = 100, normed = True, alpha = 0.5, label = 'Histogram')
ax.plot(xs, kde(xs), linewidth = 2, alpha = 0.8, label = 'Gaussian KDE, no
outlier')
ax.plot(xs, kde_o(xs), linewidth = 2, alpha = 0.8, label = 'Gaussian KDE,
outlier')
ax.legend(loc = 'best')



R code for reference.

library(ggplot2)

c1 <- rnorm(1000, 0, 100)
d1 <- data.frame(x = c1)

c2 <- c(c1, 50000)
d2 <- data.frame(x = c2)

kde1 <- density(d1$x, kernel = 'gaussian')
kde2 <- density(d2$x, kernel = 'gaussian')

ggplot() +
  geom_density(data = d1, aes(x = x), color = 'red') +
  geom_density(data = d2, aes(x = x), color = 'green') +
  coord_cartesian(xlim=c(-400, 400))
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20130929/927f5e51/attachment.html>


More information about the SciPy-User mailing list