[SciPy-User] stats.gaussian_kde sensitive to outliers
David Soukal
david.soukal at gmail.com
Sun Sep 29 16:55:17 EDT 2013
Hello folks,
I was studying kernel density estimation in "The Elements of Statistical
Learning" and playing with stats.gaussian_kde() to confirm my understanding.
I found that, contrary to my expectation, the estimates seem very sensitive
to outliers.
The code at the end of this email shows this visually. It generates n=1000
points from normal distribution (d) and adds an outlier to it (d_o). As you
can see in the graph the estimated density with the outlier looks very
different.
Inspecting the estimated density at 0 clearly shows that a lot of mass has
shifted when the outlier is present.
print(kde(0))
print(kde_o(0))
[ 0.00389799]
[ 0.0009715]
I thought this could be because of the automatic bandwidth estimate but it
doesn't seem so.
I was surprised at this behavior. I thought that an outlier should have
little or no impact on the density elsewhere since the Gaussian kernel is
exponentially decaying.
Fitting this in R did produce a more stable estimate.
Can you, please, help me understand what I'm missing? Is gaussian_kde()
doing something different than a standard smoothing where kernels are
placed at the sample points and added up?
THANKS,
David
#
import matplotlib.pylab as plt
import numpy as np
import scipy.stats as stats
# generate data with and without an outlier
n = 1000
d = np.random.normal(0, 100, n)
outlier = 50000
d_o = np.insert(d, 0, outlier)
# estimate the densities
kde = stats.gaussian_kde(d)
kde_o = stats.gaussian_kde(d_o)
# plot
xs = np.linspace(-400, 400, 1000)
fg, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (15, 5))
ax.hist(d, bins = 100, normed = True, alpha = 0.5, label = 'Histogram')
ax.plot(xs, kde(xs), linewidth = 2, alpha = 0.8, label = 'Gaussian KDE, no
outlier')
ax.plot(xs, kde_o(xs), linewidth = 2, alpha = 0.8, label = 'Gaussian KDE,
outlier')
ax.legend(loc = 'best')
R code for reference.
library(ggplot2)
c1 <- rnorm(1000, 0, 100)
d1 <- data.frame(x = c1)
c2 <- c(c1, 50000)
d2 <- data.frame(x = c2)
kde1 <- density(d1$x, kernel = 'gaussian')
kde2 <- density(d2$x, kernel = 'gaussian')
ggplot() +
geom_density(data = d1, aes(x = x), color = 'red') +
geom_density(data = d2, aes(x = x), color = 'green') +
coord_cartesian(xlim=c(-400, 400))
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20130929/927f5e51/attachment.html>
More information about the SciPy-User
mailing list