[SciPy-User] Probability Density Estimation

josef.pktd at gmail.com
Tue Apr 5 16:12:59 EDT 2011


On Tue, Apr 5, 2011 at 7:10 AM, Hans Georg Schaathun
<hg+scipy at schaathun.net> wrote:
> Hi,
>
> does anyone here have any experience with probability density
> estimation?
>
> I have found the scipy.stats.gaussian_kde, but I find that
> it is extremely sensitive to outliers.  I have datasets which
> tend to have outliers at up to about 35 standard deviations
> in a sample of 500-1000.  The PDF estimate then turns out to
> be close to uniform, and not at all useful.
>
> While I would be grateful for any pointers and advice on the
> numerical and algorithmic sides of my problem, this is a scipy
> list after all.  The scipy question is this: Are there other,
> non-Gaussian KDEs available that I have missed?  Or even
> KDEs that allow the bandwidth to be specified precisely?

There is a univariate KDE in statsmodels that allows a larger
choice of kernels; it is still a work in progress:

http://bazaar.launchpad.net/~m-j-a-crowe/statsmodels/mike-branch/files/head:/scikits/statsmodels/sandbox/nonparametric/
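For what it's worth, here is a minimal sketch using the KDEUnivariate class in the
released statsmodels API (the sandbox branch above may use different names), with a
float bandwidth and a non-Gaussian kernel; the data and numbers are only placeholders:

import numpy as np
import statsmodels.api as sm

data = np.random.standard_t(df=2, size=800)   # stand-in for your heavy-tailed sample

kde = sm.nonparametric.KDEUnivariate(data)
# fft=False is required for non-Gaussian kernels; 'epa' is the Epanechnikov kernel
kde.fit(kernel="epa", fft=False, bw=0.5)      # bw can also be "scott" or "silverman"

grid, density = kde.support, kde.density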

Here is a recipe for subclassing scipy.stats.gaussian_kde so that the
bandwidth can be set manually:

http://mail.scipy.org/pipermail/scipy-user/2010-January/023877.html
(I have misplaced the file right now, which happens twice a year.)
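In case that link goes stale, here is a minimal sketch of that kind of subclass;
covariance_factor returns the multiplier applied to the data covariance, and the
0.2 below is only an illustrative choice (newer scipy releases also accept a scalar
bw_method argument directly):

import numpy as np
from scipy import stats

class FixedBandwidthKDE(stats.gaussian_kde):
    """gaussian_kde with a fixed, user-chosen covariance factor."""
    def __init__(self, dataset, factor=0.1):
        self._factor = factor             # must be set before the parent __init__ runs
        super(FixedBandwidthKDE, self).__init__(dataset)

    def covariance_factor(self):
        # replaces Scott's rule with the fixed factor
        return self._factor

data = np.random.standard_t(df=2, size=800)   # stand-in for your sample
kde = FixedBandwidthKDE(data, factor=0.2)
xs = np.linspace(data.min(), data.max(), 400)
pdf = kde(xs)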

As an alternative, if they really are outliers, you could try to
identify them and remove them from the dataset before running the kde,
as sketched below. But maybe they are not outliers and you have a
distribution with heavy tails.
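If you do decide to trim, a minimal sketch using a robust z-score based on the
median absolute deviation (the 5-sigma cutoff and the data are only illustrative):

import numpy as np
from scipy import stats

data = np.random.standard_t(df=2, size=800)   # stand-in for your sample

med = np.median(data)
mad = np.median(np.abs(data - med))
robust_z = 0.6745 * (data - med) / mad        # ~standard-normal scale if data were normal

trimmed = data[np.abs(robust_z) < 5.0]        # drop the extreme points
kde = stats.gaussian_kde(trimmed)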
Another option would be to estimate a mixture model, where one of the
mixture components might capture the outliers (see the sketch below).
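scipy itself has no mixture estimator, but scikit-learn's GaussianMixture (not part
of this thread) gives a quick two-component sketch:

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.standard_t(df=2, size=800)   # stand-in for your sample
X = data.reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(X)  # one component may absorb the outliers
density = np.exp(gmm.score_samples(X))        # mixture density evaluated at the data points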

Josef

>
> TIA
> --
> :-- Hans Georg


