[SciPy-User] Probability Density Estimation

josef.pktd at gmail.com
Wed Apr 6 11:47:26 EDT 2011


On Wed, Apr 6, 2011 at 7:29 AM, Hans Georg Schaathun
<hg+scipy at schaathun.net> wrote:
> [josef.pktd at gmail.com]
>> There is a univariate kde in statsmodels that allows for a larger
>> choice of kernels, it's still work in progress,
>
> Thanks.  Unfortunately, I will need multivariate.
>
> [Zachary Pincus]
>> Or, better, remove them from the dataset before calculating the
>> bandwidth, but add them back for the actual density estimation. Or
>> (effectively the same procedure), substitute in a robust covariance
>> estimator for the calls to numpy.cov (or whatever it is in there) --
>> look e.g. at the MCD method. (Very easy in 1D -- I have code for that
>> special case but not the general case.)
>
> That sounds like a very good idea.  It is a pity that gaussian_kde
> does not expose a good API for that sort of thing, but with a bit
> of reverse engineering and the pointer below, I am sure I shall manage.
> Thank you very much, both of you.
>
> There is little doubt that the distribution has heavy (or at least long)
> tails combined with an extreme peak around the mean.
>
> What I am actually trying to do is to estimate differential mutual
> information (between a multivariate continuous distribution (feature space)
> and a boolean classification label for machine learning).
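[Editor's note: the outlier-trimming idea quoted above (bandwidth from a cleaned sample, density over every point) is easy to prototype without touching gaussian_kde's version-dependent internals. The sketch below is self-contained: the median/MAD trimming rule and the `trim_sd` parameter are my illustrative choices, not anything prescribed in the thread.]

```python
import numpy as np

def trimmed_bw_kde(dataset, points, trim_sd=3.0):
    """Gaussian KDE evaluated at `points` (d x m) for `dataset` (d x n).

    Sketch of the robust-bandwidth trick: Scott's rule is applied to the
    covariance of a trimmed sample (points beyond `trim_sd` robust standard
    deviations dropped), but the density itself averages kernels over *all*
    points, so the heavy tails still contribute mass.
    """
    dataset = np.atleast_2d(np.asarray(dataset, float))
    points = np.atleast_2d(np.asarray(points, float))
    d, n = dataset.shape
    # Per-coordinate robust trimming via median and MAD (1.4826 * MAD
    # approximates the standard deviation for Gaussian data).
    med = np.median(dataset, axis=1, keepdims=True)
    mad = 1.4826 * np.median(np.abs(dataset - med), axis=1, keepdims=True)
    keep = np.all(np.abs(dataset - med) <= trim_sd * (mad + 1e-12), axis=0)
    # Scott's factor and covariance from the trimmed sample only.
    factor = keep.sum() ** (-1.0 / (d + 4))
    cov = np.atleast_2d(np.cov(dataset[:, keep])) * factor**2
    inv_cov = np.linalg.inv(cov)
    norm = np.sqrt(np.linalg.det(2 * np.pi * cov)) * n
    # Kernel sums over the *full* dataset.
    diff = dataset[:, None, :] - points[:, :, None]          # d x m x n
    energy = np.einsum('dmn,de,emn->mn', diff, inv_cov, diff) / 2.0
    return np.exp(-energy).sum(axis=1) / norm
```

A single 35-sigma outlier then inflates neither the covariance nor Scott's factor, yet still shows up as a small bump in the estimated tail.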

Sounds a bit familiar, though with a different use case in mind:
http://bazaar.launchpad.net/~scipystats/statsmodels/devel/view/head:/scikits/statsmodels/sandbox/distributions/mv_measures.py

and the bug report against myself
https://bugs.launchpad.net/statsmodels/+bug/717511 (see description)
because I also needed a more flexible kde.

If you have or find any results and are willing to share them, I would
be very interested.

My objective was measures for general (non-linear) dependence between
two random variables, and tests for independence.
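[Editor's note: for the case discussed above (continuous X, boolean label Y), the mutual information decomposes as I(X;Y) = E[log p(x|y) - log p(x)], which suggests a simple resubstitution estimate from per-class and pooled KDEs. A rough sketch, not the statsmodels code linked above:]

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Resubstitution KDE estimate of I(X; Y) in nats, for continuous
    features x (d x n) and boolean labels y (n,).

    Averages log p(x_i | y_i) - log p(x_i) over the sample, with the
    densities replaced by per-class and pooled Gaussian KDEs.
    """
    x = np.atleast_2d(np.asarray(x, float))
    y = np.asarray(y, bool)
    n = x.shape[1]
    kde_all = gaussian_kde(x)          # pooled density p(x)
    total = 0.0
    for label in (False, True):
        xs = x[:, y == label]
        if xs.shape[1] < 2:
            continue                   # too few points to fit a KDE
        kde_c = gaussian_kde(xs)       # class-conditional density p(x|c)
        # mean log-ratio over this class, weighted by class frequency
        total += np.mean(np.log(kde_c(xs)) - np.log(kde_all(xs))) * xs.shape[1]
    return total / n
```

Resubstitution estimates like this are biased (each class KDE is evaluated at its own training points), so they are better suited to ranking features than to reporting absolute information values.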

Josef

> This means that I have two samples which should have rather similar
> distributions.  The outlier at 35 standard deviations will likely
> show up only in one of the two, but it is caused by actual heavy
> tails which should exist in both distributions.  The result is that the
> two estimated distributions differ more than they should.
>
> [josef.pktd at gmail.com]
>>> Here is a recipe for subclassing scipy.stats.gaussian_kde to set the
>> bandwidth manually:
>>
>> http://mail.scipy.org/pipermail/scipy-user/2010-January/023877.html
>> (I have misplaced the file right now, which happens twice a year.)
>
> --
> :-- Hans Georg
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
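[Editor's note: the linked recipe is from 2010 and scipy's gaussian_kde internals have shifted since, but the subclassing idea itself is version-stable: override `covariance_factor` so the bandwidth factor is whatever you pass in, rather than Scott's or Silverman's rule. A minimal sketch:]

```python
import numpy as np
from scipy.stats import gaussian_kde

class FixedBandwidthKDE(gaussian_kde):
    """gaussian_kde with a user-chosen bandwidth factor instead of
    Scott's or Silverman's rule (a sketch of the linked recipe)."""

    def __init__(self, dataset, factor):
        # Must be set before __init__, which computes the covariance
        # and calls covariance_factor() in the process.
        self._factor = float(factor)
        super().__init__(dataset)

    def covariance_factor(self):
        return self._factor
```

On scipy versions from 0.11 onward the subclass is no longer strictly necessary: passing a scalar as `bw_method` (e.g. `gaussian_kde(data, bw_method=0.5)`) sets the factor directly and gives the same density.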


