[SciPy-Dev] Expanding Scipy's KDE functionality

Thu Jan 24 14:13:16 EST 2013

On Thu, Jan 24, 2013 at 9:49 AM, Daniel Smith <smith.daniel.br at gmail.com> wrote:
> Ok, let's see if I can respond to everyone's comments.
>
> >From Jake:
>
>> 2) The algorithm seems limited to one or maybe two dimensions.
>> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
>> difficult to find a fit for this bandwidth selection method. One option
>> might be to allow this bandwidth selection method via a flag in
>> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
>> too high.  To do that, your code would need to be reworked fairly
>> extensively to fit in the gaussian_kde class.
>
> In principal, this method can be applied in N dimensions. However, I
> think it would be unwise to do so. The method requires that you
> simultaneously estimate the density and the bandwidth. Because of
> that, you have to implement the method on some mesh, and mesh size
> grows exponentially with the number of dimensions. The code certainly
> could be reworked to work in N dimensions, but I don't think it would
> be effective enough to be worth the effort. The results are also
> primarily used for visualization, which is useless beyond 2-d.
>
> >From Ralf:
>
>> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation method but an adaptive density estimator.
>
> It's both. The bandwidth estimate falls out of the density estimate.
> That bandwidth estimate could be easily used to generate an estimate
> on a different mesh.
>
>> My suggestion would be to implement the density estimator and do a good amount of performance testing, at least show that the performance is as good as described in table 1 of the paper.
>
> I can certainly do that. I will post here when the tests are up and running.
>
> >From Barbier de Reuille Pierre:
>
>> It should be easy to separate them and use the estimation of the bandwidth without the density estimation.
>
> Unfortunately, that is not the case. The bandwidth estimate is
> generated from a fixed point calculation based on the norm of a
> derivative of the estimated density. Unless I am missing something, it
> would not be possible to estimate that derivative without an explicit
> density estimate. Fourier coordinates are used because the derivative
> estimate is simpler in those coordinates.
>
>> For example, as stated in the paper, the method is equivalent to a reflexion method with a gaussian kernel. But the renormalisation method gives very similar results too, without enforcing
>> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
>
> I have not currently implemented any boundary corrections, but it
> would not be difficult to implement the renormalization method using
> the bandwidth estimate from this method. It would require a second
> density estimate, but the estimate would be much, much better than the
> current code.
>
>> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-domains (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
>
> In fact, the method currently only works on unbounded domains. The
> exact domain you calculate the density on is an optional parameter to
> the density estimator function. The actual domain you calculate on has
> to be finite because a finite mesh is needed.

To the domain question:

Besides the boundary problem in bounded domains, there is also the
problem with unbounded domains, that the tails might not be well
captured by a kde, especially with heavier tails.

One idea I would have liked to borrow from matlab since I saw it the
first time a few years ago, is
http://www.mathworks.com/help/stats/paretotails.html
kde (or other nonparametric density) in the middle, paretotails at the ends.
but I never got around to coding this.

Josef

>
>> I have a different concern though: is it normal that the density returned by the method is not normalized? (i.e. the integral of the density is far from being one).
>
> That's a bug. I can fix that with one line of code. I have always just
> used the density estimate without units, because they aren't
> particularly informative. However, the output should be normalized, or
> at least a flag included to make it so.
>
>
>
> It seems like the next step is to set up a testing regime for
> comparison to the two existing methods to compare speed and reproduce
> the data from Table 1 in the paper. Also, it seems likely that
> statsmodels is the more appropriate setting for this project. In
> particular, I want to generalize the method to periodic domains, which
> appears to be a novel implementation so more intensive testing will
> likely be needed.
>
> Thanks,
> Daniel
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev