[SciPy-Dev] Expanding Scipy's KDE functionality

Thu Jan 24 11:19:28 EST 2013

On Thu, Jan 24, 2013 at 9:49 AM, Daniel Smith <smith.daniel.br at gmail.com> wrote:
> Ok, let's see if I can respond to everyone's comments.
>
> >From Jake:
>
>> 2) The algorithm seems limited to one or maybe two dimensions.
>> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
>> difficult to find a fit for this bandwidth selection method. One option
>> might be to allow this bandwidth selection method via a flag in
>> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
>> too high.  To do that, your code would need to be reworked fairly
>> extensively to fit in the gaussian_kde class.
>
> In principal, this method can be applied in N dimensions. However, I
> think it would be unwise to do so. The method requires that you
> simultaneously estimate the density and the bandwidth. Because of
> that, you have to implement the method on some mesh, and mesh size
> grows exponentially with the number of dimensions. The code certainly
> could be reworked to work in N dimensions, but I don't think it would
> be effective enough to be worth the effort. The results are also
> primarily used for visualization, which is useless beyond 2-d.
>
> >From Ralf:
>
>> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation method but an adaptive density estimator.
>
> It's both. The bandwidth estimate falls out of the density estimate.
> That bandwidth estimate could be easily used to generate an estimate
> on a different mesh.
>
>> My suggestion would be to implement the density estimator and do a good amount of performance testing, at least show that the performance is as good as described in table 1 of the paper.
>
> I can certainly do that. I will post here when the tests are up and running.
>
> >From Barbier de Reuille Pierre:
>
>> It should be easy to separate them and use the estimation of the bandwidth without the density estimation.
>
> Unfortunately, that is not the case. The bandwidth estimate is
> generated from a fixed point calculation based on the norm of a
> derivative of the estimated density. Unless I am missing something, it
> would not be possible to estimate that derivative without an explicit
> density estimate. Fourier coordinates are used because the derivative
> estimate is simpler in those coordinates.
>
>> For example, as stated in the paper, the method is equivalent to a reflexion method with a gaussian kernel. But the renormalisation method gives very similar results too, without enforcing
>> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
>
> I have not currently implemented any boundary corrections, but it
> would not be difficult to implement the renormalization method using
> the bandwidth estimate from this method. It would require a second
> density estimate, but the estimate would be much, much better than the
> current code.
>
>> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-domains (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
>
> In fact, the method currently only works on unbounded domains. The
> exact domain you calculate the density on is an optional parameter to
> the density estimator function. The actual domain you calculate on has
> to be finite because a finite mesh is needed.
>
>> I have a different concern though: is it normal that the density returned by the method is not normalized? (i.e. the integral of the density is far from being one).
>
> That's a bug. I can fix that with one line of code. I have always just
> used the density estimate without units, because they aren't
> particularly informative. However, the output should be normalized, or
> at least a flag included to make it so.
>
>
>
> It seems like the next step is to set up a testing regime for
> comparison to the two existing methods to compare speed and reproduce
> the data from Table 1 in the paper. Also, it seems likely that
> statsmodels is the more appropriate setting for this project. In
> particular, I want to generalize the method to periodic domains, which
> appears to be a novel implementation so more intensive testing will
> likely be needed.

Related to this part:

I would like to have in statsmodels a collection of commonly used
examples processes, and I'd like to add the Marron, Wand examples
there.

For kernel regression, I started with this during the nonparametric merge review
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/nonparametric/dgp_examples.py
https://groups.google.com/d/topic/pystatsmodels/itS_DyPHLA8/discussion

(some of those examples show that also for kernel regression we need
adaptive bandwidth in cases with uneven "smoothness")

I had looked at the Marron Wand examples before, but IIRC, it was for
either orthogonal series or spline estimation.

Statsmodels is using cython to do the binning for the fft based kde,
but I never checked how much the speed gain is compared to
np.histogram for example. (Skipper's work)

Josef

>
> Thanks,
> Daniel
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev