[SciPy-Dev] Expanding Scipy's KDE functionality

Thu Jan 24 11:55:15 EST 2013

 <josef.pktd <at> gmail.com> writes:

>
> On Thu, Jan 24, 2013 at 11:41 AM, Daniel Smith
> <smith.daniel.br <at> gmail.com> wrote:
> >  <josef.pktd <at> gmail.com> writes:
> >
> >>
> >> On Thu, Jan 24, 2013 at 9:49 AM, Daniel Smith <smith.daniel.br <at> gmail.com> wrote:
> >> > Ok, let's see if I can respond to everyone's comments.
> >> >
> >> > >From Jake:
> >> >
> >> >> 2) The algorithm seems limited to one or maybe two dimensions.
> >> >> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
> >> >> difficult to find a fit for this bandwidth selection method. One option
> >> >> might be to allow this bandwidth selection method via a flag in
> >> >> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
> >> >> too high.  To do that, your code would need to be reworked fairly
> >> >> extensively to fit in the gaussian_kde class.
> >> >
> >> > In principal, this method can be applied in N dimensions. However, I
> >> > think it would be unwise to do so. The method requires that you
> >> > simultaneously estimate the density and the bandwidth. Because of
> >> > that, you have to implement the method on some mesh, and mesh size
> >> > grows exponentially with the number of dimensions. The code certainly
> >> > could be reworked to work in N dimensions, but I don't think it would
> >> > be effective enough to be worth the effort. The results are also
> >> > primarily used for visualization, which is useless beyond 2-d.
> >> >
> >> > >From Ralf:
> >> >
> >> >> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation
> >> method but an adaptive density estimator.
> >> >
> >> > It's both. The bandwidth estimate falls out of the density estimate.
> >> > That bandwidth estimate could be easily used to generate an estimate
> >> > on a different mesh.
> >> >
> >> >> My suggestion would be to implement the density estimator and do a good amount of
> > performance testing, at
> >> least show that the performance is as good as described in table 1 of the paper.
> >> >
> >> > I can certainly do that. I will post here when the tests are up and running.
> >> >
> >> > >From Barbier de Reuille Pierre:
> >> >
> >> >> It should be easy to separate them and use the estimation of the bandwidth without the density
> > estimation.
> >> >
> >> > Unfortunately, that is not the case. The bandwidth estimate is
> >> > generated from a fixed point calculation based on the norm of a
> >> > derivative of the estimated density. Unless I am missing something, it
> >> > would not be possible to estimate that derivative without an explicit
> >> > density estimate. Fourier coordinates are used because the derivative
> >> > estimate is simpler in those coordinates.
> >> >
> >> >> For example, as stated in the paper, the method is equivalent to a reflexion method with a
> > gaussian
> >> kernel. But the renormalisation method gives very similar results too, without enforcing
> >> >> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
> >> >
> >> > I have not currently implemented any boundary corrections, but it
> >> > would not be difficult to implement the renormalization method using
> >> > the bandwidth estimate from this method. It would require a second
> >> > density estimate, but the estimate would be much, much better than the
> >> > current code.
> >> >
> >> >> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-
> > domains
> >> (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
> >> >
> >> > In fact, the method currently only works on unbounded domains. The
> >> > exact domain you calculate the density on is an optional parameter to
> >> > the density estimator function. The actual domain you calculate on has
> >> > to be finite because a finite mesh is needed.
> >> >
> >> >> I have a different concern though: is it normal that the density returned by the method is not
> > normalized?
> >> (i.e. the integral of the density is far from being one).
> >> >
> >> > That's a bug. I can fix that with one line of code. I have always just
> >> > used the density estimate without units, because they aren't
> >> > particularly informative. However, the output should be normalized, or
> >> > at least a flag included to make it so.
> >> >
> >> >
> >> >
> >> > It seems like the next step is to set up a testing regime for
> >> > comparison to the two existing methods to compare speed and reproduce
> >> > the data from Table 1 in the paper. Also, it seems likely that
> >> > statsmodels is the more appropriate setting for this project. In
> >> > particular, I want to generalize the method to periodic domains, which
> >> > appears to be a novel implementation so more intensive testing will
> >> > likely be needed.
> >>
> >> Related to this part:
> >>
> >> I would like to have in statsmodels a collection of commonly used
> >> examples processes, and I'd like to add the Marron, Wand examples
> >> there.
> >
> > Do you have existing code in statsmodels for this? If I'm already
> > writing up such a thing for testing, it's worth
> >  my time to integrate it into statsmodels. A lot of things are already
> > in np.random, but I could
> >  extend that in  statsmodels with the examples from the Botev paper
> > and those from the Marron
> >  and Wand paper.
>
> Nothing yet, I have used a few examples before to try out things, but
> they are spread over some uncommitted scripts. The idea to add them
> more systematically for reuse is only recent.
>
> Thanks,
> Josef

Ok. If you suggest a module name, I can start to write such a thing. I
assume it is ok to base it on numpy.random?

>
> >
> >>
> >> For kernel regression, I started with this during the nonparametric merge review
> >>
> > https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/nonparametric/dgp_e
> > xamples.py
> >> https://groups.google.com/d/topic/pystatsmodels/itS_DyPHLA8/discussion
> >>
> >> (some of those examples show that also for kernel regression we need
> >> adaptive bandwidth in cases with uneven "smoothness")
> >>
> >> I had looked at the Marron Wand examples before, but IIRC, it was for
> >> either orthogonal series or spline estimation.
> >>
> >> Statsmodels is using cython to do the binning for the fft based kde,
> >> but I never checked how much the speed gain is compared to
> >> np.histogram for example. (Skipper's work)
> >>
> >> Josef
> >>
> >> >
> >> > Thanks,
> >> > Daniel
> >> > _______________________________________________
> >> > SciPy-Dev mailing list
> >> > SciPy-Dev <at> scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> >>
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev <at> scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>