[SciPy-Dev] Expanding Scipy's KDE functionality

Thu Jan 24 12:09:46 EST 2013

On Thu, Jan 24, 2013 at 11:55 AM, Daniel Smith
<smith.daniel.br at gmail.com> wrote:
>  <josef.pktd <at> gmail.com> writes:
>
>>
>> On Thu, Jan 24, 2013 at 11:41 AM, Daniel Smith
>> <smith.daniel.br <at> gmail.com> wrote:
>> >  <josef.pktd <at> gmail.com> writes:
>> >
>> >>
>> >> On Thu, Jan 24, 2013 at 9:49 AM, Daniel Smith <smith.daniel.br <at> gmail.com> wrote:
>> >> > Ok, let's see if I can respond to everyone's comments.
>> >> >
>> >> > >From Jake:
>> >> >
>> >> >> 2) The algorithm seems limited to one or maybe two dimensions.
>> >> >> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
>> >> >> difficult to find a fit for this bandwidth selection method. One option
>> >> >> might be to allow this bandwidth selection method via a flag in
>> >> >> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
>> >> >> too high.  To do that, your code would need to be reworked fairly
>> >> >> extensively to fit in the gaussian_kde class.
>> >> >
>> >> > In principal, this method can be applied in N dimensions. However, I
>> >> > think it would be unwise to do so. The method requires that you
>> >> > simultaneously estimate the density and the bandwidth. Because of
>> >> > that, you have to implement the method on some mesh, and mesh size
>> >> > grows exponentially with the number of dimensions. The code certainly
>> >> > could be reworked to work in N dimensions, but I don't think it would
>> >> > be effective enough to be worth the effort. The results are also
>> >> > primarily used for visualization, which is useless beyond 2-d.
>> >> >
>> >> > >From Ralf:
>> >> >
>> >> >> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation
>> >> method but an adaptive density estimator.
>> >> >
>> >> > It's both. The bandwidth estimate falls out of the density estimate.
>> >> > That bandwidth estimate could be easily used to generate an estimate
>> >> > on a different mesh.
>> >> >
>> >> >> My suggestion would be to implement the density estimator and do a good amount of
>> > performance testing, at
>> >> least show that the performance is as good as described in table 1 of the paper.
>> >> >
>> >> > I can certainly do that. I will post here when the tests are up and running.
>> >> >
>> >> > >From Barbier de Reuille Pierre:
>> >> >
>> >> >> It should be easy to separate them and use the estimation of the bandwidth without the density
>> > estimation.
>> >> >
>> >> > Unfortunately, that is not the case. The bandwidth estimate is
>> >> > generated from a fixed point calculation based on the norm of a
>> >> > derivative of the estimated density. Unless I am missing something, it
>> >> > would not be possible to estimate that derivative without an explicit
>> >> > density estimate. Fourier coordinates are used because the derivative
>> >> > estimate is simpler in those coordinates.
>> >> >
>> >> >> For example, as stated in the paper, the method is equivalent to a reflexion method with a
>> > gaussian
>> >> kernel. But the renormalisation method gives very similar results too, without enforcing
>> >> >> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
>> >> >
>> >> > I have not currently implemented any boundary corrections, but it
>> >> > would not be difficult to implement the renormalization method using
>> >> > the bandwidth estimate from this method. It would require a second
>> >> > density estimate, but the estimate would be much, much better than the
>> >> > current code.
>> >> >
>> >> >> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-
>> > domains
>> >> (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
>> >> >
>> >> > In fact, the method currently only works on unbounded domains. The
>> >> > exact domain you calculate the density on is an optional parameter to
>> >> > the density estimator function. The actual domain you calculate on has
>> >> > to be finite because a finite mesh is needed.
>> >> >
>> >> >> I have a different concern though: is it normal that the density returned by the method is not
>> > normalized?
>> >> (i.e. the integral of the density is far from being one).
>> >> >
>> >> > That's a bug. I can fix that with one line of code. I have always just
>> >> > used the density estimate without units, because they aren't
>> >> > particularly informative. However, the output should be normalized, or
>> >> > at least a flag included to make it so.
>> >> >
>> >> >
>> >> >
>> >> > It seems like the next step is to set up a testing regime for
>> >> > comparison to the two existing methods to compare speed and reproduce
>> >> > the data from Table 1 in the paper. Also, it seems likely that
>> >> > statsmodels is the more appropriate setting for this project. In
>> >> > particular, I want to generalize the method to periodic domains, which
>> >> > appears to be a novel implementation so more intensive testing will
>> >> > likely be needed.
>> >>
>> >> Related to this part:
>> >>
>> >> I would like to have in statsmodels a collection of commonly used
>> >> examples processes, and I'd like to add the Marron, Wand examples
>> >> there.
>> >
>> > Do you have existing code in statsmodels for this? If I'm already
>> > writing up such a thing for testing, it's worth
>> >  my time to integrate it into statsmodels. A lot of things are already
>> > in np.random, but I could
>> >  extend that in  statsmodels with the examples from the Botev paper
>> > and those from the Marron
>> >  and Wand paper.
>>
>> Nothing yet, I have used a few examples before to try out things, but
>> they are spread over some uncommitted scripts. The idea to add them
>> more systematically for reuse is only recent.
>>
>> Thanks,
>> Josef
>
> Ok. If you suggest a module name, I can start to write such a thing. I
> assume it is ok to base it on numpy.random?

Here is an example how I used it
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/ex_kernel_regression_dgp.py
There are more xxx_dgp.py examples, where I had to work around some
limits in the current design.

``dgp`` for data generating process

how about ``dgp_density.py``?
We put the current module in the sandbox during the merge, because we
still need to adjust it as we get new use cases.

numpy.random is fine in your case, since it's normal mixtures.
For the regression case, I wanted to have a more flexible option for
choosing the distribution of x and the noise.

If part of the discussion gets more statsmodels specific, then I think
it would be more appropriate to take those to the statsmodels list. We
could still keep general kde improvements here.

Josef

>
>>
>> >
>> >>
>> >> For kernel regression, I started with this during the nonparametric merge review
>> >>
>> > https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/nonparametric/dgp_e
>> > xamples.py
>> >> https://groups.google.com/d/topic/pystatsmodels/itS_DyPHLA8/discussion
>> >>
>> >> (some of those examples show that also for kernel regression we need
>> >> adaptive bandwidth in cases with uneven "smoothness")
>> >>
>> >> I had looked at the Marron Wand examples before, but IIRC, it was for
>> >> either orthogonal series or spline estimation.
>> >>
>> >> Statsmodels is using cython to do the binning for the fft based kde,
>> >> but I never checked how much the speed gain is compared to
>> >> np.histogram for example. (Skipper's work)
>> >>
>> >> Josef
>> >>
>> >> >
>> >> > Thanks,
>> >> > Daniel
>> >> > _______________________________________________
>> >> > SciPy-Dev mailing list
>> >> > SciPy-Dev <at> scipy.org
>> >> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>> >>
>> > _______________________________________________
>> > SciPy-Dev mailing list
>> > SciPy-Dev <at> scipy.org
>> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev