[SciPy-Dev] Expanding Scipy's KDE functionality

Jake Vanderplas vanderplas at astro.washington.edu
Wed Jan 23 15:30:19 EST 2013


Hi Daniel,
That looks like a nice implementation.  My concern about adding it to 
scipy is twofold:

1) Is this a well-known and well-proven technique, or is it more 
cutting-edge?  My view is that scipy should not seek to implement every 
cutting-edge algorithm: in the long run this will lead to code bloat and 
difficulty of maintenance.  If the method is still cutting-edge, your 
code might be a better fit for statsmodels or another more specialized 
package.

2) The algorithm seems limited to one or maybe two dimensions. 
scipy.stats.gaussian_kde is designed for N dimensions, so it might be 
difficult to find a fit for this bandwidth selection method. One option 
might be to allow this bandwidth selection method via a flag in 
scipy.stats.gaussian_kde, and raise an error if the dimensionality is 
too high.  To do that, your code would need to be reworked fairly 
extensively to fit in the gaussian_kde class.
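
As a rough sketch of what that could look like (not a worked-out 
design): gaussian_kde already accepts a callable for bw_method, which is 
called with the gaussian_kde instance and returns a scalar factor. The 
name diffusion_bandwidth, the MAX_DIM limit, and the placeholder return 
value are all assumptions for illustration; the real diffusion-based 
selector would replace the placeholder.

```python
import numpy as np
from scipy import stats

MAX_DIM = 1  # assumed limit; the 2D version might raise this to 2


def diffusion_bandwidth(kde):
    """Hypothetical bw_method callable with a dimensionality guard."""
    if kde.d > MAX_DIM:
        raise ValueError(
            "diffusion bandwidth selection supports at most %d dimension(s), "
            "got %d" % (MAX_DIM, kde.d)
        )
    # Placeholder only: this is Scott's rule, NOT the diffusion estimator.
    return kde.n ** (-1.0 / (kde.d + 4))


rng = np.random.default_rng(0)

# 1D data: the callable is accepted and sets kde.factor.
kde = stats.gaussian_kde(rng.normal(size=200), bw_method=diffusion_bandwidth)

# 3D data: the guard raises before any estimate is built.
try:
    stats.gaussian_kde(rng.normal(size=(3, 200)), bw_method=diffusion_bandwidth)
    raised = False
except ValueError:
    raised = True
```

A string flag such as bw_method='botev' would need changes inside 
gaussian_kde itself; the callable route above needs none.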

I'd like other devs to weigh in about the algorithm, especially my 
concern #1, before any work starts on a scipy PR.  Thanks,
    Jake

On 01/23/2013 12:11 PM, Daniel Smith wrote:
> Hello,
>
> This was started on a different thread, but I thought I would post a
> new thread focused on this. Currently, I have some existing code that
> implements the bandwidth selection algorithm from:
>
> Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density
> estimation via diffusion. The Annals of Statistics, 38(5):2916-2957,
> 2010.
>
> Zdravko Botev implemented the code in MATLAB, which can be found here:
>
> http://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator
>
> My code for that is here:
>
> https://github.com/Daniel-B-Smith/KDE-for-SciPy
>
> I assume I probably need to find a workaround to avoid the float128 in
> the function fixed_point before I can add it to SciPy. I wrote the
> code a couple of years ago, so it will take me a moment to map out the
> best workaround (there is a very large number being multiplied by a
> very small number). I can also add the 2D version once I start
> integrating with SciPy. I have a couple of questions remaining. First,
> should I implement this in SciPy? StatsModels? Both? Secondly, can I
> use Cython to generate C code for the function fixed_point? Or do I
> need to write it up in the NumPy C API?
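
On the float128 point: one common workaround when a very large factor 
multiplies a very small one is to combine them in log space, so that 
neither factor is ever formed on its own. The sketch below is generic and 
does not reproduce fixed_point; the function name and the x**s * 
exp(-rate) form are assumptions chosen only to illustrate the idea.

```python
import numpy as np


def power_times_decay(x, s, rate):
    """Return x**s * exp(-rate), combining the exponents in log space."""
    return np.exp(s * np.log(x) - rate)


x, s, rate = 1e50, 7, 800.0

# Naive float64 evaluation: x**s overflows to inf and exp(-rate)
# underflows to 0, so the product is not a usable number.
with np.errstate(over="ignore", under="ignore", invalid="ignore"):
    naive = np.float64(x) ** s * np.exp(-rate)

# Log-space evaluation stays within float64 range.
stable = power_times_decay(x, s, rate)
```

Whether this applies directly depends on where the extended precision is 
actually needed in the code.
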
>
> If there is somewhere else I should post this and/or someone I should
> directly contact, I would greatly appreciate it.
>
> Thanks,
> Daniel




