[SciPy-Dev] Expanding Scipy's KDE functionality

Wed Jan 23 17:40:09 EST 2013

On Wed, Jan 23, 2013 at 3:30 PM, Jake Vanderplas
<vanderplas at astro.washington.edu> wrote:
> Hi Daniel,
> That looks like a nice implementation.  My concern about adding it to
> scipy is twofold:
>
> 1) Is this a well-known and well-proven technique, or is it more
> cutting-edge?  My view is that scipy should not seek to implement every
> cutting-edge algorithm: in the long-run this will lead to code bloat and
> difficulty of maintenance.  If that's the case, your code might be a
> better fit for statsmodels or another more specialized package.

146 citations on Google Scholar for the paper since 2010, across many fields.
169 downloads of the MATLAB version in the last month.

The availability of the MATLAB code seems to be increasing the number of
citations, judging from a few examples I looked at.

So it looks popular and it works, even if it's new. Using FFT for KDE
is an old idea, but I haven't looked at the details yet.
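
For reference, the classic FFT approach to KDE can be sketched as follows:
bin the samples onto a regular grid, then convolve the bin counts with a
sampled Gaussian kernel using the FFT. This is only a minimal illustration
of that older idea, not the diffusion method from the Botev et al. paper and
not Daniel's code. The function name fft_kde_1d and its parameters are made
up for this example, and the bandwidth has to be supplied by the caller.

```python
import numpy as np


def fft_kde_1d(data, bandwidth, n_grid=1024, pad=3.0):
    """Approximate a 1-D Gaussian KDE by binning plus FFT convolution.

    Hypothetical helper for illustration only. Returns the grid of bin
    centers and the estimated density at those centers.
    """
    data = np.asarray(data, dtype=float)
    # Pad the grid by a few bandwidths on each side so that the circular
    # convolution wraps only the far tails of the kernel.
    lo = data.min() - pad * bandwidth
    hi = data.max() + pad * bandwidth

    # Step 1: bin the samples onto a regular grid.
    counts, edges = np.histogram(data, bins=n_grid, range=(lo, hi))
    dx = edges[1] - edges[0]
    centers = edges[:-1] + 0.5 * dx

    # Step 2: sample the Gaussian kernel at the grid offsets, ordered
    # 0, 1, ..., -2, -1 as a circular convolution expects.
    offsets = np.fft.fftfreq(n_grid, d=1.0 / n_grid) * dx
    kernel = np.exp(-0.5 * (offsets / bandwidth) ** 2)
    kernel /= bandwidth * np.sqrt(2.0 * np.pi)

    # Step 3: multiply in Fourier space, transform back, normalize by N.
    density = np.fft.irfft(np.fft.rfft(counts) * np.fft.rfft(kernel), n=n_grid)
    return centers, density / data.size
```

The cost is O(n_grid log n_grid) after binning, independent of the number
of samples, at the price of a small binning error that shrinks as the grid
spacing gets small relative to the bandwidth.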

>
> 2) The algorithm seems limited to one or maybe two dimensions.
> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
> difficult to find a fit for this bandwidth selection method. One option
> might be to allow this bandwidth selection method via a flag in
> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
> too high.  To do that, your code would need to be reworked fairly
> extensively to fit in the gaussian_kde class.

My guess is that it doesn't make much sense to merge it into gaussian_kde.
I doubt there will be much direct code sharing, and the implementation
differs quite a bit.
In statsmodels we have separate classes for univariate and multivariate KDE
(although most of the kernel density estimation and kernel regression
code in statsmodels is new and not settled yet).
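
For comparison, the "flag plus error" idea from point 2 could be tried
without touching gaussian_kde itself, since its bw_method argument already
accepts a callable that receives the gaussian_kde instance and returns a
scalar factor. The sketch below is hypothetical: low_dim_only and
placeholder_selector are names invented here, and the placeholder simply
reproduces Scott's factor where a real low-dimensional selector would go.

```python
import numpy as np
from scipy.stats import gaussian_kde


def low_dim_only(selector, max_dim=2):
    """Wrap a bandwidth selector so it rejects data above max_dim dimensions.

    Hypothetical helper; `selector` takes a gaussian_kde instance and
    returns the scalar factor that scales the data covariance.
    """
    def bw_method(kde):
        # gaussian_kde stores the dimensionality as `d` and the number of
        # points as `n` before the bandwidth callable is invoked.
        if kde.d > max_dim:
            raise ValueError(
                "this bandwidth selector supports at most %d dimensions, "
                "got %d" % (max_dim, kde.d)
            )
        return selector(kde)
    return bw_method


def placeholder_selector(kde):
    # Stand-in only: Scott's factor. A diffusion-based selector would be
    # substituted here; it is NOT implemented in this sketch.
    return kde.n ** (-1.0 / (kde.d + 4))


rng = np.random.default_rng(0)
kde_1d = gaussian_kde(rng.normal(size=200),
                      bw_method=low_dim_only(placeholder_selector))
```

Constructing gaussian_kde on a (3, 200) array with the same bw_method
raises the ValueError from inside the constructor.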

Josef
>
> I'd like other devs to weigh-in about the algorithm, especially my
> concern #1, before any work starts on a scipy PR.  Thanks,
>     Jake
>
> On 01/23/2013 12:11 PM, Daniel Smith wrote:
>> Hello,
>>
>> This was started on a different thread, but I thought I would post a
>> new thread focused on this. Currently, I have some existing code that
>> implements the bandwidth selection algorithm from:
>>
>> Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density
>> estimation via diffusion. The Annals of Statistics, 38(5):2916-2957,
>> 2010.
>>
>> Zdravko Botev implemented the code in MATLAB, which can be found here:
>>
>> http://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator
>>
>> My code for that is here:
>>
>> https://github.com/Daniel-B-Smith/KDE-for-SciPy
>>
>> I assume I probably need to find a workaround to avoid the float128 in
>> the function fixed_point before I can add it to SciPy. I wrote the
>> code a couple of years ago, so it will take me a moment to map out the
>> best workaround (there is a very large number being multiplied by a
>> very small number). I can also add the 2d-version once I start
>> integrating with SciPy. I have a couple of questions remaining. First,
>> should I implement this in SciPy? StatsModels? Both? Secondly, can I
>> use Cython to generate C code for the function fixed_point? Or do I
>> need to write it up in the NumPy C API?
>>
>> If there is somewhere else I should post this and/or someone I should
>> directly contact, I would greatly appreciate it.
>>
>> Thanks,
>> Daniel
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev