[SciPy-Dev] Expanding Scipy's KDE functionality

Fri Jan 25 13:13:53 EST 2013

On Fri, Jan 25, 2013 at 11:36 AM, Daniel Smith
<smith.daniel.br at gmail.com> wrote:
> Barbier de Reuille Pierre:
>
>> About this: this is incorrect. Because you work with a DCT, your method is equivalent to
>> repeating the data on both sides by reflection, which means it is equivalent to the
>> reflection method. Note that this is pointed out in the paper itself. That being said, if
>> there is enough "padding" on both sides (i.e. such that the tail of the kernel is almost 0),
>> this has no effect. Also, you can replace the DCT with an FFT to get a cyclic density. I
>> adapted your code for this and it works great!
>
> You are correct. I had always ended up having padding on each side and
> gotten nonsense near the boundary. When I fixed the boundary
> correctly, it gave me nice answers. Could you send me your code for
> the cyclic density? I do some molecular dynamics work, and it would be
> really useful for making angular density plots.
>
>> Back on the computation of the bandwidth, I would argue that you can compute it
>> without computing the density itself. It's true that it makes sense to combine the
>> binning, as it is useful for both, but I don't agree that it's necessary.
>
> Let me rephrase my sentiment. I think we can't calculate the bandwidth
> without the moral equivalent of calculating the density. Basically, we
> need a mapping from our set of samples plus our bandwidth to the
> square norm of the n'th derivative. Last night, I came up with a far
> more efficient method that I think demonstrates the moral equivalence.
> With some clever calculus, we can write down the mapping from the
> samples plus bandwidth to the j'th DCT (or Fourier) component. We can
> simply iterate over the DCT components until the change in the
> derivative estimate falls below some threshold. That saves us the
> histogramming step (not that important), but it also means we almost
> assuredly don't need 2**14 DCT components. For all intents and
> purposes, we have also constructed an estimate of the density in our
> DCT components. Without working through the math exactly, I think
> every representation of our data which allows us to estimate the
> density derivative is going to be equivalent, up to isometry, to the
> density itself.
>
> All that is neither here nor there, but certainly let me know if you
> have an idea how we could do such a calculation. I would be very
> interested in finding out that I'm wrong on this point.
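
For reference, a minimal sketch of the iteration described above, not
Daniel's actual code. It assumes the samples are already rescaled to
[0, 1], a Gaussian kernel with bandwidth h, and reflecting boundaries, so
the j'th cosine coefficient is 2 * mean(cos(j * pi * x_i)). The names
cosine_coefficient and derivative_sq_norm are made up for illustration,
and the stopping rule uses an upper bound on the next term (since
|a_j| <= 2) rather than the raw change in the estimate, so that a
coefficient that happens to be near zero does not stop the loop early.

```python
import numpy as np


def cosine_coefficient(samples, j):
    """j-th cosine-series coefficient of the empirical density on [0, 1]."""
    if j == 0:
        return 1.0
    return 2.0 * np.mean(np.cos(j * np.pi * samples))


def derivative_sq_norm(samples, h, s, tol=1e-10, max_terms=2**14):
    """Estimate the squared L2 norm of the s-th derivative of the
    Gaussian-smoothed density (bandwidth h), with no histogram step.

    Returns (estimate, number_of_terms_used).
    """
    total = 0.0
    for j in range(1, max_terms + 1):
        # Weight from differentiating s times and smoothing with bandwidth h.
        w = (j * np.pi) ** (2 * s) * np.exp(-((j * np.pi * h) ** 2))
        a_j = cosine_coefficient(samples, j)
        total += 0.5 * w * a_j ** 2
        # |a_j| <= 2, so each later term is at most 2 * w once w is
        # decreasing, which holds when (j * pi * h)**2 > s.
        if (j * np.pi * h) ** 2 > s and 2.0 * w < tol * total:
            break
    return total, j


rng = np.random.default_rng(0)
samples = rng.beta(2.0, 5.0, size=500)  # synthetic data, already in [0, 1]
estimate, n_terms = derivative_sq_norm(samples, h=0.05, s=2)
print(n_terms, estimate)
```

The coefficients, damped by exp(-(j * pi * h)**2 / 2), are also the
cosine series of the density estimate itself, which is the "moral
equivalence" point: the loop typically stops far short of 2**14 terms,
but what it has accumulated by then is enough to reconstruct the density.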

It would be useful, for me and maybe to others, if you could use
github to keep track of the different versions (your repo or gists).

I would like to see how the boundary and periodicity are affected by
the different fft and dct, since I bump into this also in other areas.
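
To make the question concrete, here is a rough sketch (not Pierre's or
Daniel's code) of the two boundary behaviours on a binned density on
[0, 1]. Filtering the FFT coefficients gives a cyclic estimate. Mirroring
the data first and applying the same filter gives reflecting boundaries,
which is mathematically the same as filtering DCT-II coefficients; it is
written with the FFT here only so that the sketch needs nothing beyond
NumPy. The function names, the bin count and the bandwidth are all
arbitrary choices for illustration.

```python
import numpy as np


def binned_density(samples, n_bins):
    """Histogram on [0, 1], scaled so the bin heights average to 1."""
    counts, _ = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))
    return counts * (n_bins / counts.sum())


def smooth_periodic(density, h, length=1.0):
    """Gaussian smoothing through the FFT: the domain wraps around."""
    k = np.arange(density.size // 2 + 1)
    coeffs = np.fft.rfft(density)
    # Fourier transform of a Gaussian kernel with standard deviation h.
    coeffs *= np.exp(-0.5 * (2.0 * np.pi * k * h / length) ** 2)
    return np.fft.irfft(coeffs, n=density.size)


def smooth_reflect(density, h):
    """Same filter after mirroring the data: reflecting boundaries."""
    mirrored = np.concatenate([density, density[::-1]])
    return smooth_periodic(mirrored, h, length=2.0)[: density.size]


# Synthetic data piled up against the left boundary.
rng = np.random.default_rng(0)
samples = rng.exponential(0.1, size=2000)
samples = samples[samples < 1.0]

density = binned_density(samples, 256)
reflect = smooth_reflect(density, h=0.05)
periodic = smooth_periodic(density, h=0.05)

print("left edge :", reflect[0], periodic[0])
print("right edge:", reflect[-1], periodic[-1])
```

Both versions keep the total mass. The periodic one lets mass from the
left edge leak around to the right edge, which is what you want for
angles and what you do not want for a bounded variable.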

Thanks,
Josef

>
> Josef:
>
>> Besides the boundary problem in bounded domains, there is also the
>> problem with unbounded domains, that the tails might not be well
>> captured by a kde, especially with heavier tails.
>
> You are absolutely correct, but that is another problem completely.
> Let me know if you implement the Pareto tails idea.
>
>> how about ``dgp_density.py``?
>> We put the current module in the sandbox during the merge, because we
>> still need to adjust it as we get new use cases.
>
> Cool. I'll get started on this over the weekend. Also, numpy.random
> has a whole bunch of distributions. We'll just need to combine them in
> clever ways to get our example distributions.
>
> Thanks,
> Daniel
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev