[SciPy-Dev] Expanding Scipy's KDE functionality

Daniel Smith smith.daniel.br at gmail.com
Fri Jan 25 11:36:56 EST 2013


Barbier de Reuille Pierre

> About this: this is incorrect. As you work with a DCT, it is equivalent to repeating the data
> on both sides by reflection, which means your method is equivalent to the reflection
> method. Note this is pointed out in the paper itself. That being said, if there is
> enough "padding" on both sides (i.e. such that the tail of the kernel is almost 0), this has
> no effect. Also, you can replace the DCT with an FFT to get a cyclic density. I
> adapted your code for this and it works great!

You are correct. I had always ended up with padding on each side and
had gotten nonsense near the boundary. Once I fixed the boundary
handling, it gave me nice answers. Could you send me your code for
the cyclic density? I do some molecular dynamics work, and it would be
really useful for making angular density plots.
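In the meantime, here is a rough sketch of how I picture the FFT
version (just my guess at the approach, not your adaptation): bin the
angles over one period and apply the Gaussian kernel's characteristic
function in Fourier space, so the smoothing wraps around the boundary
instead of reflecting.

import numpy as np

def cyclic_kde(angles, bandwidth, n_grid=512):
    # Sketch of an FFT-based cyclic estimate on [0, 2*pi): bin the angles
    # over one period and smooth with a wrapped Gaussian kernel in Fourier
    # space, so mass near 0 and 2*pi sees itself across the boundary.
    theta = np.mod(np.asarray(angles, dtype=float), 2 * np.pi)
    hist, edges = np.histogram(theta, bins=n_grid,
                               range=(0.0, 2 * np.pi), density=True)
    freqs = np.fft.rfftfreq(n_grid, d=2 * np.pi / n_grid)  # cycles per radian
    kernel_cf = np.exp(-0.5 * (2 * np.pi * freqs * bandwidth) ** 2)
    density = np.fft.irfft(np.fft.rfft(hist) * kernel_cf, n=n_grid)
    grid = 0.5 * (edges[:-1] + edges[1:])                  # bin centres
    return grid, density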

> Back on the computation of the bandwidth, I would argue that you can compute it
> without computing the density itself. It's true that it makes sense to combine the
> binning as it is useful for both, but I don't agree that it's necessary.

Let me rephrase my sentiment. I think we can't calculate the bandwidth
without the moral equivalent of calculating the density. Basically, we
need a mapping from our set of samples plus our bandwidth to the
squared norm of the n'th derivative. Last night, I came up with a far
more efficient method that I think demonstrates the moral equivalence.
With some clever calculus, we can write down the mapping from the
samples plus bandwidth to the j'th DCT (or Fourier) component. We can
simply iterate over the DCT components until the change in the
derivative estimate falls below some threshold. That saves us the
histogramming step (not that important), but it also means we almost
assuredly don't need 2**14 DCT components. For all intents and
purposes, we have also constructed an estimate of the density in our
DCT components. Without working through the math exactly, I think
every representation of our data which allows us to estimate the
density derivative is going to be equivalent, up to isometry, to the
density itself.
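
To make that concrete, here is a rough sketch of the kind of loop I
have in mind (the Gaussian kernel, the rescaling to [0, 1], and the
overall normalization are my own choices and would need checking
against whatever convention the bandwidth rule uses): the j'th
empirical cosine coefficient comes straight from the samples, the
bandwidth just multiplies it by the kernel's characteristic function,
and the squared derivative norm is a single sum over j that we can
truncate once the terms stop contributing.

import numpy as np

def deriv_norm_sq(samples, h, n=2, tol=1e-12, max_terms=2**14):
    # Sketch only: estimate the squared norm of the n'th derivative of the
    # Gaussian KDE directly from the samples' cosine coefficients, adding
    # DCT terms until they stop contributing.  The overall constant depends
    # on the DCT normalization, so treat the result as "up to a factor".
    x = np.asarray(samples, dtype=float)
    xmin, xmax = x.min(), x.max()
    L = xmax - xmin
    u = (x - xmin) / L                                 # rescale data onto [0, 1]
    j_peak = np.sqrt(n) * L / (np.pi * h)              # terms only decay past this j
    total = 0.0
    for j in range(1, max_terms):
        a_j = np.mean(np.cos(np.pi * j * u))           # j'th empirical cosine coefficient
        smooth = np.exp(-0.5 * (np.pi * j * h / L) ** 2)   # kernel in transform space
        term = (np.pi * j / L) ** (2 * n) * (a_j * smooth) ** 2
        total += term
        if j > j_peak and term < tol * total:          # truncate once the tail is negligible
            break
    return total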

All that is neither here nor there, but certainly let me know if you
have an idea how we could do such a calculation. I would be very
interested in finding out that I'm wrong on this point.

Josef:

> Besides the boundary problem in bounded domains, there is also the
> problem with unbounded domains, that the tails might not be well
> captured by a kde, especially with heavier tails.

You are absolutely correct, but that is another problem completely.
Let me know if you implement the Pareto tails idea.
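For what it's worth, my mental picture of the Pareto tails idea is
something like the following (the threshold choice and the splicing
rule here are entirely made up, so this is only a guess at what you
have in mind): keep the KDE for the body of the data and swap in a
fitted generalized Pareto density beyond a high quantile.

import numpy as np
from scipy import stats

def density_with_pareto_tail(samples, x, tail_quantile=0.95):
    # Guess at the "Pareto tails" idea: Gaussian KDE for the bulk, a fitted
    # generalized Pareto density for the upper tail.  No attempt is made
    # here to match the two pieces smoothly at the threshold.
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    u = np.quantile(samples, tail_quantile)            # tail threshold
    body = stats.gaussian_kde(samples)                 # KDE for the bulk
    exceed = samples[samples > u] - u
    c, loc, scale = stats.genpareto.fit(exceed, floc=0.0)
    dens = body(x)
    in_tail = x > u
    # weight the tail piece by the empirical tail probability
    dens[in_tail] = (1.0 - tail_quantile) * stats.genpareto.pdf(
        x[in_tail] - u, c, loc=loc, scale=scale)
    return dens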

> how about ``dgp_density.py``?
> We put the current module in the sandbox during the merge, because we
> still need to adjust it as we get new use cases.

Cool. I'll get started on this over the weekend. Also, numpy.random
has a whole bunch of distributions. We'll just need to combine them in
clever ways to get our example distributions.
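For instance (purely illustrative; the mixture weights and the helper
name are made up), a bimodal test density could be built like this:

import numpy as np

def sample_bimodal_mixture(n, rng=None):
    # Example of combining numpy.random distributions into a test case:
    # a two-component normal mixture, useful for exercising the bandwidth
    # selection on a clearly non-Gaussian target.
    rng = np.random.default_rng(rng)
    narrow = rng.random(n) < 0.3                  # 30% from the narrow mode
    return np.where(narrow,
                    rng.normal(-2.0, 0.3, size=n),
                    rng.normal(1.0, 1.0, size=n))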

Thanks,
Daniel


