[SciPy-Dev] Cover trees for nearest neighbors in general metric space
Patrick Varilly
patvarilly at gmail.com
Sat Mar 10 16:26:19 EST 2012
On Sat, Mar 10, 2012 at 8:00 PM, Ralf Gommers
<ralf.gommers at googlemail.com>wrote:
>
>
> On Fri, Mar 9, 2012 at 2:23 AM, Patrick Varilly <patvarilly at gmail.com>wrote:
>
>> Dear all,
>>
>> Following up from the conversation with Emanuele Olivetti and Jake
>> VanderPlas, I've implemented [0] a drop-in replacement for
>> scipy.spatial.kdtree that uses cover trees instead of kd-trees to answer
>> nearest neighbor queries in a general metric space. To me, this is useful
>> for finding nearby particles in a 3D periodic box in the context of
>> molecular simulations, but I'm sure it's more generally useful. It
>> addresses the same problem that Jake's BallTrees code addresses in
>> scikit-learn, but I've done my best to reproduce the API of
>> scipy.spatial.kdtree in order to make this code mostly painless to use. In
>> particular, kd-tree's useful function for finding all the points in one
>> tree that are neighbors of every point in another tree (ironically,
>> "query_ball_tree") is also implemented here for cover trees. I modified
>> kdtree's extensive unit test to use cover trees, and the code passes it.
>>
>
> Out of interest, are you planning to propose this for inclusion in scipy
> or scikit-learn once it's done, or keep it as a standalone package?
>
Eventually, I'd like to propose it for inclusion in scipy, since this
functionality is not exclusive to machine learning. I'm using it for
molecular simulations, and didn't even know that nearest-neighbor queries
were useful in machine learning! I would never have though of looking in
scikit-learn for this. But I would like to address the two outstanding
issues (vectorized distances and Cython implementation) before proposing it
for inclusion. On that same vein, I don't know why BallTree is in
scikit-learn and not scipy, for the same reasons.
> This is definitely work in progress, and its performance could certainly
>> be improved. But since it may already be useful to others, I'd like to put
>> it out there to get some early feedback. Two things that it *doesn't* do
>> yet are use any vectorized form of a distance function (so every distance
>> calculation between two points costs one Python function call), and there's
>> no speedy Cython version yet.
>>
>
> Jaakko Luttinen proposed just two days ago to expose the distance
> functions in scipy/spatial/src/distance.c. That may also be useful for you.
>
Thanks, I'll look into it when I get a chance.
> Finally, in reading carefully through kdtree.py, I found a clear bug in
>> the code, whereby the "eps" parameter for approximate queries doesn't get
>> forwarded from the externally visible function "query" to the internal
>> function "__query" that does the work.
>>
>
> Could you open a ticket for that with a little more detail? Or if you feel
> like it, a pull request would be even better:)\
>
I sent in a pull request for it. It's the first time I do this, so
apologies in advance if I somehow screwed it up.
All the best,
Patrick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20120310/31f6a55c/attachment.html>
More information about the SciPy-Dev
mailing list