[SciPy-Dev] Cover trees for nearest neighbors in general metric space

Patrick Varilly patvarilly at gmail.com
Sat Mar 10 16:26:19 EST 2012


On Sat, Mar 10, 2012 at 8:00 PM, Ralf Gommers
<ralf.gommers at googlemail.com>wrote:

>
>
> On Fri, Mar 9, 2012 at 2:23 AM, Patrick Varilly <patvarilly at gmail.com>wrote:
>
>> Dear all,
>>
>> Following up from the conversation with Emanuele Olivetti and Jake
>> VanderPlas, I've implemented [0] a drop-in replacement for
>> scipy.spatial.kdtree that uses cover trees instead of kd-trees to answer
>> nearest neighbor queries in a general metric space.  To me, this is useful
>> for finding nearby particles in a 3D periodic box in the context of
>> molecular simulations, but I'm sure it's more generally useful.  It
>> addresses the same problem that Jake's BallTrees code addresses in
>> scikit-learn, but I've done my best to reproduce the API of
>> scipy.spatial.kdtree in order to make this code mostly painless to use.  In
>> particular, kd-tree's useful function for finding all the points in one
>> tree that are neighbors of every point in another tree (ironically,
>> "query_ball_tree") is also implemented here for cover trees.  I modified
>> kdtree's extensive unit test to use cover trees, and the code passes it.
>>
>
> Out of interest, are you planning to propose this for inclusion in scipy
> or scikit-learn once it's done, or keep it as a standalone package?
>

Eventually, I'd like to propose it for inclusion in scipy, since this
functionality is not exclusive to machine learning.  I'm using it for
molecular simulations, and didn't even know that nearest-neighbor queries
were useful in machine learning!  I would never have though of looking in
scikit-learn for this.  But I would like to address the two outstanding
issues (vectorized distances and Cython implementation) before proposing it
for inclusion.  On that same vein, I don't know why BallTree is in
scikit-learn and not scipy, for the same reasons.


> This is definitely work in progress, and its performance could certainly
>> be improved.  But since it may already be useful to others, I'd like to put
>> it out there to get some early feedback.  Two things that it *doesn't* do
>> yet are use any vectorized form of a distance function (so every distance
>> calculation between two points costs one Python function call), and there's
>> no speedy Cython version yet.
>>
>
> Jaakko Luttinen proposed just two days ago to expose the distance
> functions in scipy/spatial/src/distance.c. That may also be useful for you.
>

Thanks, I'll look into it when I get a chance.


> Finally, in reading carefully through kdtree.py, I found a clear bug in
>> the code, whereby the "eps" parameter for approximate queries doesn't get
>> forwarded from the externally visible function "query" to the internal
>> function "__query" that does the work.
>>
>
> Could you open a ticket for that with a little more detail? Or if you feel
> like it, a pull request would be even better:)\
>

I sent in a pull request for it.  It's the first time I do this, so
apologies in advance if I somehow screwed it up.

All the best,

Patrick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20120310/31f6a55c/attachment.html>


More information about the SciPy-Dev mailing list