[SciPy-Dev] Cover trees for nearest neighbors in general metric space

Ralf Gommers ralf.gommers at googlemail.com
Sat Mar 10 15:00:06 EST 2012


On Fri, Mar 9, 2012 at 2:23 AM, Patrick Varilly <patvarilly at gmail.com>wrote:

> Dear all,
>
> Following up from the conversation with Emanuele Olivetti and Jake
> VanderPlas, I've implemented [0] a drop-in replacement for
> scipy.spatial.kdtree that uses cover trees instead of kd-trees to answer
> nearest neighbor queries in a general metric space.  To me, this is useful
> for finding nearby particles in a 3D periodic box in the context of
> molecular simulations, but I'm sure it's more generally useful.  It
> addresses the same problem that Jake's BallTrees code addresses in
> scikit-learn, but I've done my best to reproduce the API of
> scipy.spatial.kdtree in order to make this code mostly painless to use.  In
> particular, kd-tree's useful function for finding all the points in one
> tree that are neighbors of every point in another tree (ironically,
> "query_ball_tree") is also implemented here for cover trees.  I modified
> kdtree's extensive unit test to use cover trees, and the code passes it.
>

Out of interest, are you planning to propose this for inclusion in scipy or
scikit-learn once it's done, or keep it as a standalone package?


> The code is informed by what I learned from reading Emanuele's PyCoverTree
> (original authors Thomas Kollar and Nil Geisweiller), but at the end of the
> day, the final internal representation and algorithm implementations are
> quite different.  For example, the tree is quickly built in one go at the
> beginning using the Batch Construction algorithm in the cover tree paper,
> but is then immutable.  For running a query, the algorithm is implemented
> in a way that's much closer to what is done in a kd-tree, instead of how
> it's presented in the paper, which IMHO is somewhat cumbersome.
>
> This is definitely work in progress, and its performance could certainly
> be improved.  But since it may already be useful to others, I'd like to put
> it out there to get some early feedback.  Two things that it *doesn't* do
> yet are use any vectorized form of a distance function (so every distance
> calculation between two points costs one Python function call), and there's
> no speedy Cython version yet.
>

Jaakko Luttinen proposed just two days ago to expose the distance functions
in scipy/spatial/src/distance.c. That may also be useful for you.


> I'm hoping to learn enough Cython over the coming weeks to address both of
> these shortcomings.
>
> Finally, in reading carefully through kdtree.py, I found a clear bug in
> the code, whereby the "eps" parameter for approximate queries doesn't get
> forwarded from the externally visible function "query" to the internal
> function "__query" that does the work.
>

Could you open a ticket for that with a little more detail? Or if you feel
like it, a pull request would be even better:)

Thanks,
Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20120310/f9ff34c5/attachment.html>


More information about the SciPy-Dev mailing list