[SciPy-dev] Ball Tree code (ticket 1048)

Jake VanderPlas jakevdp at gmail.com
Tue Nov 17 14:24:23 EST 2009


> > I wrote to the list about this a couple weeks ago, but there hasn't
> > been any activity. ?At the link below, there is a tarball with a
> > distutils setup script, and a simple script that compares the
> > performance of BallTree with scipy.spatial.cKDTree. ?This is my first
> > submission to scipy - I'd love some feedback if you get a chance!
> > http://projects.scipy.org/scipy/ticket/1048

>I took a quick look at the code. I have a few comments:

>* The code is specific to Euclidean distances.

>* The code is specific to R**n.

>* There's no approximate searching.

>* The way the tree is constructed is very nearly a kd-tree: you
>subdivide the dimension with the largest span. The only difference
>between this and a kd-tree is that rather than store a cut plane at
>each node, you store a ball containing the children of that node.

>In principle, ball trees (and similar structures) can work on
>arbitrary metric spaces. I don't know how important that feature is to
>anyone; maybe anyone who needs such a thing needs a smarter algorithm
>than ball trees (e.g. cover trees).

>Given the last point above, I don't really understand why your code is
>faster than cKDTree. Is the difference in how the tree is constructed
>(I think cKDTree uses a different heuristic to find the cut plane) or
>in the fact that you keep track of a ball containing each node? If the
>former, then it might make more sense to add an option of  a different
>heuristic to cKDTree. If the latter, it might make more sense to use a
>different heuristic in the ball tree code - the traditional one is
>partitioning the data into inside/outside a ball at each level of the
>tree, which might allow faster construction, and faster search
>(because the balls might shrink faster as you walk down the tree).

>Overall, it's hard to argue with something that works and is faster in
>concrete cases. But I'm not sure how general the code is.

>Anne

True, the code is not very general.  I wrote it for a specific use
case, using the most general constuction algorithm in the paper linked
below.  Though it is constructed in a way similar to a KD tree, you
gain speed in traversing the tree because your nodes define a much
smaller region than the nodes in a KD-tree, especially as the
dimension gets very large.
You're correct in saying that a Ball Tree can be built upon a general
distance metric, as long as it satisfies the triangle inequality.  It
would be nice to be able to pass any distance metric, for instance one
of the scipy.spatial.distance functions, to use in the ball tree.  I
stopped short of this, because I didn't want to introduce the overhead
of a python function call within the search.  For my own application,
speed was much more important than generality.

We could make the Ball Tree more general without sacrificing speed -
what we would need is to have a library of C distance metrics, wrapped
by python, such that passing one of these functions to the search tree
wrapper would allow the optimized C code direct access to the C
function itself.  Perhaps something similar to the _cpointer attribute
of f2py functions?

paper link: ftp://ftp.icsi.berkeley.edu/pub/techreports/1989/tr-89-063.pdf

-Jake



More information about the SciPy-Dev mailing list