[SciPy-Dev] Enhancements to scipy.spatial.cKDTree

Sun Jul 15 20:00:35 EDT 2012

Wow, I must confess I wasn't quite expecting the Spanish Inquisition :-D

The original reason I got into this project was to bring a fast
query_ball_point to cKDTree (which I really needed, the KDTree
implementation is so slow as to be unusable for analysing a long molecular
dynamics trajectory).  When I started, I thought I was only porting over
one method of KDTree, so I strove to keep the code changes to a minimum
(hence not abstracting out the incremental updates to the minimum/maximum
distances between the query point and a node's bounding hyperrectangle) and
to keep a very similar style to the existing code (hence the int's,
double's, malloc/free, etc.).  I only went about porting over the rest of
the methods essentially to submit a complete enhancement to SciPy instead
of just the juicy bit I cared about (after parsing Anne's code, I figured
90% of the effort was done already).  I also agree that Anne's code is
really transparent, and should stay no matter what, but speed does matter a
fair bit to users of kd-trees.  And in my defense with respect to the code
being "full of bugs", I tried hard to make all the unit tests pass on my
machine before submitting and was completely unaware of the nightmares of
32 vs 64-bit ints in Windows 64.

That said, now that the initial effort is complete, it's obvious that a
clear pattern recurs in the kd-tree walking algorithms: keeping incremental
track of the minimum/maximum distances between two hyperrectangles as
they're successively split (alternatively, a hyperrectangle and a point).
If that primitive is abstracted, the code becomes a lot more readable
again.  I've had a first go at what such an abstraction might look like,
and updated the code to query_ball_tree.  Does this look like a way forward
for enhancing the code's maintainability?

Here are another couple of issues / questions:

* Is there any difference between <np.float64_t*>np.PyArray_DATA(my_array)
and &my_array[0] ?  If not, the second choice is clearly preferable.

* I get the issues with int vs np.int vs np.npy_intp, but is there really
any SciPy-supported platform in which double doesn't mean np.float64_t?

* As for memory management, it's probably easier to let NumPy & Python
handle most things, i.e. make Rectangle store two NumPy arrays.  This needs
a bit of surgery to do right, but I've tried to show what I mean in the
query_ball_tree version I'm submitting now.

* I fixed a small glitch with calls to realloc(): Sturla's changes for
handling memory errors ended up setting the first argument to realloc to
NULL, which effectively wipes out the contents of the data that were there
before the resize.

* There seems to be something funny about raw_mins and raw_maxes: they are
member variables of cKDTree, but the data they point to seems to me like it
should go out of scope at the end of cKDTree.__init__.  Am I
misunderstanding something here?

All the best,

Patrick

On Sun, Jul 15, 2012 at 12:29 AM, Sturla Molden <sturla at molden.no> wrote:

>  Den 15.07.2012 00:36, skrev Sturla Molden:
>
>
> Thanks. On Windows 64, query_pair returns bad results depending on the
> size of the tree. I don't know why but Patrick's code hade the same problem.
>
>
> Which brings me to the other issue I'd like to discuss...
>
> In its current state, Patrick's enhancement to cKDTree (IMHO) is close to
> unmaintainable. It might be 'fast', but it is not estetically nice looking
> code (no pun intended),  about 2500 lines of low-level C masquerading as
> Cython, very difficult to read and follow with respect to logic, it was
> full of bugs (and might still be), and still don't pass all the tests on my
> computer (even though I have been debugging it quite extensively).
>
> Who is going to maintain it? Patrick? Ralph? Anne? (I am not
> volunteering...)
>
> On the other hand, Anne's KDTree.py is an example of very beautiful Python
> code, and very easy to read and understand. KDTree is also robust and
> rather bug-free, after having been tested for quite a long time, and more
> or less feature-complete for a KDTree. But it is a bit slow, courtesy to
> Python...
>
> How fast do we need it to be?
>
> I think it might be better to just gently 'Cythonize' out the worst
> bottlenecks in KDTree.py, but leave the rest as it is. It might not be
> equally fast as Patrick's rewrite of cKDTree enhancement, but will be 100
> times easier to maintain, and probably "fast enough" for most usecases.
>
> cKDTree was intended as a very fast kd-tree for nearest-neighbor queries.
> It was never intended as general-purpose kd-tree like, well, KDTree.
> Currently in SciPy master it is very "lean and mean" for its particular
> purpose. (It could still use some minor updating, such as 64-bit support.)
>
>
> I think at least this should be discussed before this major rewrite is
> pulled into SciPy master. I am not saying Patrick's rewrite is bad anbs
> should not be pulled into SciPy. But I don't think we should drop code into
> SciPy if it will never be properly maintained. It must be maintainable for
> years ahead.
>
> Sturla
>
>
>
>
>
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20120716/5c533b86/attachment.html>