[SciPy-Dev] Improving spatial.distance code and performance

Kane & Anna O'Donnell email.the.odonnells at gmail.com
Thu Sep 6 05:28:49 EDT 2018


Hi all,

TLDR; I'm wondering about a) porting the spatial.distance code to Cython,
and b) adding some performance optimizations, and I'd like the dev
community's input/feedback.

For context (though I don't really need you to read these to give the
feedback I'm looking for),

- original issue/proposal: https://github.com/scipy/scipy/issues/9205
- PR: https://github.com/scipy/scipy/pull/9218

Before submitting the PR, I naively thought it was going to be nice Cython
or similar. Turns out it's some pretty old code, that I found pretty hard
to wrap my head around and understand. I eventually figured it out after
spending ages debugging a nastily hidden 'bug', and demonstrated the
performance optimizations, but it prompted the discussion about whether it
was best to port everything to Cython first.

*Existing stuff to Cython*

Doing so shouldn't be too hard, and it shouldn't change any functionality,
except to replace the distance functions with their Cython ones (instead of
the current approach, where the distance functions are actually numpy
things, and there's not supported access to the underlying C stuff). A few
possible 'bugs' (as above) should hopefully become non-issues too. So, it'd
be a win for performance (e.g. all the distance functions will be much
faster), and code quality, and future maintainability and development.
However, things to think about:

- should I just replace like-for-like, or consider some nicer OOP stuff
like e.g. sklearn's approach (which is already Cython)?
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx
(I'm guessing the reason they rolled their own was because they found scipy
lacking, as above.) In fact, we could largely just copy that file over. Not
sure about the interplay between scipy and scikit learn though.
- what's the best way to structure a PR? One metric at a time, or the whole
caboodle?

*Adding performance optimizations and future functionality*

As per the links, this was initially about providing a nifty performance
hack. It should still be pretty easy to implement. Personally, I think it
makes sense to implement after the Cythonization - unless the community are
against that.

However, there are other possibilities:

- using 'indices' within calculations. E.g. when doing pdist, it might pay
to use a tree of some description. I also proposed another 'index' to
optimize the 'bail early' approach further (which would, I think, actually
work well with trees too). This would involve more API changes, possibly
significant.
- using approximate searches (e.g. Faiss). My understanding is that
depending on other libraries probably isn't really an option, so I'm not
sure what means.
- maybe other cool approaches like https://github.com/droyed/eucl_dist
- providing a way to 'tune' distance computations to be optimal to your
particular dataset and constraints (e.g. my 'bail early' optimization might
be a lot faster or a bit slower, depending on your data ... or you might be
OK with approximate matching with a 'low' error rate, etc.)

I guess what I'd like to see is a single place where users can get access
to everything related to distance metrics and their uses, including all
sorts of optimizations etc. (possibly for different hardware, and possibly
distributed). To do that well is a pretty big undertaking, and I don't know
whether it's suited to scipy - e.g. maybe scipy doesn't really care about
distance stuff, or only wants to stick with 'simple' distance metric cases
(e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a
completely new python package - which would probably be a lot easier to
develop as I'd have a lot more flexibility (e.g. to depend on other
packages, and not have to worry about breaking the scipy API etc.). On the
other hand (as discussed in the latest comment on the PR), that might not
be best - it might never get used/maintained etc.

So, what does the community think is the best approach? I've got too little
context of what scipy is and what it's aiming for, and I don't want to head
off on the wrong tack. Comments on any of the other implied questions would
also be appreciated.

Thanks,

kodonnell
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20180906/af09148b/attachment-0001.html>


More information about the SciPy-Dev mailing list