[SciPy-Dev] Improving spatial.distance code and performance

Kane & Anna O'Donnell email.the.odonnells at gmail.com
Sun Sep 16 04:26:18 EDT 2018


Sorry, flann and faiss were just examples (I haven't actually researched 
different libraries in depth).

 > ... approximate distances are probably best left to another package 
it looks like to me ... If you want to get really fancy, I'd lean 
towards a separate package.

OK, I want to try things which it sounds like scipy won't support, so, 
decision made: I'll aim to create a new package. If I actually do it, 
and if it's actually 'good', then there'll be a better discussion point 
for integrating it (if at all).

> This older discussion on Flann may be relevant: 
> https://mail.python.org/pipermail/scipy-dev/2011-May/thread.html. It 
> says Flann only does Euclidean; not sure if that has changed since 
> then. Regarding a dependency: https://github.com/mariusmuja/flann is 
> basically inactivate for the last years; we wouldn't depend on it but 
> could consider vendoring it. However, probably not worth it if it's 
> for one method inside euclidean only.
>
> Faiss is still actively developed: 
> https://github.com/facebookresearch/faiss, and looks like a much 
> better option than Flann. However, something fast-moving like that 
> which itself depends on BLAS and has GPU code in it too is not 
> something we'd like to depend on nor want to vendor.
>
> Either way, Flann/Faiss is not about a 1:1 Cython translation, but 
> about new features. We've got the distance metrics; approximate 
> distances are probably best left to another package it looks like to me.
>
> On Mon, Sep 10, 2018 at 10:05 AM Mark Alexander Mikofski 
> <mikofski at berkeley.edu <mailto:mikofski at berkeley.edu>> wrote:
>
>     I'm very interested to see how a successful cython/performance PR
>     progresses from a reviewers standpoint.
>
>     On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy at gmail.com
>     <mailto:tyler.je.reddy at gmail.com>> wrote:
>
>         Good to see some activity / interest in spatial.
>
>         Definitely agree with Ralf's github comments re: using smaller
>         / more tractable PRs -- it really is tough to sit down at
>         night with 30 minutes of free time or whatever and look at a
>         massive diff & not want to give up.
>
>         I like the idea of making small / targeted / clearly
>         demonstrated performance improvements without overhauling the
>         entire infrastructure first, but maybe that's a controversial
>         view if it just causes too much heterogeneity.
>
>         Presumably the affected code is all thoroughly covered by unit
>         tests? That's an important pre-requisite to have the
>         confidence to really start making changes, esp. with older
>         low-level code like that.
>
>         On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell
>         <email.the.odonnells at gmail.com
>         <mailto:email.the.odonnells at gmail.com>> wrote:
>
>             Hi all,
>
>             TLDR; I'm wondering about a) porting the spatial.distance
>             code to Cython, and b) adding some performance
>             optimizations, and I'd like the dev community's
>             input/feedback.
>
>             For context (though I don't really need you to read these
>             to give the feedback I'm looking for),
>
>             - original issue/proposal:
>             https://github.com/scipy/scipy/issues/9205
>             - PR: https://github.com/scipy/scipy/pull/9218
>
>             Before submitting the PR, I naively thought it was going
>             to be nice Cython or similar. Turns out it's some pretty
>             old code, that I found pretty hard to wrap my head around
>             and understand. I eventually figured it out after spending
>             ages debugging a nastily hidden 'bug', and demonstrated
>             the performance optimizations, but it prompted the
>             discussion about whether it was best to port everything to
>             Cython first.
>
>             *Existing stuff to Cython*
>
>             Doing so shouldn't be too hard, and it shouldn't change
>             any functionality, except to replace the distance
>             functions with their Cython ones (instead of the current
>             approach, where the distance functions are actually numpy
>             things, and there's not supported access to the underlying
>             C stuff). A few possible 'bugs' (as above) should
>             hopefully become non-issues too. So, it'd be a win for
>             performance (e.g. all the distance functions will be much
>             faster), and code quality, and future maintainability and
>             development. However, things to think about:
>
>             - should I just replace like-for-like, or consider some
>             nicer OOP stuff like e.g. sklearn's approach (which is
>             already Cython)?
>             https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx
>             (I'm guessing the reason they rolled their own was because
>             they found scipy lacking, as above.) In fact, we could
>             largely just copy that file over. Not sure about the
>             interplay between scipy and scikit learn though.
>
>
> It wouldn't be the first time that code is moved from scikit-learn to 
> scipy, that could make sense. Would be good to see if that makes sense 
> from scikit-learn's point of view.
>
>             - what's the best way to structure a PR? One metric at a
>             time, or the whole caboodle?
>
>
> How about one metric first (easier to review), and then the rest in 
> one go?
>
>
>             *Adding performance optimizations and future functionality*
>             *
>             *
>             As per the links, this was initially about providing a
>             nifty performance hack. It should still be pretty easy to
>             implement. Personally, I think it makes sense to implement
>             after the Cythonization - unless the community are against
>             that.
>
>             However, there are other possibilities:
>
>             - using 'indices' within calculations. E.g. when doing
>             pdist, it might pay to use a tree of some description. I
>             also proposed another 'index' to optimize the 'bail early'
>             approach further (which would, I think, actually work well
>             with trees too). This would involve more API changes,
>             possibly significant.
>
>             - using approximate searches (e.g. Faiss). My
>             understanding is that depending on other libraries
>             probably isn't really an option, so I'm not sure what means.
>             - maybe other cool approaches like
>             https://github.com/droyed/eucl_dist
>             - providing a way to 'tune' distance computations to be
>             optimal to your particular dataset and constraints (e.g.
>             my 'bail early' optimization might be a lot faster or a
>             bit slower, depending on your data ... or you might be OK
>             with approximate matching with a 'low' error rate, etc.)
>
>
> I think some of these are an option; they'd need to be applicable to 
> all distance metrics though and not just euclidean or a small subset. 
> In the mailing list thread I linked to above there was some discussion 
> as well about using kdtree/balltree.
>
>
>             I guess what I'd like to see is a single place where users
>             can get access to everything related to distance metrics
>             and their uses, including all sorts of optimizations etc.
>             (possibly for different hardware, and possibly
>             distributed). To do that well is a pretty big undertaking,
>             and I don't know whether it's suited to scipy - e.g. maybe
>             scipy doesn't really care about distance stuff, or only
>             wants to stick with 'simple' distance metric cases (e.g. a
>             few thousand vectors, etc.). So, maybe it'd be better to
>             start a completely new python package - which would
>             probably be a lot easier to develop as I'd have a lot more
>             flexibility (e.g. to depend on other packages, and not
>             have to worry about breaking the scipy API etc.). On the
>             other hand (as discussed in the latest comment on the PR),
>             that might not be best - it might never get
>             used/maintained etc.
>
>
> If you want to get really fancy, I'd lean towards a separate package. 
> The idea of your current PR is in scopy for scipy.spatial though. We'd 
> also be happy to link to a separate package from the scipy docs.
>
> Cheers,
> Ralf
>
>
>
>             So, what does the community think is the best approach?
>             I've got too little context of what scipy is and what it's
>             aiming for, and I don't want to head off on the wrong
>             tack. Comments on any of the other implied questions would
>             also be appreciated.
>
>             Thanks,
>
>             kodonnell
>
>
>             _______________________________________________
>             SciPy-Dev mailing list
>             SciPy-Dev at python.org <mailto:SciPy-Dev at python.org>
>             https://mail.python.org/mailman/listinfo/scipy-dev
>
>         _______________________________________________
>         SciPy-Dev mailing list
>         SciPy-Dev at python.org <mailto:SciPy-Dev at python.org>
>         https://mail.python.org/mailman/listinfo/scipy-dev
>
>     _______________________________________________
>     SciPy-Dev mailing list
>     SciPy-Dev at python.org <mailto:SciPy-Dev at python.org>
>     https://mail.python.org/mailman/listinfo/scipy-dev
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20180916/70bde40b/attachment-0001.html>


More information about the SciPy-Dev mailing list