[SciPy-Dev] Improving spatial.distance code and performance
Kane & Anna O'Donnell
email.the.odonnells at gmail.com
Sun Sep 16 04:26:18 EDT 2018
Sorry, flann and faiss were just examples (I haven't actually researched
different libraries in depth).
> ... approximate distances are probably best left to another package
it looks like to me ... If you want to get really fancy, I'd lean
towards a separate package.
OK, I want to try things which it sounds like scipy won't support, so,
decision made: I'll aim to create a new package. If I actually do it,
and if it's actually 'good', then there'll be a better discussion point
for integrating it (if at all).
> This older discussion on Flann may be relevant:
> https://mail.python.org/pipermail/scipy-dev/2011-May/thread.html. It
> says Flann only does Euclidean; not sure if that has changed since
> then. Regarding a dependency: https://github.com/mariusmuja/flann is
> basically inactivate for the last years; we wouldn't depend on it but
> could consider vendoring it. However, probably not worth it if it's
> for one method inside euclidean only.
>
> Faiss is still actively developed:
> https://github.com/facebookresearch/faiss, and looks like a much
> better option than Flann. However, something fast-moving like that
> which itself depends on BLAS and has GPU code in it too is not
> something we'd like to depend on nor want to vendor.
>
> Either way, Flann/Faiss is not about a 1:1 Cython translation, but
> about new features. We've got the distance metrics; approximate
> distances are probably best left to another package it looks like to me.
>
> On Mon, Sep 10, 2018 at 10:05 AM Mark Alexander Mikofski
> <mikofski at berkeley.edu <mailto:mikofski at berkeley.edu>> wrote:
>
> I'm very interested to see how a successful cython/performance PR
> progresses from a reviewers standpoint.
>
> On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy at gmail.com
> <mailto:tyler.je.reddy at gmail.com>> wrote:
>
> Good to see some activity / interest in spatial.
>
> Definitely agree with Ralf's github comments re: using smaller
> / more tractable PRs -- it really is tough to sit down at
> night with 30 minutes of free time or whatever and look at a
> massive diff & not want to give up.
>
> I like the idea of making small / targeted / clearly
> demonstrated performance improvements without overhauling the
> entire infrastructure first, but maybe that's a controversial
> view if it just causes too much heterogeneity.
>
> Presumably the affected code is all thoroughly covered by unit
> tests? That's an important pre-requisite to have the
> confidence to really start making changes, esp. with older
> low-level code like that.
>
> On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell
> <email.the.odonnells at gmail.com
> <mailto:email.the.odonnells at gmail.com>> wrote:
>
> Hi all,
>
> TLDR; I'm wondering about a) porting the spatial.distance
> code to Cython, and b) adding some performance
> optimizations, and I'd like the dev community's
> input/feedback.
>
> For context (though I don't really need you to read these
> to give the feedback I'm looking for),
>
> - original issue/proposal:
> https://github.com/scipy/scipy/issues/9205
> - PR: https://github.com/scipy/scipy/pull/9218
>
> Before submitting the PR, I naively thought it was going
> to be nice Cython or similar. Turns out it's some pretty
> old code, that I found pretty hard to wrap my head around
> and understand. I eventually figured it out after spending
> ages debugging a nastily hidden 'bug', and demonstrated
> the performance optimizations, but it prompted the
> discussion about whether it was best to port everything to
> Cython first.
>
> *Existing stuff to Cython*
>
> Doing so shouldn't be too hard, and it shouldn't change
> any functionality, except to replace the distance
> functions with their Cython ones (instead of the current
> approach, where the distance functions are actually numpy
> things, and there's not supported access to the underlying
> C stuff). A few possible 'bugs' (as above) should
> hopefully become non-issues too. So, it'd be a win for
> performance (e.g. all the distance functions will be much
> faster), and code quality, and future maintainability and
> development. However, things to think about:
>
> - should I just replace like-for-like, or consider some
> nicer OOP stuff like e.g. sklearn's approach (which is
> already Cython)?
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx
> (I'm guessing the reason they rolled their own was because
> they found scipy lacking, as above.) In fact, we could
> largely just copy that file over. Not sure about the
> interplay between scipy and scikit learn though.
>
>
> It wouldn't be the first time that code is moved from scikit-learn to
> scipy, that could make sense. Would be good to see if that makes sense
> from scikit-learn's point of view.
>
> - what's the best way to structure a PR? One metric at a
> time, or the whole caboodle?
>
>
> How about one metric first (easier to review), and then the rest in
> one go?
>
>
> *Adding performance optimizations and future functionality*
> *
> *
> As per the links, this was initially about providing a
> nifty performance hack. It should still be pretty easy to
> implement. Personally, I think it makes sense to implement
> after the Cythonization - unless the community are against
> that.
>
> However, there are other possibilities:
>
> - using 'indices' within calculations. E.g. when doing
> pdist, it might pay to use a tree of some description. I
> also proposed another 'index' to optimize the 'bail early'
> approach further (which would, I think, actually work well
> with trees too). This would involve more API changes,
> possibly significant.
>
> - using approximate searches (e.g. Faiss). My
> understanding is that depending on other libraries
> probably isn't really an option, so I'm not sure what means.
> - maybe other cool approaches like
> https://github.com/droyed/eucl_dist
> - providing a way to 'tune' distance computations to be
> optimal to your particular dataset and constraints (e.g.
> my 'bail early' optimization might be a lot faster or a
> bit slower, depending on your data ... or you might be OK
> with approximate matching with a 'low' error rate, etc.)
>
>
> I think some of these are an option; they'd need to be applicable to
> all distance metrics though and not just euclidean or a small subset.
> In the mailing list thread I linked to above there was some discussion
> as well about using kdtree/balltree.
>
>
> I guess what I'd like to see is a single place where users
> can get access to everything related to distance metrics
> and their uses, including all sorts of optimizations etc.
> (possibly for different hardware, and possibly
> distributed). To do that well is a pretty big undertaking,
> and I don't know whether it's suited to scipy - e.g. maybe
> scipy doesn't really care about distance stuff, or only
> wants to stick with 'simple' distance metric cases (e.g. a
> few thousand vectors, etc.). So, maybe it'd be better to
> start a completely new python package - which would
> probably be a lot easier to develop as I'd have a lot more
> flexibility (e.g. to depend on other packages, and not
> have to worry about breaking the scipy API etc.). On the
> other hand (as discussed in the latest comment on the PR),
> that might not be best - it might never get
> used/maintained etc.
>
>
> If you want to get really fancy, I'd lean towards a separate package.
> The idea of your current PR is in scopy for scipy.spatial though. We'd
> also be happy to link to a separate package from the scipy docs.
>
> Cheers,
> Ralf
>
>
>
> So, what does the community think is the best approach?
> I've got too little context of what scipy is and what it's
> aiming for, and I don't want to head off on the wrong
> tack. Comments on any of the other implied questions would
> also be appreciated.
>
> Thanks,
>
> kodonnell
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org <mailto:SciPy-Dev at python.org>
> https://mail.python.org/mailman/listinfo/scipy-dev
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org <mailto:SciPy-Dev at python.org>
> https://mail.python.org/mailman/listinfo/scipy-dev
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org <mailto:SciPy-Dev at python.org>
> https://mail.python.org/mailman/listinfo/scipy-dev
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20180916/70bde40b/attachment-0001.html>
More information about the SciPy-Dev
mailing list