[SciPy-Dev] Improving spatial.distance code and performance

Kane & Anna O'Donnell email.the.odonnells at gmail.com
Tue Oct 9 03:39:02 EDT 2018


OK, after a bit of playing around, it actually looks like faiss supports
much of what I was intending (including auto-tuning
<https://github.com/facebookresearch/faiss/wiki/Index-IO,-index-factory,-cloning-and-hyper-parameter-tuning>
etc.). It's only the l2 norm / dot product, but I figure those cover most
use cases anyway. So maybe I'll submit some PRs to piece-by-piece migrate
scipy.spatial to cython etc.

On Sun, 16 Sep 2018 at 20:26, Kane & Anna O'Donnell <
email.the.odonnells at gmail.com> wrote:

> Sorry, flann and faiss were just examples (I haven't actually researched
> different libraries in depth).
>
> > ... approximate distances are probably best left to another package it
> looks like to me ... If you want to get really fancy, I'd lean towards a
> separate package.
>
> OK, I want to try things which it sounds like scipy won't support, so,
> decision made: I'll aim to create a new package. If I actually do it, and
> if it's actually 'good', then there'll be a better discussion point for
> integrating it (if at all).
>
> This older discussion on Flann may be relevant:
> https://mail.python.org/pipermail/scipy-dev/2011-May/thread.html. It says
> Flann only does Euclidean; not sure if that has changed since then.
> Regarding a dependency: https://github.com/mariusmuja/flann is basically
> inactivate for the last years; we wouldn't depend on it but could consider
> vendoring it. However, probably not worth it if it's for one method inside
> euclidean only.
>
> Faiss is still actively developed:
> https://github.com/facebookresearch/faiss, and looks like a much better
> option than Flann. However, something fast-moving like that which itself
> depends on BLAS and has GPU code in it too is not something we'd like to
> depend on nor want to vendor.
>
> Either way, Flann/Faiss is not about a 1:1 Cython translation, but about
> new features. We've got the distance metrics; approximate distances are
> probably best left to another package it looks like to me.
>
> On Mon, Sep 10, 2018 at 10:05 AM Mark Alexander Mikofski <
> mikofski at berkeley.edu> wrote:
>
>> I'm very interested to see how a successful cython/performance PR
>> progresses from a reviewers standpoint.
>>
>> On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy at gmail.com>
>> wrote:
>>
>>> Good to see some activity / interest in spatial.
>>>
>>> Definitely agree with Ralf's github comments re: using smaller / more
>>> tractable PRs -- it really is tough to sit down at night with 30 minutes of
>>> free time or whatever and look at a massive diff & not want to give up.
>>>
>>> I like the idea of making small / targeted / clearly demonstrated
>>> performance improvements without overhauling the entire infrastructure
>>> first, but maybe that's a controversial view if it just causes too much
>>> heterogeneity.
>>>
>>> Presumably the affected code is all thoroughly covered by unit tests?
>>> That's an important pre-requisite to have the confidence to really start
>>> making changes, esp. with older low-level code like that.
>>>
>>> On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell <
>>> email.the.odonnells at gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> TLDR; I'm wondering about a) porting the spatial.distance code to
>>>> Cython, and b) adding some performance optimizations, and I'd like the dev
>>>> community's input/feedback.
>>>>
>>>> For context (though I don't really need you to read these to give the
>>>> feedback I'm looking for),
>>>>
>>>> - original issue/proposal: https://github.com/scipy/scipy/issues/9205
>>>> - PR: https://github.com/scipy/scipy/pull/9218
>>>>
>>>> Before submitting the PR, I naively thought it was going to be nice
>>>> Cython or similar. Turns out it's some pretty old code, that I found pretty
>>>> hard to wrap my head around and understand. I eventually figured it out
>>>> after spending ages debugging a nastily hidden 'bug', and demonstrated the
>>>> performance optimizations, but it prompted the discussion about whether it
>>>> was best to port everything to Cython first.
>>>>
>>>> *Existing stuff to Cython*
>>>>
>>>> Doing so shouldn't be too hard, and it shouldn't change any
>>>> functionality, except to replace the distance functions with their Cython
>>>> ones (instead of the current approach, where the distance functions are
>>>> actually numpy things, and there's not supported access to the underlying C
>>>> stuff). A few possible 'bugs' (as above) should hopefully become non-issues
>>>> too. So, it'd be a win for performance (e.g. all the distance functions
>>>> will be much faster), and code quality, and future maintainability and
>>>> development. However, things to think about:
>>>>
>>>> - should I just replace like-for-like, or consider some nicer OOP stuff
>>>> like e.g. sklearn's approach (which is already Cython)?
>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx
>>>> (I'm guessing the reason they rolled their own was because they found scipy
>>>> lacking, as above.) In fact, we could largely just copy that file over. Not
>>>> sure about the interplay between scipy and scikit learn though.
>>>>
>>>
> It wouldn't be the first time that code is moved from scikit-learn to
> scipy, that could make sense. Would be good to see if that makes sense from
> scikit-learn's point of view.
>
>
>> - what's the best way to structure a PR? One metric at a time, or the
>>>> whole caboodle?
>>>>
>>>
> How about one metric first (easier to review), and then the rest in one go?
>
>
>>>> *Adding performance optimizations and future functionality*
>>>>
>>>> As per the links, this was initially about providing a nifty
>>>> performance hack. It should still be pretty easy to implement. Personally,
>>>> I think it makes sense to implement after the Cythonization - unless the
>>>> community are against that.
>>>>
>>>> However, there are other possibilities:
>>>>
>>>> - using 'indices' within calculations. E.g. when doing pdist, it might
>>>> pay to use a tree of some description. I also proposed another 'index' to
>>>> optimize the 'bail early' approach further (which would, I think, actually
>>>> work well with trees too). This would involve more API changes, possibly
>>>> significant.
>>>>
>>> - using approximate searches (e.g. Faiss). My understanding is that
>>>> depending on other libraries probably isn't really an option, so I'm not
>>>> sure what means.
>>>> - maybe other cool approaches like https://github.com/droyed/eucl_dist
>>>> - providing a way to 'tune' distance computations to be optimal to your
>>>> particular dataset and constraints (e.g. my 'bail early' optimization might
>>>> be a lot faster or a bit slower, depending on your data ... or you might be
>>>> OK with approximate matching with a 'low' error rate, etc.)
>>>>
>>>
> I think some of these are an option; they'd need to be applicable to all
> distance metrics though and not just euclidean or a small subset. In the
> mailing list thread I linked to above there was some discussion as well
> about using kdtree/balltree.
>
>
>>>> I guess what I'd like to see is a single place where users can get
>>>> access to everything related to distance metrics and their uses, including
>>>> all sorts of optimizations etc. (possibly for different hardware, and
>>>> possibly distributed). To do that well is a pretty big undertaking, and I
>>>> don't know whether it's suited to scipy - e.g. maybe scipy doesn't really
>>>> care about distance stuff, or only wants to stick with 'simple' distance
>>>> metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better
>>>> to start a completely new python package - which would probably be a lot
>>>> easier to develop as I'd have a lot more flexibility (e.g. to depend on
>>>> other packages, and not have to worry about breaking the scipy API etc.).
>>>> On the other hand (as discussed in the latest comment on the PR), that
>>>> might not be best - it might never get used/maintained etc.
>>>>
>>>
> If you want to get really fancy, I'd lean towards a separate package. The
> idea of your current PR is in scopy for scipy.spatial though. We'd also be
> happy to link to a separate package from the scipy docs.
>
> Cheers,
> Ralf
>
>
>
>>>> So, what does the community think is the best approach? I've got too
>>>> little context of what scipy is and what it's aiming for, and I don't want
>>>> to head off on the wrong tack. Comments on any of the other implied
>>>> questions would also be appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> kodonnell
>>>>
>>>>
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev at python.org
>>>> https://mail.python.org/mailman/listinfo/scipy-dev
>>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at python.org
>>> https://mail.python.org/mailman/listinfo/scipy-dev
>>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at python.org
>> https://mail.python.org/mailman/listinfo/scipy-dev
>>
>
>
> _______________________________________________
> SciPy-Dev mailing listSciPy-Dev at python.orghttps://mail.python.org/mailman/listinfo/scipy-dev
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20181009/f9984593/attachment-0001.html>


More information about the SciPy-Dev mailing list