[SciPy-Dev] Improving spatial.distance code and performance

Mon Sep 10 13:05:17 EDT 2018

I'm very interested to see how a successful cython/performance PR
progresses from a reviewers standpoint.

On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy at gmail.com> wrote:

> Good to see some activity / interest in spatial.
>
> Definitely agree with Ralf's github comments re: using smaller / more
> tractable PRs -- it really is tough to sit down at night with 30 minutes of
> free time or whatever and look at a massive diff & not want to give up.
>
> I like the idea of making small / targeted / clearly demonstrated
> performance improvements without overhauling the entire infrastructure
> first, but maybe that's a controversial view if it just causes too much
> heterogeneity.
>
> Presumably the affected code is all thoroughly covered by unit tests?
> That's an important pre-requisite to have the confidence to really start
> making changes, esp. with older low-level code like that.
>
> On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell <
> email.the.odonnells at gmail.com> wrote:
>
>> Hi all,
>>
>> TLDR; I'm wondering about a) porting the spatial.distance code to Cython,
>> and b) adding some performance optimizations, and I'd like the dev
>> community's input/feedback.
>>
>> For context (though I don't really need you to read these to give the
>> feedback I'm looking for),
>>
>> - original issue/proposal: https://github.com/scipy/scipy/issues/9205
>> - PR: https://github.com/scipy/scipy/pull/9218
>>
>> Before submitting the PR, I naively thought it was going to be nice
>> Cython or similar. Turns out it's some pretty old code, that I found pretty
>> hard to wrap my head around and understand. I eventually figured it out
>> after spending ages debugging a nastily hidden 'bug', and demonstrated the
>> performance optimizations, but it prompted the discussion about whether it
>> was best to port everything to Cython first.
>>
>> *Existing stuff to Cython*
>>
>> Doing so shouldn't be too hard, and it shouldn't change any
>> functionality, except to replace the distance functions with their Cython
>> ones (instead of the current approach, where the distance functions are
>> actually numpy things, and there's not supported access to the underlying C
>> stuff). A few possible 'bugs' (as above) should hopefully become non-issues
>> too. So, it'd be a win for performance (e.g. all the distance functions
>> will be much faster), and code quality, and future maintainability and
>> development. However, things to think about:
>>
>> - should I just replace like-for-like, or consider some nicer OOP stuff
>> like e.g. sklearn's approach (which is already Cython)?
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx
>> (I'm guessing the reason they rolled their own was because they found scipy
>> lacking, as above.) In fact, we could largely just copy that file over. Not
>> sure about the interplay between scipy and scikit learn though.
>> - what's the best way to structure a PR? One metric at a time, or the
>> whole caboodle?
>>
>> *Adding performance optimizations and future functionality*
>>
>> As per the links, this was initially about providing a nifty performance
>> hack. It should still be pretty easy to implement. Personally, I think it
>> makes sense to implement after the Cythonization - unless the community are
>> against that.
>>
>> However, there are other possibilities:
>>
>> - using 'indices' within calculations. E.g. when doing pdist, it might
>> pay to use a tree of some description. I also proposed another 'index' to
>> optimize the 'bail early' approach further (which would, I think, actually
>> work well with trees too). This would involve more API changes, possibly
>> significant.
>> - using approximate searches (e.g. Faiss). My understanding is that
>> depending on other libraries probably isn't really an option, so I'm not
>> sure what means.
>> - maybe other cool approaches like https://github.com/droyed/eucl_dist
>> - providing a way to 'tune' distance computations to be optimal to your
>> particular dataset and constraints (e.g. my 'bail early' optimization might
>> be a lot faster or a bit slower, depending on your data ... or you might be
>> OK with approximate matching with a 'low' error rate, etc.)
>>
>> I guess what I'd like to see is a single place where users can get access
>> to everything related to distance metrics and their uses, including all
>> sorts of optimizations etc. (possibly for different hardware, and possibly
>> distributed). To do that well is a pretty big undertaking, and I don't know
>> whether it's suited to scipy - e.g. maybe scipy doesn't really care about
>> distance stuff, or only wants to stick with 'simple' distance metric cases
>> (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a
>> completely new python package - which would probably be a lot easier to
>> develop as I'd have a lot more flexibility (e.g. to depend on other
>> packages, and not have to worry about breaking the scipy API etc.). On the
>> other hand (as discussed in the latest comment on the PR), that might not
>> be best - it might never get used/maintained etc.
>>
>> So, what does the community think is the best approach? I've got too
>> little context of what scipy is and what it's aiming for, and I don't want
>> to head off on the wrong tack. Comments on any of the other implied
>> questions would also be appreciated.
>>
>> Thanks,
>>
>> kodonnell
>>
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at python.org
>> https://mail.python.org/mailman/listinfo/scipy-dev
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20180910/148af471/attachment.html>