[SciPy-Dev] Improving spatial.distance code and, performance

Kane & Anna O'Donnell email.the.odonnells at gmail.com
Sat Sep 15 04:09:20 EDT 2018


Sorry if this doesn't work - I haven't used mailing lists before, and 
got the digest, which doesn't seem to allow me to reply to a specific 
thread. So I'm attempting to set the In-Reply-To header manually. Sorry 
if that's silly.

Anyway:

Yes, I believe there are good unit tests (Ralf actually ran them I 
think). From the basis of this, and the feedback in the issue/PR, it 
sounds like the preference would be for me to upgrade the distance 
metrics to cython bit-by-bit.

However, I'm still unsure about scipy vs a new package, since scipy 
apparently won't support dependencies. For example, I could use Faiss or 
flann, but scipy wouldn't add them as a dependency. With that in mind, 
it seems the only option for offering users full-functionality (aside 
from rewriting those libraries from scratch etc.) is to offer a package. 
Unless scipy is willing to specify a run-time dependency (not sure of 
the correct terminology) - e.g. scipy doesn't depend on Faiss (when you 
install it), but if the user calls a distance metric with e.g. 
method='faiss', it will fail if faiss cannot be imported. Is this a 
possibility (and a good idea)?


On 2018/09/12 4:00 am, scipy-dev-request at python.org wrote:
> Send SciPy-Dev mailing list submissions to
> 	scipy-dev at python.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://mail.python.org/mailman/listinfo/scipy-dev
> or, via email, send a message with subject or body 'help' to
> 	scipy-dev-request at python.org
>
> You can reach the person managing the list at
> 	scipy-dev-owner at python.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of SciPy-Dev digest..."
>
>
> Today's Topics:
>
>     1. Re: Improving spatial.distance code and performance
>        (Mark Alexander Mikofski)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 10 Sep 2018 10:05:17 -0700
> From: Mark Alexander Mikofski <mikofski at berkeley.edu>
> To: SciPy Developers List <scipy-dev at python.org>
> Subject: Re: [SciPy-Dev] Improving spatial.distance code and
> 	performance
> Message-ID:
> 	<CAEqRcW3x6qZ6X_kpg2U54O7=E4+S6UtkoM3pFQRv8PzQQBCw+w at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> I'm very interested to see how a successful cython/performance PR
> progresses from a reviewers standpoint.
>
> On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy at gmail.com> wrote:
>
>> Good to see some activity / interest in spatial.
>>
>> Definitely agree with Ralf's github comments re: using smaller / more
>> tractable PRs -- it really is tough to sit down at night with 30 minutes of
>> free time or whatever and look at a massive diff & not want to give up.
>>
>> I like the idea of making small / targeted / clearly demonstrated
>> performance improvements without overhauling the entire infrastructure
>> first, but maybe that's a controversial view if it just causes too much
>> heterogeneity.
>>
>> Presumably the affected code is all thoroughly covered by unit tests?
>> That's an important pre-requisite to have the confidence to really start
>> making changes, esp. with older low-level code like that.
>>
>> On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell <
>> email.the.odonnells at gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> TLDR; I'm wondering about a) porting the spatial.distance code to Cython,
>>> and b) adding some performance optimizations, and I'd like the dev
>>> community's input/feedback.
>>>
>>> For context (though I don't really need you to read these to give the
>>> feedback I'm looking for),
>>>
>>> - original issue/proposal: https://github.com/scipy/scipy/issues/9205
>>> - PR: https://github.com/scipy/scipy/pull/9218
>>>
>>> Before submitting the PR, I naively thought it was going to be nice
>>> Cython or similar. Turns out it's some pretty old code, that I found pretty
>>> hard to wrap my head around and understand. I eventually figured it out
>>> after spending ages debugging a nastily hidden 'bug', and demonstrated the
>>> performance optimizations, but it prompted the discussion about whether it
>>> was best to port everything to Cython first.
>>>
>>> *Existing stuff to Cython*
>>>
>>> Doing so shouldn't be too hard, and it shouldn't change any
>>> functionality, except to replace the distance functions with their Cython
>>> ones (instead of the current approach, where the distance functions are
>>> actually numpy things, and there's not supported access to the underlying C
>>> stuff). A few possible 'bugs' (as above) should hopefully become non-issues
>>> too. So, it'd be a win for performance (e.g. all the distance functions
>>> will be much faster), and code quality, and future maintainability and
>>> development. However, things to think about:
>>>
>>> - should I just replace like-for-like, or consider some nicer OOP stuff
>>> like e.g. sklearn's approach (which is already Cython)?
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx
>>> (I'm guessing the reason they rolled their own was because they found scipy
>>> lacking, as above.) In fact, we could largely just copy that file over. Not
>>> sure about the interplay between scipy and scikit learn though.
>>> - what's the best way to structure a PR? One metric at a time, or the
>>> whole caboodle?
>>>
>>> *Adding performance optimizations and future functionality*
>>>
>>> As per the links, this was initially about providing a nifty performance
>>> hack. It should still be pretty easy to implement. Personally, I think it
>>> makes sense to implement after the Cythonization - unless the community are
>>> against that.
>>>
>>> However, there are other possibilities:
>>>
>>> - using 'indices' within calculations. E.g. when doing pdist, it might
>>> pay to use a tree of some description. I also proposed another 'index' to
>>> optimize the 'bail early' approach further (which would, I think, actually
>>> work well with trees too). This would involve more API changes, possibly
>>> significant.
>>> - using approximate searches (e.g. Faiss). My understanding is that
>>> depending on other libraries probably isn't really an option, so I'm not
>>> sure what means.
>>> - maybe other cool approaches like https://github.com/droyed/eucl_dist
>>> - providing a way to 'tune' distance computations to be optimal to your
>>> particular dataset and constraints (e.g. my 'bail early' optimization might
>>> be a lot faster or a bit slower, depending on your data ... or you might be
>>> OK with approximate matching with a 'low' error rate, etc.)
>>>
>>> I guess what I'd like to see is a single place where users can get access
>>> to everything related to distance metrics and their uses, including all
>>> sorts of optimizations etc. (possibly for different hardware, and possibly
>>> distributed). To do that well is a pretty big undertaking, and I don't know
>>> whether it's suited to scipy - e.g. maybe scipy doesn't really care about
>>> distance stuff, or only wants to stick with 'simple' distance metric cases
>>> (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a
>>> completely new python package - which would probably be a lot easier to
>>> develop as I'd have a lot more flexibility (e.g. to depend on other
>>> packages, and not have to worry about breaking the scipy API etc.). On the
>>> other hand (as discussed in the latest comment on the PR), that might not
>>> be best - it might never get used/maintained etc.
>>>
>>> So, what does the community think is the best approach? I've got too
>>> little context of what scipy is and what it's aiming for, and I don't want
>>> to head off on the wrong tack. Comments on any of the other implied
>>> questions would also be appreciated.
>>>
>>> Thanks,
>>>
>>> kodonnell
>>>
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at python.org
>>> https://mail.python.org/mailman/listinfo/scipy-dev
>>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at python.org
>> https://mail.python.org/mailman/listinfo/scipy-dev
>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20180910/148af471/attachment-0001.html>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>
>
> ------------------------------
>
> End of SciPy-Dev Digest, Vol 179, Issue 12
> ******************************************



More information about the SciPy-Dev mailing list