[SciPy-Dev] ENH: add distance gufuncs to `scipy.spatial`

Jaime Fernández del Río jaime.frio at gmail.com
Thu Dec 19 01:55:26 EST 2013


Hi,

I have just submitted a rather bulky pull request (
https://github.com/scipy/scipy/pull/3163), which deserves a lengthier
explanation... It basically adds a new submodule, umath_distance, to
scipy.spatial. This module reproduces some of the functionality of the
distance submodule using gufuncs, and thus gains all the broadcasting
sweetness that comes with them.

Some examples
=============
>>> import numpy as np
>>> import scipy.spatial.umath_distance as udist
>>> a = np.random.rand(3, 10)
>>> b = np.random.rand(4, 10)

Compute the distance between corresponding vectors in two multidimensional
arrays:

>>> udist.euclidean(a, b[:3])
array([ 1.25080976,  1.00788142,  0.89158743])

Compute all distances between two sets of vectors, i.e. what distance.cdist
does, using broadcasting...

>>> udist.euclidean(a[:, np.newaxis, :], b)
array([[ 1.25080976,  1.20893157,  1.63674343,  1.62432255],
       [ 1.36665537,  1.00788142,  1.10137183,  1.29196888],
       [ 1.24924713,  1.27379914,  0.89158743,  1.317478  ]])

...or with a convenience argument in the Python wrapper:

>>> udist.euclidean(a, b, cdist=True)
array([[ 1.25080976,  1.20893157,  1.63674343,  1.62432255],
       [ 1.36665537,  1.00788142,  1.10137183,  1.29196888],
       [ 1.24924713,  1.27379914,  0.89158743,  1.317478  ]])
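The broadcasting behavior above can be reproduced in pure NumPy, which may help clarify what the gufunc core is doing under the hood. This is only an illustrative sketch of the semantics, not the module's actual implementation:

```python
import numpy as np

a = np.random.rand(3, 10)
b = np.random.rand(4, 10)

# gufunc-style euclidean: reduce over the last (core) axis,
# broadcast over all leading (loop) axes
def euclidean(u, v):
    return np.sqrt(((np.asarray(u) - np.asarray(v)) ** 2).sum(axis=-1))

pair = euclidean(a, b[:3])                # corresponding rows -> shape (3,)
full = euclidean(a[:, np.newaxis, :], b)  # all pairs, cdist-style -> shape (3, 4)
```

Here full[i, j] is the distance between a[i] and b[j], so the "corresponding vectors" result is just the diagonal of the leading 3x3 block.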

If a single input is given, compute all pairwise distances, i.e. what
distance.pdist does:

>>> udist.euclidean(a)
array([ 1.59655565,  1.73136646,  0.99252024])

And check the result is correct with a cdist-style call:

>>> udist.euclidean(a, a, cdist=True)
array([[ 0.        ,  1.59655565,  1.73136646],
       [ 1.59655565,  0.        ,  0.99252024],
       [ 1.73136646,  0.99252024,  0.        ]])
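For reference, the condensed pdist-style vector from the single-input case relates to the square cdist-style matrix in the usual way: it is the strict upper triangle in row-major order. A pure-NumPy sketch of that relationship (again, not the module's actual code):

```python
import numpy as np

a = np.random.rand(3, 10)

# full pairwise matrix via broadcasting, like a cdist(a, a)-style call
d = np.sqrt(((a[:, np.newaxis, :] - a[np.newaxis, :, :]) ** 2).sum(axis=-1))

# condensed pdist-style vector: strict upper triangle, row-major order
condensed = d[np.triu_indices(len(a), k=1)]
```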

What's included
===============
So far only the distance functions that take exactly two inputs are
included, i.e. everything in scipy.spatial.distance except mahalanobis,
minkowski, seuclidean and wminkowski. If people like this gufunc approach,
I will happily
expand it to include those as well. The code so far is very modular, both
at the C and Python level, and makes it relatively easy to add a new
distance function from a C function doing a single distance calculation
between two vectors. I haven't figured out all the details, but I think it
would be possible to move the gufunc kernels to a header file, and make it
easy(er) to add a new distance gufunc, perhaps even from Cython, in a
similar way to how it can be done with PyUFunc_dd_d and the like.
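To illustrate the kind of modularity described, here is a hypothetical pure-Python analogue (the real kernels live in C, and `make_distance` is an invented name, not anything in the PR): a scalar core function computing a single distance between two vectors is lifted into a broadcasting, gufunc-like callable.

```python
import numpy as np

def make_distance(core):
    # Hypothetical pure-Python analogue of registering a gufunc kernel:
    # lift core(u, v) -> float into a callable that reduces over the
    # last axis and broadcasts over all leading (loop) dimensions.
    def dist(a, b):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        loop_shape = np.broadcast_shapes(a.shape[:-1], b.shape[:-1])
        a = np.broadcast_to(a, loop_shape + a.shape[-1:])
        b = np.broadcast_to(b, loop_shape + b.shape[-1:])
        out = np.empty(loop_shape)
        for idx in np.ndindex(*loop_shape):
            out[idx] = core(a[idx], b[idx])
        return out
    return dist

# adding a new "distance" is then a one-liner on top of the machinery
cityblock = make_distance(lambda u, v: np.abs(u - v).sum())
```

The real gufunc machinery does the same loop in C, which is exactly where the function-pointer overhead discussed below comes from.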

I have also spent a fair amount of time checking the definitions of the
implemented distances against the original sources, and I think I have
found some errors there, both in the definitions themselves
(Sokal-Michener is wrong) and in the way pathological edge cases are
handled. The C code is heavily commented on the subject, starting around
line 154.
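For comparison, the textbook Sokal-Michener (1958) simple matching dissimilarity is just the fraction of disagreeing coordinates; a minimal sketch of that definition (which boolean vectors whether scipy's current formula matches is what the C comments discuss):

```python
import numpy as np

def sokal_michener(u, v):
    # textbook simple matching dissimilarity for boolean vectors:
    # (c_tf + c_ft) / n, i.e. the fraction of coordinates that disagree
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    return np.mean(u != v)
```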

Performance
===========
Compared to the existing cdist and pdist functions, replacing inlined
functions with function pointers does take a toll on the performance of
the gufunc approach, and the toll is worse the smaller the dimension of
the vectors being compared, i.e. the less work each function call does:

>>> import timeit
>>> import scipy.spatial.distance as dist
>>> a = np.random.rand(1000, 3)
>>> b = np.random.rand(2000, 3)
>>> min(timeit.repeat("dist.cdist(a, b, metric='euclidean')",
...                   'from __main__ import dist, a, b', number=1, repeat=100))
0.017471377426346635
>>> min(timeit.repeat("udist.euclidean(a, b, cdist=True)",
...                   'from __main__ import udist, a, b', number=1, repeat=100))
0.023448978561347644

>>> a = np.random.rand(1000, 100)
>>> b = np.random.rand(2000, 100)
>>> min(timeit.repeat("dist.cdist(a, b, metric='euclidean')",
...                   'from __main__ import dist, a, b', number=1, repeat=100))
0.20550328478645952
>>> min(timeit.repeat("udist.euclidean(a, b, cdist=True)",
...                   'from __main__ import udist, a, b', number=1, repeat=100))
0.22972938023184497

On the other hand, the C implementations of most of the boolean functions
in distance are grossly inefficient, so there are significant speed-ups
there despite the extra overhead:

>>> a = np.random.rand(1000, 100) < 0.5
>>> b = np.random.rand(2000, 100) < 0.5
>>> min(timeit.repeat("dist.cdist(a, b, metric='sokalsneath')",
...                   'from __main__ import dist, a, b', number=1, repeat=100))
1.43636283024901
>>> min(timeit.repeat("udist.sokalsneath(a, b, cdist=True)",
...                   'from __main__ import udist, a, b', number=1, repeat=100))
0.3878355139542009
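One way to see why the boolean metrics leave so much room for speed-ups is that all the pairwise contingency counts can be computed in bulk with matrix products. A pure-NumPy sketch for Sokal-Sneath (not the gufunc code, just an illustration of the vectorization):

```python
import numpy as np

a = np.random.rand(5, 8) < 0.5
b = np.random.rand(7, 8) < 0.5
a[:, 0] = b[:, 0] = True   # avoid the all-False edge case in this sketch

# contingency counts for all (i, j) pairs at once via matrix products
ai, bi = a.astype(np.int64), b.astype(np.int64)
ctt = ai @ bi.T              # coordinates where both are True
ctf = ai @ (1 - bi).T        # a True, b False
cft = (1 - ai) @ bi.T        # a False, b True

# Sokal-Sneath dissimilarity: R / (c_tt + R) with R = 2 * (c_tf + c_ft)
R = 2.0 * (ctf + cft)
d = R / (ctt + R)            # shape (5, 7), one distance per pair
```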



-- 
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.

