[SciPy-Dev] ENH: add distance gufuncs to `scipy.spatial`
Jaime Fernández del Río
jaime.frio at gmail.com
Thu Dec 19 01:55:26 EST 2013
Hi,
I have just submitted a quite bulky pull request (
https://github.com/scipy/scipy/pull/3163), that deserves a lengthier
explanation... It basically adds a new submodule, umath_distance, to
scipy.spatial. This module reproduces some of the functionality in the
distance submodule, using gufuncs, and thus adding all the broadcasting
sweetness that comes with them.
Some examples
=============
>>> import numpy as np
>>> import scipy.spatial.umath_distance as udist
>>> a = np.random.rand(3, 10)
>>> b = np.random.rand(4, 10)
Compute the distance between corresponding vectors in two multidimensional
arrays:
>>> udist.euclidean(a, b[:3])
array([ 1.25080976, 1.00788142, 0.89158743])
Compute all distances between to sets of vectors, i.e. what distance.cdist
does, using broadcasting...
>>> udist.euclidean(a[:, np.newaxis, :], b)
array([[ 1.25080976, 1.20893157, 1.63674343, 1.62432255],
[ 1.36665537, 1.00788142, 1.10137183, 1.29196888],
[ 1.24924713, 1.27379914, 0.89158743, 1.317478 ]])
...or a convenience argument in the Python wrapper:
>>> udist.euclidean(a, b, cdist=True)
array([[ 1.25080976, 1.20893157, 1.63674343, 1.62432255],
[ 1.36665537, 1.00788142, 1.10137183, 1.29196888],
[ 1.24924713, 1.27379914, 0.89158743, 1.317478 ]])
If a single input is given, compute all pairwise distances, i.e. what
distance.pdist does:
>>> udist.euclidean(a)
array([ 1.59655565, 1.73136646, 0.99252024])
And check the result is correct with a cdist-style call:
>>> udist.euclidean(a, a, cdist=True)
array([[ 0. , 1.59655565, 1.73136646],
[ 1.59655565, 0. , 0.99252024],
[ 1.73136646, 0.99252024, 0. ]])
What's included
===============
So far only the distance functions of only two inputs are included, which
are all in scipy.spatial.distance except mahalanobis, minkowski, seuclidean
and wminkowski. If you people like this gufunc approach, I will happily
expand it to include those as well. The code so far is very modular, both
at the C and Python level, and makes it relatively easy to add a new
distance function from a C function doing a single distance calculation
between two vectors. I haven't figured out all the details, but I think it
would be possible to move the gufunc kernels to a header file, and make it
easy(er) to add a new distance gufunc, perhaps even from Cython, in a
similar way to how it can be done with PyUFunc_dd_d and the like.
I have also spent a fair amount of time checking the definition of the
implemented distances from original sources, and I think I have found some
errors there, both in the definitions (Sokal-Michener is wrong), and in the
way pathological edge cases are handled. The C code is heavily commented on
the subject, starting around line 154.
Performance
===========
For the existing cdist and pdist functions, replacing inline functions with
function pointers does take a toll on the performance of the gufunc
approach, which is worse the smaller the dimension of the vectors being
compared, i.e. the less work each function call does:
>>> import timeit
>>> import scipy.spatial.distance as dist
>>> a = np.random.rand(1000, 3)
>>> b = np.random.rand(2000, 3)
>>> min(timeit.repeat("dist.cdist(a, b, metric='euclidean')", 'from
__main__ import dist, a, b', number=1, repeat=100))
0.017471377426346635
>>> min(timeit.repeat("udist.euclidean(a, b, cdist=True)", 'from __main__
import udist, a, b', number=1, repeat=100))
0.023448978561347644
>>> a = np.random.rand(1000, 100)
>>> b = np.random.rand(2000, 100)
>>> min(timeit.repeat("dist.cdist(a, b, metric='euclidean')", 'from
__main__ import dist, a, b', number=1, repeat=100))
0.20550328478645952
>>> min(timeit.repeat("udist.euclidean(a, b, cdist=True)", 'from __main__
import udist, a, b', number=1, repeat=100))
0.22972938023184497
On the other hand, the C implementations in distance of most of the boolean
functions are grossly inefficient, so there are significant speed-ups there
despite the extra overhead:
>>> a = np.random.rand(1000, 100) < 0.5
>>> b = np.random.rand(2000, 100) < 0.5
>>> min(timeit.repeat("dist.cdist(a, b, metric='sokalsneath')", 'from
__main__ import dist, a, b', number=1, repeat=100))
1.43636283024901
>>> min(timeit.repeat("udist.sokalsneath(a, b, cdist=True)", 'from __main__
import udist, a, b', number=1, repeat=100))
0.3878355139542009
--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20131218/208e3931/attachment.html>
More information about the SciPy-Dev
mailing list