[SciPy-Dev] scipy.spatial comments

Sat Mar 10 20:35:51 EST 2012

On 2012-03-10, at 4:53 AM, Ralf Gommers wrote:

> Second, squared Euclidean distance is computed by taking the square of
> the Euclidean distance. I think it would make more sense to do it the
> other way around: the Euclidean distance is computed by taking the
> square root of the squared Euclidean distance.
> 
> Makes sense, should be a little faster.

Actually, the current implementation is absolutely crazy, especially considering that SciPy has easy access to BLAS. One should never be computing Euclidean distances naively like is done in distance.c.

In [36]: def euclidean_distance(X, Y):
   ....:     x2 = (X**2).sum(axis=1)
   ....:     y2 = (Y**2).sum(axis=1)
   ....:     dist = np.dot(X, Y.T)
   ....:     dist *= -2
   ....:     dist += x2[:, np.newaxis]
   ....:     dist += y2
   ....:     return np.sqrt(dist)
   ....: 

In [37]: from scipy.spatial.distance import cdist

In [38]: x = np.random.randn(500, 50)

In [39]: y = np.random.randn(400, 50)

In [40]: timeit euclidean_distance(x, y)
100 loops, best of 3: 7.57 ms per loop

In [41]: timeit cdist(x, y, 'euclidean')
100 loops, best of 3: 16.8 ms per loop

In [42]: np.allclose(euclidean_distance(x, y), cdist(x, y, 'euclidean'))
Out[42]: True

So, a Python implementation that calls NumPy is 2x faster (on my machine with EPD/MKL). A C/Cython implementation that called GEMM directly and didn't iterate over the data multiple times or allocate temporaries would be even faster.

David