[SciPy-User] SciPy and Recursion

Mon Feb 28 06:20:59 EST 2011

On Mon, Feb 28, 2011 at 09:56:41AM +0100, Sebastian Haase wrote:
> could you explain, what you mean by suboptimal!?
> Do you mean speed-wise?
> I had a longish thread on the numpy list recently, where I was trying
> to gain speed using OpenMP and/or SSE.
> And cdist turned out to be as fast as my (best) C implementation (for
> less than 2-3 threads).

I did mean speed-wise: for high-dimensional data, scikit learn can be
significantly faster:

In [1]: X = np.random.random((1000, 500))

In [2]: Y = np.random.random((1000, 500))

In [3]: from scipy import spatial as sp

In [4]: %time sp.distance.cdist(X, Y)
CPU times: user 0.56 s, sys: 0.00 s, total: 0.56 s
Wall time: 1.16 s
Out[5]: 
array([[ 9.14394009,  9.27152238,  8.9976296 , ...,  9.18902138,
         8.63073757,  8.8818356 ],
       [ 9.03243891,  9.37592823,  8.76692936, ...,  9.25943615,
         9.09636773,  8.75653576],
       [ 9.06511143,  8.69746052,  9.12285065, ...,  9.08133078,
         8.93667671,  9.00539463],
       ..., 
       [ 9.35929309,  8.87066188,  9.24649229, ...,  9.4306161 ,
         9.12252869,  9.00311071],
       [ 9.25729667,  8.9454522 ,  9.17794614, ...,  9.30332972,
         9.43599469,  9.00881447],
       [ 9.10675538,  8.67428177,  8.6647222 , ...,  8.89505099,
         9.12760646,  9.01155698]])

In [6]: from scikits.learn.metrics import pairwise

In [7]: %time pairwise.euclidean_distances(X, Y)
CPU times: user 0.17 s, sys: 0.01 s, total: 0.18 s
Wall time: 0.20 s
Out[8]: 
array([[ 9.14394009,  9.27152238,  8.9976296 , ...,  9.18902138,
         8.63073757,  8.8818356 ],
       [ 9.03243891,  9.37592823,  8.76692936, ...,  9.25943615,
         9.09636773,  8.75653576],
       [ 9.06511143,  8.69746052,  9.12285065, ...,  9.08133078,
         8.93667671,  9.00539463],
       ..., 
       [ 9.35929309,  8.87066188,  9.24649229, ...,  9.4306161 ,
         9.12252869,  9.00311071],
       [ 9.25729667,  8.9454522 ,  9.17794614, ...,  9.30332972,
         9.43599469,  9.00881447],
       [ 9.10675538,  8.67428177,  8.6647222 , ...,  8.89505099,
         9.12760646,  9.01155698]])
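
A likely reason for the gap, sketched under an assumption not spelled out in
this thread: euclidean_distances can expand the squared distance as
||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, so the bulk of the work becomes one
matrix product that is handed to BLAS. The function names below are
hypothetical and only illustrate the idea with plain NumPy; they are not the
actual scikit-learn or scipy code.

```python
import numpy as np


def euclidean_distances_expanded(X, Y):
    """Pairwise distances via ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2."""
    # Squared row norms, shaped so they broadcast against the (n, m) product.
    XX = np.einsum("ij,ij->i", X, X)[:, np.newaxis]  # shape (n, 1)
    YY = np.einsum("ij,ij->i", Y, Y)[np.newaxis, :]  # shape (1, m)
    # The matrix product is the expensive step and the one BLAS accelerates.
    D2 = XX - 2.0 * np.dot(X, Y.T) + YY
    # Round-off can leave tiny negative values; clip them before the sqrt.
    np.maximum(D2, 0.0, out=D2)
    return np.sqrt(D2)


def euclidean_distances_direct(X, Y):
    """Reference version: form every difference vector explicitly."""
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]  # shape (n, m, d)
    return np.sqrt((diff ** 2).sum(axis=-1))


rng = np.random.RandomState(0)
X = rng.random_sample((200, 50))
Y = rng.random_sample((150, 50))

D_fast = euclidean_distances_expanded(X, Y)
D_ref = euclidean_distances_direct(X, Y)
print(np.allclose(D_fast, D_ref))
```

With only a few columns the matrix product is tiny, so the extra passes for
the norms, the clipping and the square root are no longer amortised. The
expansion can also lose some precision when points lie very close together.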

However, it does depend on the dimensionality of the data:

In [9]: X = np.random.random((1000, 3))

In [10]: Y = np.random.random((1000, 3))

In [11]: %timeit sp.distance.cdist(X, Y)
100 loops, best of 3: 11.9 ms per loop

In [12]: %timeit pairwise.euclidean_distances(X, Y)
10 loops, best of 3: 35.4 ms per loop

and judging by David's question, he was probably operating with 3D data:

> > On Sat, Feb 26, 2011 at 02:32:59PM -0800, David Baddeley wrote:
> >> now got scipy running - you're going to want:

> >> dist_list = sp.distance.cdist(cluster_shifted, xyz.reshape((-1, 3)))

So, I must apologize, my answer was off-topic: David, you should probably be
using scipy.spatial.

Gael