[Numpy-discussion] Fastest distance matrix calc
Timothy Hochberg
tim.hochberg at ieee.org
Fri Apr 13 09:49:00 EDT 2007
On 4/13/07, Bill Baxter <wbaxter at gmail.com> wrote:
>
> I think someone posted some timings about this before but I don't recall.
>
> The task is to compute the matrix D from two sets of vectors x (M,d)
> and y (N,d).
> The output should be D where D[i,j] is norm(x[i]-y[j])
>
> The Matlab NetLab toolkit uses something like this to compute it:
>
> d2 = (x*x).sum(1)[:,numpy.newaxis] + (y*y).sum(1) - 2*mult(x,y.T)
>
> And then does
> d2[d2<0]=0
> because roundoff in the above can sometimes create negative values. And
> finally
> d = sqrt(d2)
>
> But it just doesn't seem like the best way to do it. Whatever was
> posted before I don't remember anything about a subtract.outer
> solution. Seems like something along the lines of
> (subtract.outer(x,y)**2).sum(axis=0)
> might be faster, and also avoid the need to check for negatives.
>
> I'll do some timings if no one's already done it. Just wanted to check
> first.
I'm going to go out on a limb and contend, without running any timings, that
for large M and N, a solution using a for loop will beat either of those.
For example (untested):
results = empty([M, N], float)
# You could be fancy and swap axes depending on which array is larger, but
# I'll leave that for someone else
for i, v in enumerate(x):
results[i] = sqrt(sum((v-y)**2, axis=-1))
Or something like that. The reason that I suspect this will be faster is
that it has better locality, completely finishing a computation on a
relatively small working set before moving onto the next one. The one liners
have to pull the potentially large MxN array into the processor repeatedly.
Or you could use numexpr.
--
//=][=\\
tim.hochberg at ieee.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070413/88ed0cd9/attachment.html>
More information about the NumPy-Discussion
mailing list