[Numpy-discussion] Fastest distance matrix calc

Keir Mierle mierle at gmail.com
Mon Apr 16 19:31:55 EDT 2007


On 4/13/07, Timothy Hochberg <tim.hochberg at ieee.org> wrote:
> On 4/13/07, Bill Baxter <wbaxter at gmail.com> wrote:
> > I think someone posted some timings about this before but I don't recall.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498246

[snip]
> I'm going to go out on a limb and contend, without running any timings, that
> for large M and N, a solution using a for loop will beat either of those.
> For example (untested):
>
>  results = empty([M, N], float)
> # You could be fancy and swap axes depending on which array is larger, but
> # I'll leave that for someone else
> for i, v in enumerate(x):
>     results[i] = sqrt(sum((v-y)**2, axis=-1))
>  Or something like that. The reason that I suspect this will be faster is
> that it has better locality, completely finishing a computation on a
> relatively small working set before moving onto the next one. The one liners
> have to pull the potentially large MxN array into the processor repeatedly.

In my experience, it is indeed the case that the for loop version is
faster. The fastest of the three versions offered in the above url is
the last:

from numpy import mat, zeros, newaxis
def calcDistanceMatrixFastEuclidean2(nDimPoints):
    nDimPoints = array(nDimPoints)
    n,m = nDimPoints.shape
    delta = zeros((n,n),'d')
    for d in xrange(m):
        data = nDimPoints[:,d]
        delta += (data - data[:,newaxis])**2
    return sqrt(delta)

This is easily extended to two different nDimPoints matricies.

Cheers,
Keir



More information about the NumPy-Discussion mailing list