[Numpy-discussion] Faster

Keith Goodman kwgoodman at gmail.com
Sun May 4 21:53:44 EDT 2008


On Sun, May 4, 2008 at 5:59 PM, Damian R. Eads <eads at lanl.gov> wrote:
> Hi,
>
>  Looks like a fun discussion; too bad I didn't join it earlier. My
>  first try at scipy-cluster was completely in Python. Like you, I
>  also tried to find the most efficient way to transform the distance
>  matrix when joining two clusters. Eventually my data sets became big
>  enough that I decided to write these parts in C. I don't think my
>  Python joining code was as efficient as yours.
>
>  I tried out your first test and I am a little confused at the output.

Thank you for catching that bug. It was introduced in the last
speedup: I pasted code from the list that used i as the index, but in
the cluster code I should delete the j-th row and column. After fixing
that, the output matches

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

I should convert that to a unit test.
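
Something like this might do as a starting point (just a sketch; it
assumes the listing below is saved as goodman.py, the module name you
used):

import unittest

import numpy as np

import goodman

class TestCluster(unittest.TestCase):
    "Check single linkage against the tutorial example."

    def test_single(self):
        label = ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
        dist = np.array([[0,    662,  877,  255,  412,  996],
                         [662,  0,    295,  468,  268,  400],
                         [877,  295,  0,    754,  564,  138],
                         [255,  468,  754,  0,    219,  869],
                         [412,  268,  564,  219,  0,    669],
                         [996,  400,  138,  869,  669,  0  ]])
        clust = goodman.Cluster(dist, label, linkage='single')
        clust.run()
        # MI and TO are the closest pair (138), so they merge first
        self.assert_(['MI', 'TO'] in clust.c[1])
        # After N-1 merges everything is in one cluster
        self.assertEqual(len(clust.c[-1]), 1)
        self.assertEqual(sorted(clust.c[-1][0]), sorted(label))

if __name__ == '__main__':
    unittest.main()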

>  You may wish to look at the dendrogram since it is easier to interpret:
>  http://www.soe.ucsc.edu/~eads/goodman-dendro.png

Nice.

>  I tried running your second test, and you'll see that C can give a
>  better speed-up, which is not surprising. Roughly, what I'm doing in
>  C is storing only the upper triangle of the distance matrix. An
>  array of double*'s (double **) refers to each row of this triangle.
>  To eliminate a row, I simply remove its entry from the double **
>  array. To remove a column, I shift the values over in each row of
>  the triangle. I'm not sure if this is the best approach but it is
>  certainly more efficient than vectorized Python, in terms of both
>  memory and computation.
>
>  In [107]: goodman.test2(1000)
>  n = 1000 took 22.10 seconds
>  In [108]: n=1000
>  In [109]: uu=numpy.random.rand(n*(n-1)/2)
>  In [110]: tic = time.time(); hcluster.single(uu); toc = time.time(); print toc-tic
>  4.57607889175

To be (almost) within a factor of 5 is great! But if I get anything
close to useful from my code, I'll move over to your
industrial-strength code.
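
Just to make sure I follow your layout: in Python terms I picture the
triangle as a ragged list of rows, where dropping point j means
deleting its row and shifting one entry out of each earlier row. A
rough sketch of that bookkeeping (my names, not your C code; np.delete
copies where C would shift in place):

import numpy as np

def triangle_rows(dist):
    "Upper triangle of a full distance matrix as a ragged list of rows."
    n = dist.shape[0]
    # Row k holds distances from point k to points k+1..n-1
    return [dist[k, k+1:].copy() for k in range(n - 1)]

def delete_point(rows, j):
    "Remove point j from the ragged triangle."
    # Point j's own row just disappears (like removing the double** entry)
    if j < len(rows):
        del rows[j]
    # Each earlier row k holds point j at offset j - k - 1, so shift
    # the entries after it left by one
    for k in range(j):
        rows[k] = np.delete(rows[k], j - k - 1)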

Here's the bug-fixed version:


import time

import numpy as np

class Cluster:
    "Hierarchical clustering with single or complete linkage"

    def __init__(self, dist, label=None, linkage='single', debug=False):
        """
        dist   Distance matrix, NxN numpy array
        label  Labels of each row of the distance matrix, list of N items,
               default is range(N)
        """
        assert dist.shape[0] == dist.shape[1], 'dist must be square (nxn)'
        assert (np.abs(dist - dist.T) < 1e-8).all(), 'dist must be symmetric'
        if label is None:
            label = range(dist.shape[0])
        assert dist.shape[0] == len(label), 'dist and label must match in size'
        msg = 'linkage must be single or complete'
        assert linkage in ('single', 'complete'), msg
        self.c = [[[z] for z in label]]
        self.label = label
        self.linkage = linkage
        self.dist = dist
        self.debug = debug

    def run(self):
        for level in xrange(len(self.label) - 1):
            i, j = self.min_dist()
            self.join(i, j)

    def join(self, i, j):

        assert i != j, 'Cannot combine a cluster with itself'

        # Join labels
        new = list(self.c[-1])
        new[i] = new[i] + new[j]
        new.pop(j)
        self.c.append(new)

        # Join distance matrix: the merged cluster's distance to each
        # remaining cluster is the min (single linkage) or max
        # (complete linkage) of the two old distances
        if self.linkage == 'single':
            self.dist[:,i] = self.dist[:,[i,j]].min(1)
        elif self.linkage == 'complete':
            self.dist[:,i] = self.dist[:,[i,j]].max(1)
        else:
            raise NotImplementedError, 'Unknown linkage method'
        # Keep the matrix symmetric
        self.dist[i,:] = self.dist[:,i]
        # A faster version of this code...
        #idx = range(self.dist.shape[1])
        #idx.remove(j)
        #self.dist = self.dist[:,idx]
        #self.dist = self.dist[idx,:]
        # ...is this...
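        # (out is a view onto the top-left (n-1) x (n-1) block of dist;
        # the three block assignments shift everything past row j and
        # column j up and to the left by one, dropping them in place)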
        out = self.dist[:-1,:-1]
        out[:j,j:] = self.dist[:j,j+1:]
        out[j:,:j] = self.dist[j+1:,:j]
        out[j:,j:] = self.dist[j+1:,j+1:]
        self.dist = out

        # Debug output
        if self.debug:
            print
            print len(self.c) - 1
            print 'Clusters'
            print self.c[-1]
            print 'Distance'
            print self.dist

    def min_dist(self):
        # A faster version of this code...
        # dist = self.dist + 1e10 * np.eye(self.dist.shape[0])
        # i, j = np.where(dist == dist.min())
        # return i[0], j[0]
        # ...is this:
        x = self.dist
        N = x.shape[0]
        # With complete linkage the min distance was sometimes on the
        # diagonal; I think it occurred on the last merge (one cluster
        # left). So mask the diagonal (x.flat[::N+1] strides along it)
        # with max + 1.
        x.flat[::N+1] = x.max() + 1
        # Convert the flat index of the minimum to (row, col)
        ij = x.argmin()
        i, j = divmod(ij, N)
        return i, j


def test():
    # Example from
    # home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
    label =          ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
    dist = np.array([[0,    662,  877,  255,  412,  996],
                     [662,  0,    295,  468,  268,  400],
                     [877,  295,  0,    754,  564,  138],
                     [255,  468,  754,  0,    219,  869],
                     [412,  268,  564,  219,  0,    669],
                     [996,  400,  138,  869,  669,  0  ]])
    clust = Cluster(dist, label, linkage='single', debug=True)
    clust.run()

def test2(n):
    # Random symmetric matrix to use as distances; the nonzero
    # diagonal is harmless because min_dist masks it out
    x = np.random.rand(n,n)
    x = x + x.T
    c = Cluster(x)
    t1 = time.time()
    c.run()
    t2 = time.time()
    print 'n = %d took %0.2f seconds' % (n, t2-t1)
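
To try it out, save the listing as goodman.py and run, for example:

In [1]: import goodman
In [2]: goodman.test()      # prints each merge of the tutorial example
In [3]: goodman.test2(500)  # times a random 500-point problem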


