List behaviours with Clustering Algorithm

Wed Jan 26 04:43:06 EST 2011

James Ravenscroft wrote:

>> > I can't run your code because you didn't make it standalone,
> Thanks for the heads up,  I've made a simple version of the clusterer
> which you can view on pastebin: http://pastebin.com/7HmAkmfj If you have
> time to look through my code I would be very grateful!
> 
> 
> 
>> > but in your case that is probably not enough.
>> > Try something along these lines:
>> >
>> > # untested
>> > while len(self.clusters) > 1:
>> >     c = self.clusters.pop()
>> >     # find closest index
>> >     for i, c2 in enumerate(self.clusters):
>> >         ...
>> >         if ...:
>> >             closest_index = i
>> >     closest = self.clusters.pop(closest_index)
>> >     tmp.append(c + closest)
>> > if self.clusters:
>> >    tmp.append(self.clusters[0])
> I had a go at implementing something along the lines above and I'm still
> getting very bizarre results. There does appear to be some sort of logic
> to it though, if you look at the graph chart, you can see that It seems
> to be doing the clustering right and then forgetting to remove the old
> groupings providing this "cloning" effect for some cluster groups.

I'm sorry I can't infer the intended algorithm from the code you provide. 
Perhaps you can give a short description in plain English?

However, here's the implementation of the algorithm you mention as described 
on wikipedia: 

http://en.wikipedia.org/wiki/Cluster_analysis#Agglomerative_hierarchical_clustering

Repeatedly merge the two closest clusters until there's only one left. 

To keep track of the two merged clusters I've added a "children" key to your 
node dictionaries. The intermediate states of the tree are also put into the 
levels variable (I suppose that's the purpose of your levels attribute).

The main difference to your code is that 

(1) list modifications occur only in the outer while loop, so both inner 
loops can become for-loops again.
(2) test all pairs before choosing the pair

    debug = True
    def cluster(self):
        '''Run the clustering operation on the files
        '''
        clusters = []
        #add a cluster for each track to clusters
        for song in self.music:
            clusters.append({'data' : [song]})
        levels = []
        while len(clusters) > 1:
            # calculate weights
            for c in clusters:
                c["weight"] = self.__getClusterWeight(c['data'])

            # find closest pair
            closestDelta = float('inf')
            closestPair = None
            for i, a in enumerate(clusters):
                for k, b in enumerate(clusters):
                    if i == k:
                        break
                    delta = abs(a['weight'] - b['weight'])
                    if delta < closestDelta:
                        closestPair = i, k
                        closestDelta = delta

            # merge closest pair
            hi, lo = closestPair
            left = clusters[lo]
            right = clusters.pop(hi)
            clusters[lo] = {"data": left["data"] + right["data"], 
"children": [left, right]}

            # keep track of the intermediate tree states
            levels.append(list(clusters))

            if self.debug:
                print ("stage %d" % len(levels)).center(40, "-")
                for c in clusters:
                    print c["data"]

        # store tree
        self.clusters = clusters
        self.levels = levels

Note that there's some potential for optimisation. For example, you could 
move the weight calculation out of the while-loop and just calculate the 
weight for the newly merged node.

Peter