List behaviours with Clustering Algorithm

Mon Jan 24 05:02:55 EST 2011

Dear All,

I am currently trying to write a simple Agglomerative Clustering
algorithm which sorts through my MP3 collection and uses associated
Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
some trouble with my algorithm and some tracks are ending up in multiple
clusters. I have a feeling this is because of (my lack of understanding
of) list  behaviours within Python. This is all purely a learning
exercise, so any input would be greatly appreciated

The actual clustering method can be seen described here. Each Cluster is
stored within the 'clusters' array and has two elements: the data
containing song metadata such as artist, title and most importantly
tags, and a weight: that is, the overall Euclidean weight of the cluster
(based on the song data within).

In theory, the algorithm runs in a loop until all clusters have been
merged into a hierarchy. It takes the weight of each cluster and merges
each cluster with their closest counterpart. My code has a lot of
comments so it should be fairly easy to understand.

Please see below:

#-----------------------------Code Snippet
-----------------------------------------------#

    def cluster(self):
        '''Run the clustering operation on the files
        '''

        #add a cluster for each track to clusters
        for song in self.music.keys():
            self.clusters.append({ 'data' : [song], 'weight' : long(0) })

        currentLevel = 0 #current level of hierarchy

        #while there is more than one cluster in the sorting bank
        # i.e. we haven't achieved hierachy yet, run the algorithm
        while( len(self.clusters) > 1):

            print "Currently at Level %d" % currentLevel

            print "There are %d clusters at this level" % len(self.clusters)

            #store the level in the levels array
            self.levels.append(self.clusters)

            #empty the clusters list
            self.clusters = []

            #find the weight of each current cluster
            for c in self.levels[currentLevel]:

                self.clusters.append({'data' : c['data'],
                                     'weight' :
self.__getClusterWeight(c['data'])})

            #do the actual clustering

            tmp = []

            for c in self.clusters:

                closestCluster = None
                closestDelta = float('inf')

                for c2 in self.clusters:

                    #skip if this is the same node
                    if(c == c2):
                        continue

                    delta = abs(c2['weight'] - c['weight'])

                    if(delta < closestDelta):
                        closestCluster = c2
                        closestDelta = delta

                print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
self.clusters.index(c),
                                                              'c2' :
self.clusters.index(closestCluster)}
                #now merge the two clusters
                self.clusters.remove(closestCluster)
                self.clusters.remove(c)

                c['data'].extend(closestCluster['data'])

                tmp.append(c)

            #increment the level of the hierarchy

            self.clusters = tmp

            currentLevel += 1

#--------------------------------End Code Snippet
----------------------------#

Thanks,

James

-- 
James Ravenscroft
http://blog.funkymonkeysoftware.com/