List behaviours with Clustering Algorithm

James Ravenscroft james at funkymonkeysoftware.com
Mon Jan 24 05:02:55 EST 2011


Dear All,

I am currently trying to write a simple Agglomerative Clustering
algorithm which sorts through my MP3 collection and uses associated
Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
some trouble with my algorithm and some tracks are ending up in multiple
clusters. I have a feeling this is because of (my lack of understanding
of) list  behaviours within Python. This is all purely a learning
exercise, so any input would be greatly appreciated

The actual clustering method can be seen described here. Each Cluster is
stored within the 'clusters' array and has two elements: the data
containing song metadata such as artist, title and most importantly
tags, and a weight: that is, the overall Euclidean weight of the cluster
(based on the song data within).

In theory, the algorithm runs in a loop until all clusters have been
merged into a hierarchy. It takes the weight of each cluster and merges
each cluster with their closest counterpart. My code has a lot of
comments so it should be fairly easy to understand.

Please see below:

#-----------------------------Code Snippet
-----------------------------------------------#

    def cluster(self):
        '''Run the clustering operation on the files
        '''
       
        #add a cluster for each track to clusters
        for song in self.music.keys():
            self.clusters.append({ 'data' : [song], 'weight' : long(0) })
           
        currentLevel = 0 #current level of hierarchy
           
        #while there is more than one cluster in the sorting bank
        # i.e. we haven't achieved hierachy yet, run the algorithm
        while( len(self.clusters) > 1):
           
            print "Currently at Level %d" % currentLevel
           
            print "There are %d clusters at this level" % len(self.clusters)
           
            #store the level in the levels array
            self.levels.append(self.clusters)
           
            #empty the clusters list
            self.clusters = []
           
            #find the weight of each current cluster
            for c in self.levels[currentLevel]:
               
                self.clusters.append({'data' : c['data'],
                                     'weight' :
self.__getClusterWeight(c['data'])})
               
               
               
            #do the actual clustering
           
            tmp = []
           
            for c in self.clusters:
               
                closestCluster = None
                closestDelta = float('inf')
               
                for c2 in self.clusters:
                   
                    #skip if this is the same node
                    if(c == c2):
                        continue
                   
                    delta = abs(c2['weight'] - c['weight'])
                   
                    if(delta < closestDelta):
                        closestCluster = c2
                        closestDelta = delta
                       
               
                print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
self.clusters.index(c),
                                                              'c2' :
self.clusters.index(closestCluster)}
                #now merge the two clusters
                self.clusters.remove(closestCluster)
                self.clusters.remove(c)
               
                c['data'].extend(closestCluster['data'])
               
                tmp.append(c)

            #increment the level of the hierarchy
           
            self.clusters = tmp
           
            currentLevel += 1

#--------------------------------End Code Snippet
----------------------------#

Thanks,

James

-- 
James Ravenscroft
http://blog.funkymonkeysoftware.com/




More information about the Python-list mailing list