List behaviours with Clustering Algorithm

Mon Jan 24 06:43:34 EST 2011

James Ravenscroft wrote:

> Dear All,
> 
> I am currently trying to write a simple Agglomerative Clustering
> algorithm which sorts through my MP3 collection and uses associated
> Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
> some trouble with my algorithm and some tracks are ending up in multiple
> clusters. I have a feeling this is because of (my lack of understanding
> of) list  behaviours within Python. This is all purely a learning
> exercise, so any input would be greatly appreciated
> 
> The actual clustering method can be seen described here. Each Cluster is
> stored within the 'clusters' array and has two elements: the data
> containing song metadata such as artist, title and most importantly
> tags, and a weight: that is, the overall Euclidean weight of the cluster
> (based on the song data within).
> 
> In theory, the algorithm runs in a loop until all clusters have been
> merged into a hierarchy. It takes the weight of each cluster and merges
> each cluster with their closest counterpart. My code has a lot of
> comments so it should be fairly easy to understand.

>             tmp = []
>            
>             for c in self.clusters:
>                
>                 closestCluster = None
>                 closestDelta = float('inf')
>                
>                 for c2 in self.clusters:
>                    
>                     #skip if this is the same node
>                     if(c == c2):
>                         continue
>                    
>                     delta = abs(c2['weight'] - c['weight'])
>                    
>                     if(delta < closestDelta):
>                         closestCluster = c2
>                         closestDelta = delta
>                        
>                
>                 print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
> self.clusters.index(c),
>                                                               'c2' :
> self.clusters.index(closestCluster)}
>                 #now merge the two clusters
>                 self.clusters.remove(closestCluster)
>                 self.clusters.remove(c)
>                
>                 c['data'].extend(closestCluster['data'])
>                
>                 tmp.append(c)

I can't run your code because you didn't make it standalone, but I believe 
that the problem (at least one of them) is that you iterate over 
self.clusters and modify it from within the loop. That's a big no-no in 
python. A simple example to demonstrate the effects:

>>> import random
>>> old = range(10)
>>> new = []
>>> for item in old:
...     closest = random.choice(old)
...     new.append((item, closest))
...     old.remove(item)
...     old.remove(closest)
...
>>> old
[3, 4]
>>> new
[(0, 8), (2, 1), (5, 7), (9, 6)]

The remedy is normally to iterate over a copy

for item in list(old):
   ...

but in your case that is probably not enough. 
Try something along these lines:

# untested
while len(self.clusters) > 1:
    c = self.clusters.pop()
    # find closest index
    for i, c2 in enumerate(self.clusters):
        ...
        if ...:
            closest_index = i
    closest = self.clusters.pop(closest_index)
    tmp.append(c + closest)
if self.clusters:
   tmp.append(self.clusters[0])

Peter