[SciPy-user] Scipy for Cluster Detection

Thu Dec 27 01:24:26 EST 2007

On 26-Dec-07, at 7:28 PM, Lorenzo Isella wrote:

> Dear All,
> I hope this will not sound too off-topic.
> Without going into details, I am simulating the behavior of a set of
> particles interacting with a short-ranged, strongly binding  
> potential. I
> start with a number N of particles which undergo collisions with each
> other and, in doing so, they stick giving rise to clusters.
> The code which does that is not written in Python, but it returns the
> positions of these particles in space.
> As time goes on, more and more particles coagulate around one of these
> clusters. Eventually, they will all end up in the same cluster.
> Two particles are considered to be bounded if their distance falls  
> below
> a certain threshold d.
> A cluster is nothing else than a group of particles all directly or
> indirectly bound together.

This doesn't sound like what would typically be considered a

Do you have code that produces the (N^2 - N)/2 distances between pairs  
of particles? How large is N? If it's small enough, turning it into a  
NumPy array should be straightforward, and then thresholding at d is  
one line of code. Then what you end up with is a boolean matrix that  
where the (i,j) entry denotes that i and j are in a cluster together.  
This can be thought of as a graph or network where the nodes are  
particles and connections / links / edges exist between those  
particles that are in a cluster together.

Once you have this your problem is what graph theorists would call  
identifying the "connected components", which is a well-studied  
problem and can be done fairly easily, Google "finding connected  
components" or pick up any algorithms textbook at your local library  
such as the Cormen et al "Introduction to Algorithms".

I'm not in a position to comment on whether there's something like  
this already in SciPy, as I really don't know. However, finding  
connected components of a graph (i.e. your clusters) is a very well  
studied problem in the computer science (namely graph theory)  
literature, and the algorithms are simple enough that less than 20  
lines of Python should do the trick provided the distance matrix is  
manageable.

Regards,

David