[SciPy-user] Looking for a way to cluster data

Mon Apr 27 05:56:55 EDT 2009

Hi Zach,

Many thanks for your thoughts. I've had a good go at this myself now and 
am getting some useful results, although the processing is a bit slow at 
the moment. The approach I took is as follows:
Using scipy.ndimage I label the 26-connected regions so that I can 
process each connected branch-like structure in turn. For each of these, 
I form a graph with nodes corresponding to pixels and edges 
corresponding to the Minimum Spanning Tree weighted by euclidean 
distance. I then use NetworkX to traverse this graph, extracting the 
node visitation order as a single list. Whenever this list makes jumps 
of distance >sqrt(2), it is jumping to a different branch, so I identify 
these jumps and subdivide the list into a list of lists, each containing 
nodes containing only one branch. I then plot these using plot3d.

Problems with this approach are that generating the Minimum Spanning 
Tree in pure Python is slow - It turns out for the example I'm trying 
that I have one large region containing 25000 pixels. For the moment 
I've ignored this region and am only dealing with the smaller region 
which are only up to a few hundred pixels. I may need to look at Cython 
for this if I pursue this approach.
I've made a few in-line responses to your comments.

Zachary Pincus wrote:
> Hi Gary,
<snip>
> It sounds like what you want from the clustering is to get groups of
> pixels that form straight-ish lines (and could thus be visualized
> with 3D rods). Is this correct?

Actually, I probably want curved lines so your later comments and 
suggestions about fitting poly-lines are what I may try next.

> Most clustering algorithms are likely
> to just give you groups of pixels that are nearby spatially -- which
> is probably not exactly what you want, since if you (say) have a long
> rod- structure, you'd want all the pixels in that rod grouped
> together, and not grouped with separate rods that cross nearby in
> space but aren't physically connected. So if you do want to cluster
> the pixels, you'll need to use the geodesic distance between two
> pixels, not the euclidian distance. But that still wouldn't give you
> sets of rods... more like rods and blobs at the junctions.

I'm not sure what you mean by geodesic distance here. It's true that I 
have unwanted connections due to the proximity of regions to one 
another, but without some very fancy code to detect these cases, I think 
I'll be satisfied with my rough approximate method. I know one way of 
doing this is to look at the behaviour of tangent vectors to the curves 
but this is impractical in pure Python and probably not worth the effort 
for my purposes.

> Another option would be to try to detect junction points on the 
> skeleton (I think this is a common operation -- sometimes in 2D
> people brute-force it by looking for all possible patterns of pixels
> indicative of branching, but there's probably a nicer way,
> especially for 3d). Once the junctions are known, a simple flood-fill
> along the skeleton gives you sets of pixels that lie between each
> junction.
> 
> Having such a connectivity graph is a useful thing -- lots of neat 
> stuff one can calculate from that. In fact, I think you'd need to do
> this to calculate geodesic distances anyway... Anyhow, you could
> then try fitting the points from each branch to a poly-line (using
> some information criterion for determining how many lines to use), to
> simplify the representation down to rods for plotting.

I'm planning to do this next. I'm hoping I can think of a very simple 
criterion.

> But perhaps it's worth figuring out if there's another visualization
> method you could use, before going to all this effort... Would
> volume rendering work?

It may. I haven't tried volume rendering yet, but rendering tubes is the 
ideal. I have tried just isosurfacing the detected pixels and this isn't 
so bad, but tubes give the eye extra depth cues that I'm hoping to 
exploit to make things look more informative.

> Perhaps you could go as far as
> junction-point-finding and branch-labeling. Just doing that would let
> you exclude short branches (or short terminal branches) and then you
> could simply volume- render the rest and not fiddle with
> line-fitting.

I'll think about this too, but I may try straight volume rendering first 
and see how it looks.

> Zach

Thanks for taking the time to respond.

regards,
Gary