[Numpy-discussion] (K-Mean) Clustering

Mon Jun 24 06:56:04 EDT 2002

Hi!

I've been looking for an implementation of k-means clustering in
Python, and haven't really found anything I could use... I believe
there is one in SciPy, but I'd rather keep the number of required
packages as low as possible (I'm already using Numeric/numarray), and
Orange seems a bit hard to install on UNIX... So, I've been experimenting
with Numeric/numarray for the purpose. Has anyone else done something
like this (or some other clustering algorithm for that matter)?

The approach I've been using (but am not completely finished with) is
to use a two-dimensional multiarray for the data (i.e. a "set" of
vectors) and a one-dimensional array with a cluster assignment for
each vector. E.g.

>>> data[42]
array([1, 2, 3, 4, 5])
>>> cluster[42]
10
>>> reps[10]
array([1, 2, 4, 5, 4])

Here reps[i] is the representative of cluster i, so reps[10] is the
representative of the cluster that data[42] is assigned to.

Using argmin it should be relatively easy to assign each vector to the
cluster with the closest representative (using sum((x-y)**2) as the
distance measure), but how do I calculate the new representatives
efficiently? (The representative of a cluster, e.g., 10, should be the
average of all vectors currently assigned to that cluster.) I could
always use a loop and then compress() the data based on cluster
number, but I'm looking for a way of calculating all the averages
"simultaneously", to avoid using a Python loop... I'm sure there's a
simple solution -- I just haven't been able to think of it yet. Any
ideas?
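
One loop-free possibility, sketched here under assumptions: it uses
modern NumPy rather than Numeric/numarray, and the function names
assign_clusters and update_reps are hypothetical, not part of any
library. The assignment step broadcasts the squared distances and takes
argmin; the update step builds a one-hot membership matrix and gets all
cluster sums from a single matrix product.

```python
import numpy as np

def assign_clusters(data, reps):
    """Return, for each row of data, the index of the closest row of reps."""
    # diff has shape (n_vectors, n_clusters, n_dims) via broadcasting.
    diff = data[:, np.newaxis, :] - reps[np.newaxis, :, :]
    # Squared Euclidean distance, sum((x - y)**2), for every pair.
    dist = (diff ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def update_reps(data, cluster, reps):
    """Return new representatives: the mean of the vectors in each cluster."""
    k = reps.shape[0]
    # onehot[i, j] is 1.0 if vector i belongs to cluster j, else 0.0.
    onehot = (cluster[:, np.newaxis] == np.arange(k)).astype(float)
    sums = onehot.T @ data        # shape (k, n_dims): per-cluster sums
    counts = onehot.sum(axis=0)   # shape (k,): per-cluster sizes
    # Assumption: an empty cluster keeps its old representative.
    new_reps = np.array(reps, dtype=float)
    nonempty = counts > 0
    new_reps[nonempty] = sums[nonempty] / counts[nonempty, np.newaxis]
    return new_reps
```

Calling assign_clusters and then update_reps repeatedly, until cluster
stops changing, gives the usual k-means iteration. The one-hot matrix
needs memory proportional to (number of vectors) x (number of clusters),
so this trades memory for the absence of a Python loop.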

--
Magnus Lie Hetland                                  The Anygui Project
http://hetland.org                                  http://anygui.org