[SciPy-User] kmeans

Fri Jul 23 16:29:57 EDT 2010

On Fri, Jul 23, 2010 at 3:18 PM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:

> On Jul 23, 2010, at 12:40 PM, Benjamin Root wrote:
> >
> > Just to be clear, the C Clustering library's implementation of kmeans is
> > entirely different from SciPy's implementation.  While I am certainly no
> > expert in determining which approach is better than another, I can say
> > that I have used it before and it has worked very nicely for me and my
> > uses.
>
> I am not sure the implementations are so different (possible bugs
> notwithstanding ;). The implementation in the C clustering library does the
> following:
>
> 1. Start with an initial guess of the cluster assignments
> 2. Compute means for each cluster
> 3. Assign each data point to the nearest cluster mean.
> 4. If the cost function did not decrease or the maximum number of
> iterations has been reached => exit
> 5. Go to 2.
>
>
From the C Clustering Library's documentation:

> The expectation-maximization (EM) algorithm is commonly used to find the
> partitioning into k groups. The first step in the EM algorithm is to create
> k clusters and randomly assign items (genes or microarrays) to them. We then
> iterate:
>   • Calculate the centroid of each cluster;
>   • For each item, determine which cluster centroid is closest;
>   • Reassign the item to that cluster.
> The iteration is stopped if no further item reassignments take place.
>

The C Clustering Library makes an initial guess of the assignments and
calculates the centroids of those assignments.
SciPy's kmeans makes an initial guess of the centroids and assigns the
observations to those centroid guesses.

It is a subtle difference, but it does result in different ways to solve the
problem.
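
To make the distinction concrete, here is a rough sketch of the two starting
points feeding the same iteration. It is plain Python written for
illustration only, not code taken from either library: the names lloyd,
nearest and centroids_of, the init keywords, and the handling of empty
clusters (reseeding from a random observation) are all assumptions of mine.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def centroids_of(data, labels, k, rng):
    """Mean of each cluster. An empty cluster is reseeded from a random
    observation; that choice is an assumption made for this sketch."""
    result = []
    for j in range(k):
        members = [x for x, lab in zip(data, labels) if lab == j]
        if members:
            result.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            result.append(tuple(rng.choice(data)))
    return result


def lloyd(data, k, init, max_iter=100, seed=0):
    """Hypothetical helper: k-means iteration with a selectable starting point."""
    rng = random.Random(seed)
    if init == "assignments":
        # Guess the assignments first, as described for the C Clustering Library.
        labels = [rng.randrange(k) for _ in data]
    elif init == "centroids":
        # Guess the centroids first (here: k distinct observations), as
        # described for SciPy, then assign observations to those guesses.
        guess = [tuple(x) for x in rng.sample(data, k)]
        labels = [nearest(x, guess) for x in data]
    else:
        raise ValueError("init must be 'assignments' or 'centroids'")

    centroids = centroids_of(data, labels, k, rng)
    for _ in range(max_iter):
        new_labels = [nearest(x, centroids) for x in data]
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
        centroids = centroids_of(data, labels, k, rng)
    return centroids, labels


# Two well-separated groups of made-up points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for init in ("assignments", "centroids"):
    centroids, labels = lloyd(data, 2, init)
    print(init, labels, centroids)
```

On easy data like this both starting points should end up at the same
partition. On harder data the iteration only finds a local optimum, so the
choice of starting point can change the answer.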

Ben Root