[SciPy-User] kmeans

Benjamin Root ben.root at ou.edu
Fri Jul 23 18:53:55 EDT 2010


On Fri, Jul 23, 2010 at 5:27 PM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:

> On Jul 23, 2010, at 2:55 PM, Benjamin Root wrote:
> > On Fri, Jul 23, 2010 at 4:18 PM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:
> >> Actually, it is not entirely clear to me anymore what the bug is. According
> >> to the k-means Wikipedia page, the objective function that the algorithm
> >> tries to minimize is the total intra-cluster variance (the sum of squares of
> >> distances of data points from cluster centroids). However, the two steps of
> >> the iteration (assignment to centroids, and centroid update) use regular
> >> distances and means. Is this not what the current code is doing?
> >
> > Which is why I have been saying that there is no bug here because the
> > code is technically correct.  A mean of regular distances is a sum of
> > squared distances that has been divided.  The only reason why the current
> > code is not returning the correct answer for the given example is that it
> > never tries 3 as a centroid value.  This is a different issue.
>
> I apologize if I am being obtuse, but why do you think the current code
> does not return the correct answer?
>
> >>> import numpy as np
> >>> from scipy import cluster
> >>> v = np.array([1,2,3,4,10],dtype=float)
> >>> cluster.vq.kmeans(v, 1)
> (array([ 4.]), 2.3999999999999999)
> >>> np.sum([abs(x-4)**2 for x in v])
> 50.0
> >>> np.sum([abs(x-3)**2 for x in v])
> 55.0
>
> The centroid 4 minimizes the sum of squared distances, which is what kmeans
> is supposed to find.
>
> Best,
>
>  Lutz
>
>
Right, sorry, I forgot that we already figured that out.  So, there is no
bug in this respect.
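
For reference, here is a minimal sketch of the check quoted above, assuming
only NumPy. The helper names sum_sq and mean_abs are made up for this
illustration and are not part of scipy.cluster.vq. It compares the mean (4)
and the median (3) of the example data under two measures: the sum of squared
distances, and the mean of plain distances, which appears to be what the 2.4
returned by kmeans above corresponds to.

```python
import numpy as np

# Example data from the quoted session.
v = np.array([1, 2, 3, 4, 10], dtype=float)

def sum_sq(c):
    """Sum of squared distances from every point to candidate centroid c."""
    return float(np.sum((v - c) ** 2))

def mean_abs(c):
    """Mean of plain (absolute) distances from every point to centroid c."""
    return float(np.mean(np.abs(v - c)))

mean_c = float(v.mean())        # 4.0, the arithmetic mean
median_c = float(np.median(v))  # 3.0, the median

# The mean wins on the sum of squares ...
print(sum_sq(mean_c), sum_sq(median_c))
# ... while the median wins on the mean of plain distances.
print(mean_abs(mean_c), mean_abs(median_c))
```

So 3 gives the smaller mean distance (2.2 versus 2.4), but 4 gives the smaller
sum of squares (50 versus 55), and the latter is the quantity k-means is
supposed to minimize.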

Ben Root