[SciPy-User] kmeans

Benjamin Root ben.root at ou.edu
Fri Jul 23 19:00:21 EDT 2010


On Fri, Jul 23, 2010 at 5:15 PM, Keith Goodman <kwgoodman at gmail.com> wrote:

> On Fri, Jul 23, 2010 at 2:55 PM, Benjamin Root <ben.root at ou.edu> wrote:
> > On Fri, Jul 23, 2010 at 4:18 PM, Lutz Maibaum <lutz.maibaum at gmail.com>
> > wrote:
> >>
> >> On Jul 23, 2010, at 1:12 PM, Keith Goodman wrote:
> >> > On Fri, Jul 23, 2010 at 1:01 PM, Benjamin Root <ben.root at ou.edu>
> wrote:
> >> >> On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <kwgoodman at gmail.com>
> >> >> wrote:
> >> >>> That's a good point. Are all these considered "bugs"?
> >> >>>
> >> >>> - Switch code and doc to use rmse
> >> >>> - Integer bug
> >> >>> - Select initial centroids without replacement
> >> >>
> >> >> My vote is yes, although I am not 100% convinced that the integer bug
> >> >> should
> >> >> be changed because it may cause breakage with those who have been
> >> >> depending
> >> >> on integer output.
> >> >
> >> > Maybe just make a ticket for now for the integer problem? Lutz, do you
> >> > want to make the ticket?
> >>
> >> I have opened a ticket (#1246). An easy and safe fix would be to simply
> >> add a statement to the docstring that warns the user about clustering
> >> integer data.
> >>
> >> > It would be nice to find a simple problem that gives the wrong
> >> > centroids due to the sum of dist bug. We could use that for a unit
> >> > test of the fix. The example given earlier in the thread returns the
> >> > right centroid. I guess we need a ticket for this one too.
> >>
> >>
> >> Actually, it not entirely clear to me anymore what the bug is. According
> >> to the k-means Wikipedia page, the objective function that the algorithm
> >> tries to minimize is the total intra-cluster variance (the sum of
> squares of
> >> distances of data points from cluster centroids). However, the two steps
> of
> >> the iteration (assignment to centroids, and centroid update) use regular
> >> distances and means. Is this not what the current code is doing?
> >>
> >
> > Which is why I have been saying that there is no bug here because the
> code
> > is technically correct.  A mean of regular distances is a sum of squared
> > distances that has been divided.  The only reason why the current code is
> > not returning the correct answer for the given example is that it never
> > tries 3 as a centroid value.  This is a different issue.
> >
> > [snip]
> >
> >> One issue I see is that the documentation mentions that k-means tries to
> >> minimize distorition, defined as the sum of distances, which (at least
> >> according to the Wikipedia page) is not correct, because it tries to
> >> minimize the sum of squared distances.
> >>
> >
> > Exactly... the documentation is wrong.  Kmeans works to minimize the sum
> of
> > squared distances.  How you define distances is up to you, so long as it
> > satisfies the triangle inequality.  And, as far as I can see, the code is
> > doing exactly this using euclidean distance.
>
> My understanding is that the problem is the stopping condition. Each
> iteration lowers the sum of squares but iteration stops when the mean
> non-squared distance is below a threshold.
>

The stopping condition uses the change in the distortion, not a non-squared
distance.  The distortion is already a sum of squares.  The only place that
a non-squared distance is used is in _py_vq_1d() which appears to be very
old code and it has a raise error at the very first statement.

I have gone ahead and started to modify the documentation for vq and
vq.kmeans in order to correct the mistakes and clarify the document.  This
is being done in the Doc Editor, as it should, so feel free to comment on my
changes.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20100723/3cb63da6/attachment.html>


More information about the SciPy-User mailing list