[SciPy-User] kmeans

Fri Jul 23 18:15:59 EDT 2010

On Fri, Jul 23, 2010 at 2:55 PM, Benjamin Root <ben.root at ou.edu> wrote:
> On Fri, Jul 23, 2010 at 4:18 PM, Lutz Maibaum <lutz.maibaum at gmail.com>
> wrote:
>>
>> On Jul 23, 2010, at 1:12 PM, Keith Goodman wrote:
>> > On Fri, Jul 23, 2010 at 1:01 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> >> On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <kwgoodman at gmail.com>
>> >> wrote:
>> >>> That's a good point. Are all these considered "bugs"?
>> >>>
>> >>> - Switch code and doc to use rmse
>> >>> - Integer bug
>> >>> - Select initial centroids without replacement
>> >>
>> >> My vote is yes, although I am not 100% convinced that the integer bug
>> >> should
>> >> be changed because it may cause breakage with those who have been
>> >> depending
>> >> on integer output.
>> >
>> > Maybe just make a ticket for now for the integer problem? Lutz, do you
>> > want to make the ticket?
>>
>> I have opened a ticket (#1246). An easy and safe fix would be to simply
>> add a statement to the docstring that warns the user about clustering
>> integer data.
>>
>> > It would be nice to find a simple problem that gives the wrong
>> > centroids due to the sum of dist bug. We could use that for a unit
>> > test of the fix. The example given earlier in the thread returns the
>> > right centroid. I guess we need a ticket for this one too.
>>
>>
>> Actually, it not entirely clear to me anymore what the bug is. According
>> to the k-means Wikipedia page, the objective function that the algorithm
>> tries to minimize is the total intra-cluster variance (the sum of squares of
>> distances of data points from cluster centroids). However, the two steps of
>> the iteration (assignment to centroids, and centroid update) use regular
>> distances and means. Is this not what the current code is doing?
>>
>
> Which is why I have been saying that there is no bug here because the code
> is technically correct.  A mean of regular distances is a sum of squared
> distances that has been divided.  The only reason why the current code is
> not returning the correct answer for the given example is that it never
> tries 3 as a centroid value.  This is a different issue.
>
> [snip]
>
>> One issue I see is that the documentation mentions that k-means tries to
>> minimize distorition, defined as the sum of distances, which (at least
>> according to the Wikipedia page) is not correct, because it tries to
>> minimize the sum of squared distances.
>>
>
> Exactly... the documentation is wrong.  Kmeans works to minimize the sum of
> squared distances.  How you define distances is up to you, so long as it
> satisfies the triangle inequality.  And, as far as I can see, the code is
> doing exactly this using euclidean distance.

My understanding is that the problem is the stopping condition. Each
iteration lowers the sum of squares but iteration stops when the mean
non-squared distance is below a threshold.