[SciPy-User] kmeans

Benjamin Root ben.root at ou.edu
Fri Jul 23 16:01:28 EDT 2010


On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <kwgoodman at gmail.com> wrote:

> On Fri, Jul 23, 2010 at 12:40 PM, Benjamin Root <ben.root at ou.edu> wrote:
> > On Fri, Jul 23, 2010 at 2:06 PM, Lutz Maibaum <lutz.maibaum at gmail.com>
> > wrote:
> >>
> >> On Fri, Jul 23, 2010 at 11:54 AM, Keith Goodman <kwgoodman at gmail.com>
> >> wrote:
> >> > On Fri, Jul 23, 2010 at 11:39 AM, Lutz Maibaum <lutz.maibaum at gmail.com>
> >> > wrote:
> >> >> To be compatible with the (at least to me!) standard use of k-means, I
> >> >> think both code and doc should use the sum of squared distances as the
> >> >> cost function in the optimization, and also as the return value.
> >> >
> >> > What about the thresh (threshold) input parameter? If the sum of
> >> > squares were used then the user would have to adjust the threshold for
> >> > the number of data points.
> >>
> >> That's true, but personally I don't find that much of a problem. Using
> >> an absolute threshold one needs to have some intuition about the
> >> magnitude of the cost function based on the type and amount of data.
> >> Alternatively, one could use a relative improvement as the convergence
> >> criterion, for example (something like "if
> >> (old_cost-new_cost)/old_cost < threshold then converged"), which may
> >> be suitable for a larger variety of clustering problems.
> >>
> >>  -- Lutz
> >
> > However, we wouldn't want to change the characteristic behavior of kmeans...
> > yet.
>
> That's a good point. Are all these considered "bugs"?
>
> - Switch code and doc to use rmse
> - Integer bug
> - Select initial centroids without replacement
>

My vote is yes, although I am not 100% convinced that the integer bug should
be fixed, because a fix may break code for those who have been depending
on integer output.
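
To make the threshold discussion above concrete, here is a rough sketch of
the two cost functions and of the relative-improvement test Lutz describes.
It assumes NumPy, and every name in it (nearest_sq_dists, sum_sq_cost,
rms_cost, converged_relative) is made up for illustration; none of this is
the scipy.cluster.vq code. I am also assuming "rmse" means the square root
of the mean squared distance to the nearest centroid.

```python
import numpy as np


def nearest_sq_dists(obs, centroids):
    """Squared Euclidean distance from each observation to its nearest centroid."""
    # Shape (n_obs, n_centroids, n_features) via broadcasting.
    diff = obs[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return (diff ** 2).sum(axis=2).min(axis=1)


def sum_sq_cost(obs, centroids):
    """Sum of squared distances: grows with the number of observations."""
    return nearest_sq_dists(obs, centroids).sum()


def rms_cost(obs, centroids):
    """Root-mean-square distance: roughly independent of the number of observations."""
    return np.sqrt(nearest_sq_dists(obs, centroids).mean())


def converged_relative(old_cost, new_cost, thresh=1e-5):
    """Relative-improvement test: (old_cost - new_cost) / old_cost < thresh."""
    if old_cost == 0:
        return True  # cost cannot improve any further
    return (old_cost - new_cost) / old_cost < thresh


# Four points, two centroids; every point is at distance 1 from its centroid.
obs = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# The same data repeated twice: the sum of squares doubles, the rms does not.
obs_doubled = np.vstack([obs, obs])
```

With an absolute threshold, the sum-of-squares cost has to be retuned when
the data set grows, which is Keith's objection. The relative test is a ratio
of two costs, so that scale factor cancels whichever cost is used.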

Ben Root
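
P.S. For the third item on the list, a minimal sketch of what "select
initial centroids without replacement" could look like. It assumes a recent
NumPy (np.random.default_rng), and pick_initial_centroids is a hypothetical
helper, not a function in scipy.cluster.vq.

```python
import numpy as np


def pick_initial_centroids(obs, k, rng=None):
    """Pick k distinct rows of obs to use as starting centroids."""
    rng = np.random.default_rng() if rng is None else rng
    if k > len(obs):
        raise ValueError("k cannot exceed the number of observations")
    # replace=False guarantees that no row index is drawn twice.
    idx = rng.choice(len(obs), size=k, replace=False)
    return obs[idx]
```

This only guarantees distinct row indices. If the data itself contains
duplicate observations, two starting centroids can still coincide.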