[SciPy-User] kmeans

Benjamin Root ben.root at ou.edu
Fri Jul 23 15:40:58 EDT 2010


On Fri, Jul 23, 2010 at 2:06 PM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:

> On Fri, Jul 23, 2010 at 11:54 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> > On Fri, Jul 23, 2010 at 11:39 AM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:
> >> To be compatible with the (at least to me!) standard use of k-means, I
> >> think both code and doc should use the sum of squared distances as the
> >> cost function in the optimization, and also as the return value.
> >
> > What about the thresh (threshold) input parameter? If the sum of
> > squares were used then the user would have to adjust the threshold for
> > the number of data points.
>
> That's true, but personally I don't find that much of a problem. Using
> an absolute threshold one needs to have some intuition about the
> magnitude of the cost function based on the type and amount of data.
> Alternatively, one could use a relative improvement as the convergence
> criterion (something like "if (old_cost-new_cost)/old_cost < threshold
> then converged"), which may be suitable for a larger variety of
> clustering problems.
>
>  -- Lutz
>
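The relative-improvement criterion quoted above could be sketched roughly as follows. This is only an illustration, not SciPy's code: the function names `sum_sq_cost` and `has_converged` and the parameter `rel_thresh` are made up for this example.

```python
import numpy as np

def sum_sq_cost(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = data - centroids[labels]
    return float(np.sum(diffs ** 2))

def has_converged(old_cost, new_cost, rel_thresh=1e-5):
    """Relative test: (old_cost - new_cost) / old_cost < rel_thresh.

    rel_thresh is a hypothetical parameter name, not SciPy's `thresh`.
    """
    if old_cost == 0.0:
        return True  # already a perfect fit; nothing left to improve
    return (old_cost - new_cost) / old_cost < rel_thresh
```

Because the test is a ratio of costs, multiplying the cost by a constant leaves the result unchanged, so the threshold would not need to be adjusted for the number of data points.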

However, we wouldn't want to change the characteristic behavior of kmeans... yet.

Personally, I never liked using tolerances and thresholds for stopping
conditions, which is why I like the C Clustering Library's approach of
iterating until there are no more reassignments (or until a maximum number
of iterations is reached).  I can't remember, though, how it handles the
edge case of points being reassigned back and forth between clusters.
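To make that stopping rule concrete, here is a minimal sketch of a plain k-means loop that stops when no point changes cluster, with an iteration cap as the fallback. This is not the C Clustering Library's code (which I have not reproduced here) and not SciPy's either; the function name `kmeans_until_stable` and its arguments are hypothetical.

```python
import numpy as np

def kmeans_until_stable(data, init_centroids, max_iter=100):
    """Iterate until no point is reassigned, or until max_iter passes.

    Returns (centroids, labels, number of passes performed).
    """
    data = np.asarray(data, dtype=float)
    centroids = np.array(init_centroids, dtype=float)  # copy, don't mutate input
    labels = np.full(len(data), -1)  # -1 means "not yet assigned"

    for iteration in range(1, max_iter + 1):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)

        # Stopping rule: no reassignments since the previous pass.
        if np.array_equal(new_labels, labels):
            return centroids, labels, iteration
        labels = new_labels

        # Move each centroid to the mean of its members; leave empty
        # clusters where they are.
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return centroids, labels, max_iter
```

In this sketch the `max_iter` cap is what guards against the back-and-forth case: if assignments never settle, the loop still terminates.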

Just to be clear, the C Clustering Library's implementation of kmeans is
entirely different from SciPy's implementation.  While I am certainly no
expert in determining which approach is better than another, I can say that
I have used the C Clustering Library before and it has worked very nicely
for me and my uses.

Ben Root

P.S. - As a complete side-note, while I am in this nostalgic fervor, a
particularly clever use of kmeans/kmedians that I came up with was to 'snap'
similar grids to a common grid without requiring one to predefine that grid.