[SciPy-User] kmeans

Thu Jul 22 15:23:47 EDT 2010

On Thu, Jul 22, 2010 at 12:12 PM, alex <argriffi at ncsu.edu> wrote:
>>
>>> For example:
>>>
>>> >>> import numpy as np
>>> >>> from scipy import cluster
>>> >>> v = np.array([1,2,3,4,10])
>>> >>> cluster.vq.kmeans(v, 1)
>>> (array([4]), 2.3999999999999999)
>>> >>> np.mean([abs(x-4) for x in v])
>>> 2.3999999999999999
>>> >>> np.mean([abs(x-3) for x in v])
>>> 2.2000000000000002
>>>
>>> The result of this kmeans call suggests that the center 4 is best with
>>> distortion 2.4.  In fact this is not the case because a center of 3
would
>>> have distortion 2.2.
>>>
>>
>> I wonder if this is really a bug in the minimization code rather than an
>> issue with the distortion measure itself.
>>
>> Ben Root
>
> The bug is in the _kmeans function in vq.py where it uses avg_dist[-2] -
> avg_dist[-1] <= thresh as a stopping condition.  This condition mistakenly
> assumes that the distortion monotonically decreases.  One consequence is
> that when the distortion increases, avg_dist[-2] - avg_dist[-1] will be
> negative, and the codebook and distortion associated with avg_dist[-1] are
> returned.  This is where the 2.4 vs 2.2 error comes from.
>
> I guess there could be a few ways to resolve the bug.  One way could be to
> use the sum of squares of distances instead of the distortion; this would
> guarantee that the error sequence monotonically decreases, and I suspect
> that this is what the author had originally intended.
>
> Another way to deal with the bug could be to report the second to last
> codebook and distortion instead of the last codebook and distortion when
the
> stopping condition is met.  This would probably fix the bug in the 2.2 vs.
> 2.4 example, but it is kind of a kludge; if the sequence does not
> monotonically decrease, then does it really make sense to use a small
change
> as a stopping condition?
>
> Alex
>

I have to search for some old code of mine, but if I remember correctly, it
would seem that the latter solution is actually a better fix because it
addresses the heart of the matter.  A minimization algorithm that doesn't
return the most minimum value that it knows is buggy.  The stopping
condition is another issue altogether.

There was a nice C-implemented, MIT-licensed kmeans algorithm that I used
several years ago that had a much smarter stopping condition and several
other useful features.  Let me see if I can find it and we can compare.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/0b1178de/attachment.html>