[SciPy-User] kmeans

Sun Jul 25 15:41:32 EDT 2010

On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> _kmeans chokes on large thresholds:
>
>>> import numpy as np
>>> from scipy import cluster
>>> v = np.array([1,2,3,4,10], dtype=float)
>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>   (array([ 4.]), 2.3999999999999999)
>>> cluster.vq.kmeans(v, 1, thresh=1e16)
> <snip>
> IndexError: list index out of range
>
> The problem is in these lines:
>
>    diff = thresh+1.
>    while diff > thresh:
>        <snip>
>        if(diff > thresh):
>
> If thresh is large then (thresh + 1) > thresh is False:
>
>>> thresh = 1e16
>>> diff = thresh + 1.0
>>> diff > thresh
>   False
>
> What's a use case for a large threshold? You might want to study the
> algorithm by seeing the result after one iteration (not to be confused
> with the iter input which is something else).
>
> One fix is to use 2*thresh instead of thresh + 1. But that just
> pushes the problem out to higher thresholds.

Or just use the spacing function, which returns the distance from thresh
to the next representable float, so that thresh + spacing(thresh) > thresh
(except for nan/inf).
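
A minimal sketch of that suggestion, assuming the spacing function meant
here is NumPy's np.spacing and that thresh is finite and non-negative.
The helper name initial_diff is hypothetical and is not part of
scipy.cluster.vq; it only illustrates how the starting value of diff
could be chosen so that the loop condition holds on the first pass.

```python
import numpy as np


def initial_diff(thresh):
    """Return a float strictly greater than thresh.

    Assumes thresh is finite and non-negative (hypothetical helper,
    not the actual scipy.cluster.vq code).
    """
    thresh = float(thresh)
    # np.spacing(x) is the gap between x and the next representable float,
    # so adding it always lands on a distinct, larger value.
    return thresh + np.spacing(thresh)


for thresh in (1e-5, 1.0, 1e15, 1e16, 1e300):
    naive_ok = (thresh + 1.0) > thresh          # the original approach
    spacing_ok = initial_diff(thresh) > thresh  # the spacing-based approach
    print(thresh, naive_ok, spacing_ok)
```

Unlike thresh + 1 or 2*thresh, this does not depend on the magnitude of
thresh, and it also works for thresh = 0, where 2*thresh > thresh is False.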

David