[SciPy-User] kmeans

Sun Jul 25 17:59:53 EDT 2010

On Mon, Jul 26, 2010 at 6:53 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape at gmail.com> wrote:
>> On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>> _kmeans chokes on large thresholds:
>>>
>>>>> from scipy import cluster
>>>>> v = np.array([1,2,3,4,10], dtype=float)
>>>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>>>   (array([ 4.]), 2.3999999999999999)
>>>>> cluster.vq.kmeans(v, 1, thresh=1e16)
>>> <snip>
>>> IndexError: list index out of range
>>>
>>> The problem is in these lines:
>>>
>>>    diff = thresh+1.
>>>    while diff > thresh:
>>>        <snip>
>>>        if(diff > thresh):
>>>
>>> If thresh is large then (thresh + 1) > thresh is False:
>>>
>>>>> thresh = 1e16
>>>>> diff = thresh + 1.0
>>>>> diff > thresh
>>>   False
>>>
>>> What's a use case for a large threshold? You might want to study the
>>> algorithm by seeing the result after one iteration (not to be confused
>>> with the iter input which is something else).
>>>
>>> One fix is to use 2*thresh instead for thresh + 1. But that just
>>> pushes the problem out to higher thresholds
>>
>> Or just use the spacing function, which by definition returns the
>> smallest number M such as thresh + M > thresh (except for nan/inf)
>
> Neat, I've never heard of np.spacing. But it suffers the same fate:
>
> Works:
>
>>> thresh = 1e16
>>> diff = thresh + np.spacing(thresh)
>>> diff > thresh
>   True
>
> Doesn't work:
>
>>> thresh = 1e400
>>> diff = thresh + np.spacing(thresh)
>>> diff > thresh
>   False

That's because 1e400 is inf for double precision numbers, and inf + N
> inf is never true :)

David