[SciPy-User] kmeans

Charles R Harris charlesr.harris at gmail.com
Thu Jul 22 13:07:53 EDT 2010


On Thu, Jul 22, 2010 at 10:24 AM, Benjamin Root <ben.root at ou.edu> wrote:

> On Thu, Jul 22, 2010 at 10:42 AM, alex <argriffi at ncsu.edu> wrote:
>
>> On Thu, Jul 22, 2010 at 10:48 AM, Benjamin Root <ben.root at ou.edu> wrote:
>>
>>> On Wed, Jul 21, 2010 at 3:25 PM, alex <argriffi at ncsu.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> I want to nitpick about the scipy kmeans clustering implementation.
>>>> Throughout the documentation
>>>> http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code, the
>>>> 'distortion' of a clustering is defined as "the sum of the distances between
>>>> each observation vector and its dominating centroid."  I think that the sum
>>>> of squares of distances should be used instead of the sum of distances, and
>>>> all of the miscellaneous kmeans descriptions I found with google would seem
>>>> to support this.
>>>>
>>>> For example if one cluster contains the 1D points (1, 2, 3, 4, 10) and
>>>> the old center is 3, then the centroid updating step will move the centroid
>>>> to 4.  This step reduces the sum of squares of distances from 55 to 50, but
>>>> it increases the distortion from 11 to 12.
>>>>
>>>> Alex
>>>>
>>>
>>> Every implementation of kmeans (except for SciPy's) that I have seen
>>> allowed for the user to specify which distance measure they want to use.
>>> There is no right answer for a distance measure except for "it depends".
>>> Maybe SciPy's implementation should be updated to allow for user-specified
>>> distance measures (e.g. - absolute, euclidian, city-block, etc.)?
>>>
>>> Ben Root
>>>
>>
>> While the best distance might depend on the application, I think that of
>> all these distance measures only the sum of squares of Euclidean distances
>> is guaranteed to monotonically decrease at each step of the algorithm.  If
>> the scipy kmeans implementation depends on this monotonicity, which I think
>> it currently does, then this assumption could be the source of subtle bugs
>> when error measures like distortion are used.
>>
>> Maybe I can nitpick more effectively if I frame it as "scipy is doing
>> something strange and possibly buggy" instead of as "scipy is not using the
>> distance function I want".
>>
>>
> It has been a while since I played with kmeans, but I believe that the
> distance measure merely has to satisfy Minkowski's inequality which is that
> the norm of a + b <= norm of a + norm of b (or was it the Cauchy-Schwarz
> inequality?)
>

That's the triangle inequality and is a required property of anything called
a norm.

<snip>

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/d7199cdd/attachment.html>


More information about the SciPy-User mailing list