[SciPy-User] kmeans and initial centroid guesses

Sun Dec 27 20:47:45 EST 2009

On Mon, Dec 28, 2009 at 10:37 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> The kmeans function has two modes. In one of the modes the initial
> guesses for the centroids are randomly selected from the input data.
> The selection is currently done with replacement:
>
> guess = take(obs, randint(0, No, k), 0)
>
> That means some of the centroids in the initial guess might be the
> same. Wouldn't it be better to select without replacement?

I think you are right, but sampling without replacement only partly
helps with floating point data: it guarantees distinct observations,
yet if two of the selected values are different but very close, you
would see much the same effect, right?
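
For reference, selecting without replacement could look something like
the sketch below. The helper name and the seed argument are made up for
illustration, and this is not what kmeans currently does; it only
assumes NumPy's Generator.choice with replace=False.

```python
import numpy as np

def initial_guess_without_replacement(obs, k, seed=None):
    """Pick k distinct rows of obs as initial centroids (hypothetical helper)."""
    obs = np.asarray(obs, dtype=float)
    n = obs.shape[0]
    if k > n:
        raise ValueError("k cannot exceed the number of observations")
    rng = np.random.default_rng(seed)
    # replace=False guarantees distinct row indices, not distinct values
    idx = rng.choice(n, size=k, replace=False)
    return obs[idx]
```

Note that this only makes the indices distinct. If obs contains rows such
as [1.0, 2.0] and [1.0, 2.0 + 1e-12], both can still be picked, which is
the near-duplicate problem described above.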

Generally, for clustering algorithms, I think you'd want to start
with centroids as far from each other as possible, so maybe the code
could be improved by taking this into account.
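
One way to get well-separated starting centroids is a greedy
"farthest-first" pass: pick one observation at random, then repeatedly
add the observation that is farthest from everything picked so far. A
minimal sketch, assuming Euclidean distance; the function name is
hypothetical and this is only one possible heuristic, not an existing
SciPy routine:

```python
import numpy as np

def farthest_first_guess(obs, k, seed=None):
    """Greedy farthest-first choice of k initial centroids (hypothetical helper)."""
    obs = np.asarray(obs, dtype=float)
    if obs.ndim == 1:
        obs = obs[:, None]  # treat 1-D input as n observations of one feature
    n = obs.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observations")
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(n))]  # first centroid: a random observation
    # dist[i] = distance from observation i to its nearest chosen centroid
    dist = np.linalg.norm(obs - obs[chosen[0]], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))  # observation farthest from all chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(obs - obs[nxt], axis=1))
    return obs[chosen]
```

The result could then be passed as the second argument of
scipy.cluster.vq.kmeans, which accepts an array of initial centroids in
place of k. One caveat: always taking the farthest point tends to select
outliers, so it may behave poorly on noisy data.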

cheers,

David