[SciPy-User] kmeans and initial centroid guesses

David Cournapeau cournape at gmail.com
Sun Dec 27 20:47:45 EST 2009


On Mon, Dec 28, 2009 at 10:37 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> The kmeans function has two modes. In one of the modes the initial
> guesses for the centroids are randomly selected from the input data.
> The selection is currently done with replacement:
>
> guess = take(obs, randint(0, No, k), 0)
>
> That means some of the centroids in the initial guess might be the
> same. Wouldn't it be better to select without replacement?

I think you are right, but random sampling without replacement only
helps so much with floating point values here: if two values are
different but very close, you would see the same effect, right?
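
To make the concern concrete, here is a minimal sketch of sampling
without replacement. It assumes obs is an (N, d) array and uses NumPy's
Generator.choice on row indices; guess_without_replacement is a
hypothetical helper name, not the code in scipy. The chosen rows are
distinct by index, but two of them can still be numerically almost
identical:

```python
import numpy as np

def guess_without_replacement(obs, k, seed=None):
    """Pick k distinct rows of obs as initial centroids (hypothetical helper)."""
    obs = np.asarray(obs, dtype=float)
    n = obs.shape[0]
    if k > n:
        raise ValueError("k cannot exceed the number of observations")
    rng = np.random.default_rng(seed)
    # Sample row indices, not values, so no row is picked twice.
    idx = rng.choice(n, size=k, replace=False)
    return obs[idx]

# Two rows that are different but nearly identical.
obs = np.array([[0.0, 0.0], [0.0, 1e-12], [5.0, 5.0]])
guess = guess_without_replacement(obs, 3, seed=0)

# Smallest distance between any two chosen centroids.
pairwise = np.linalg.norm(guess[:, None, :] - guess[None, :, :], axis=-1)
min_gap = pairwise[~np.eye(len(guess), dtype=bool)].min()
```

With k equal to the number of rows, every row is selected, so min_gap
is tiny even though no row is repeated.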

Generally, for clustering algorithms, I think you'd want to start
with centroids as far from each other as possible, so maybe the code
could be improved by taking this into account.
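
One way this could look is a greedy "farthest point" heuristic: start
from a random observation, then repeatedly add the observation that is
farthest from the centroids chosen so far. This is only a sketch under
the assumption that obs is an (N, d) array; farthest_point_guess is a
hypothetical name, not an existing scipy function, and the heuristic is
known to favour outliers:

```python
import numpy as np

def farthest_point_guess(obs, k, seed=None):
    """Greedy maximin choice of k rows of obs as initial centroids (sketch)."""
    obs = np.asarray(obs, dtype=float)
    n = obs.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observations")
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(n))]  # first centroid: a random row
    # dist[i] = distance from row i to its nearest chosen centroid so far
    dist = np.linalg.norm(obs - obs[chosen[0]], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))  # row farthest from all chosen rows
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(obs - obs[nxt], axis=1))
    return obs[chosen]

obs = np.array([[0.0, 0.0], [0.0, 1e-12], [5.0, 5.0], [10.0, 0.0]])
guess = farthest_point_guess(obs, 3, seed=0)

# Smallest distance between any two chosen centroids.
pairwise = np.linalg.norm(guess[:, None, :] - guess[None, :, :], axis=-1)
min_gap = pairwise[~np.eye(len(guess), dtype=bool)].min()
```

On this toy data the two nearly identical rows never both end up in
the guess, whichever row is drawn first.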

cheers,

David


