[SciPy-Dev] kmeans with weights to scipy.cluster

Jiri Krtek Jiri.Krtek at rsj.com
Thu Apr 24 05:07:03 EDT 2014


I only added a few lines to the kmeans2 and _kmeans2 functions. In kmeans2 the only change is a new optional parameter, weights. _kmeans2 gets the same parameter, plus a check for whether weights is None and the two lines that follow it. I also imported average from numpy.

Here it is:

def kmeans2(data, k, iter=10, thresh=1e-5, minit='random', missing='warn', weights=None):

    if missing not in _valid_miss_meth:
        raise ValueError("Unknown missing method: %s" % str(missing))
    # If data is rank 1, then we have 1 dimension problem.
    nd = ndim(data)
    if nd == 1:
        d = 1
        # raise ValueError("Input of rank 1 not supported yet")
    elif nd == 2:
        d = data.shape[1]
    else:
        raise ValueError("Input of rank > 2 not supported")

    if size(data) < 1:
        raise ValueError("Input has 0 items.")

    # If k is not a single value, then it should be compatible with data's
    # shape
    if size(k) > 1 or minit == 'matrix':
        if not nd == ndim(k):
            raise ValueError("k is not an int and does not have the same rank as data")
        if d == 1:
            nc = len(k)
        else:
            (nc, dc) = k.shape
            if not dc == d:
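                # Adjacent string literals avoid embedding the continuation
                # line's indentation in the error message.
                raise ValueError("k is not an int and does not have the same "
                                 "number of columns as data")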
        clusters = k.copy()
    else:
        try:
            nc = int(k)
        except TypeError:
            raise ValueError("k (%s) could not be converted to an integer " % str(k))

        if nc < 1:
            raise ValueError("kmeans2 for 0 clusters ? (k was %s)" % str(k))

        if not nc == k:
            warn("k was not an integer, was converted.")
        try:
            init = _valid_init_meth[minit]
        except KeyError:
            raise ValueError("unknown init method %s" % str(minit))
        clusters = init(data, k)

    if int(iter) < 1:
        raise ValueError("iter = %s is not valid.  iter must be a positive integer." % iter)

    return _kmeans2(data, clusters, iter, nc, _valid_miss_meth[missing], weights)


def _kmeans2(data, code, niter, nc, missing, weights=None):
    """ "raw" version of kmeans2. Do not use directly.

    Run k-means with a given initial codebook.

    """
    for i in range(niter):
        # Compute the nearest neighbour for each obs
        # using the current code book
        label = vq(data, code)[0]
        # Update the code by computing centroids using the new code book
        for j in range(nc):
            mbs = where(label == j)
            if mbs[0].size > 0:
                if weights is not None:
                    code[j] = average(data[mbs], axis=0, weights=weights[mbs])
                else:
                    code[j] = mean(data[mbs], axis=0)
            else:
                missing()

    return code, label
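
To make the effect of the weighted update concrete, here is a minimal, self-contained sketch of one assignment + update pass. It is only an illustration under stated assumptions: the helper name weighted_kmeans_step is hypothetical (not part of scipy.cluster), the nearest-centroid assignment is done with plain NumPy instead of vq, and weights is assumed to be a 1-D array with one positive entry per observation.

```python
import numpy as np


def weighted_kmeans_step(data, code, weights):
    """One k-means pass with a weighted update (hypothetical helper)."""
    # Assignment step: unchanged from ordinary k-means, nearest centroid wins.
    dist = ((data[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
    label = dist.argmin(axis=1)
    # Update step: weighted average of the members instead of a plain mean.
    new_code = code.copy()
    for j in range(code.shape[0]):
        members = label == j
        if members.any():
            new_code[j] = np.average(data[members], axis=0,
                                     weights=weights[members])
    return new_code, label


data = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
weights = np.array([1.0, 3.0, 1.0, 1.0])  # second point counts three times
code = np.array([[0.0, 0.0], [10.0, 0.0]])
new_code, label = weighted_kmeans_step(data, code, weights)
```

With these inputs the first centroid moves to x = 0.75 rather than the unweighted 0.5, because the point at x = 1 carries weight 3. Note that numpy.average raises ZeroDivisionError when the weights of a cluster sum to zero, so all-zero weights within a cluster would need separate handling.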


I haven't changed the creation of the initial centroids yet, but it should be adapted as well. For example, when minit='points', each point should be selected with probability proportional to its weight (a point whose weight is much higher than the others should end up near a center, or even be the center and sit alone in its cluster). When minit='random', the mean and cov computed in the _krandinit function should take the weights into account.
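
A rough sketch of what such weighted initialization could look like follows. Both helper names (weighted_points_init and weighted_mean_cov) are hypothetical and not part of scipy.cluster; the sketch assumes 2-D data, a 1-D array of positive weights, and that normalizing the covariance by the sum of the weights (ddof=0) is acceptable.

```python
import numpy as np


def weighted_points_init(data, k, weights, rng=None):
    """Pick k distinct observations, each chosen with probability
    proportional to its weight (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()  # turn weights into selection probabilities
    idx = rng.choice(data.shape[0], size=k, replace=False, p=p)
    return data[idx].copy()


def weighted_mean_cov(data, weights):
    """Weighted mean and covariance that a 'random' init could sample
    from (hypothetical helper)."""
    mu = np.average(data, axis=0, weights=weights)
    cov = np.cov(data, rowvar=False, aweights=weights, ddof=0)
    return mu, cov


data = np.array([[0.0, 0.0], [1.0, 2.0], [10.0, 1.0], [11.0, 3.0]])
weights = np.array([1.0, 3.0, 1.0, 1.0])
init = weighted_points_init(data, 2, weights, rng=np.random.default_rng(0))
mu, cov = weighted_mean_cov(data, weights)
```

The resulting init array could then be passed as k together with minit='matrix', which the kmeans2 code above treats as a ready-made initial codebook.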

Jiri

From: scipy-dev-bounces at scipy.org [mailto:scipy-dev-bounces at scipy.org] On Behalf Of Richard Tsai
Sent: Thursday, April 24, 2014 7:51 AM
To: SciPy Developers List
Subject: Re: [SciPy-Dev] kmeans with weights to scipy.cluster


2014-04-23 19:38 GMT+08:00 Jiri Krtek <Jiri.Krtek at rsj.com>:
Hi all,

I need k-means to take weights into account. Specifically, the first (assignment) step of the algorithm stays the same, but the second (update) step uses a weighted average instead of a simple average. I searched for a Python implementation of this kind of algorithm without success, so I hacked the kmeans2 function from scipy.cluster a little. Does this sound interesting? Would you add it to the scipy.cluster module?

Regards,
Jiri

Hi Jiri,

That sounds interesting. Could you provide some more details about your implementation? You can also just post your code on GitHub. Besides, it might be better to provide this feature in the `kmeans` function as well :)

Richard