[SciPy-Dev] kmeans with weights to scipy.cluster

Richard Tsai richard9404 at gmail.com
Fri Apr 25 04:21:02 EDT 2014


2014-04-24 17:07 GMT+08:00 Jiri Krtek <Jiri.Krtek at rsj.com>:

> I only added a few lines to the kmeans2 and _kmeans2 functions. In
> kmeans2 I added only a new optional parameter, weights. I did the same in
> _kmeans2, where I also added the test for whether weights is None and the
> two lines that follow it. I also imported average from numpy.
>

kmeans2 should set the weights to all ones when None is given, so that the
branch inside the loop in _kmeans2 can be avoided. Besides, kmeans2 should
also check the size of weights.
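
A minimal sketch of that idea, assuming one weight per observation. The
helper names prepare_weights and update_centroids are hypothetical and are
not part of scipy.cluster.vq; only the NumPy calls are real:

```python
import numpy as np


def prepare_weights(data, weights=None):
    """Return a 1-D float array holding one weight per observation.

    Hypothetical helper: kmeans2 could call something like this once,
    before handing control to _kmeans2.
    """
    n_obs = np.shape(data)[0]
    if weights is None:
        # Uniform weights make the weighted average equal to the plain mean,
        # so the update loop needs no "weights is None" branch.
        return np.ones(n_obs)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (n_obs,):
        raise ValueError("weights must have shape (%d,), got %s"
                         % (n_obs, weights.shape))
    return weights


def update_centroids(data, code, label, weights):
    """One centroid update that always uses the weighted average."""
    new_code = np.array(code, dtype=float)
    for j in range(len(new_code)):
        mbs = np.where(label == j)[0]
        if mbs.size > 0:
            new_code[j] = np.average(data[mbs], axis=0, weights=weights[mbs])
        # Empty clusters are left unchanged here; the real code calls
        # missing() at this point.
    return new_code
```

Note that np.average raises ZeroDivisionError when the weights of a
cluster's members sum to zero, so that case may need its own check.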


>
> Here it is:
>
> def kmeans2(data, k, iter=10, thresh=1e-5, minit='random', missing='warn',
>             weights=None):
>     if missing not in _valid_miss_meth:
>         raise ValueError("Unkown missing method: %s" % str(missing))
>     # If data is rank 1, then we have 1 dimension problem.
>     nd = ndim(data)
>     if nd == 1:
>         d = 1
>         # raise ValueError("Input of rank 1 not supported yet")
>     elif nd == 2:
>         d = data.shape[1]
>     else:
>         raise ValueError("Input of rank > 2 not supported")
>
>     if size(data) < 1:
>         raise ValueError("Input has 0 items.")
>
>     # If k is not a single value, then it should be compatible with data's
>     # shape
>     if size(k) > 1 or minit == 'matrix':
>         if not nd == ndim(k):
>             raise ValueError("k is not an int and has not same rank than data")
>         if d == 1:
>             nc = len(k)
>         else:
>             (nc, dc) = k.shape
>             if not dc == d:
>                 raise ValueError("k is not an int and has not same rank than\
>                         data")
>         clusters = k.copy()
>     else:
>         try:
>             nc = int(k)
>         except TypeError:
>             raise ValueError("k (%s) could not be converted to an integer " % str(k))
>
>         if nc < 1:
>             raise ValueError("kmeans2 for 0 clusters ? (k was %s)" % str(k))
>
>         if not nc == k:
>             warn("k was not an integer, was converted.")
>         try:
>             init = _valid_init_meth[minit]
>         except KeyError:
>             raise ValueError("unknown init method %s" % str(minit))
>         clusters = init(data, k)
>
>     if int(iter) < 1:
>         raise ValueError("iter = %s is not valid.  iter must be a positive integer." % iter)
>
>     return _kmeans2(data, clusters, iter, nc, _valid_miss_meth[missing],
>                     weights)
>
>
> def _kmeans2(data, code, niter, nc, missing, weights=None):
>     """ "raw" version of kmeans2. Do not use directly.
>
>     Run k-means with a given initial codebook.
>
>     """
>     for i in range(niter):
>         # Compute the nearest neighbour for each obs
>         # using the current code book
>         label = vq(data, code)[0]
>         # Update the code by computing centroids using the new code book
>         for j in range(nc):
>             mbs = where(label == j)
>             if mbs[0].size > 0:
>                 if weights is not None:
>                     code[j] = average(data[mbs], axis=0, weights=weights[mbs])
>                 else:
>                     code[j] = mean(data[mbs], axis=0)
>             else:
>                 missing()
>
>     return code, label
>
> I haven't adapted the creation of the initial centroids yet, but it should
> be adapted. For example, when minit='points', each point should be selected
> with a probability given by its weight (because when a point has a much
> higher weight than the others, it should be near the center, or even be
> the center and sit alone in its cluster). If minit='random', then the mean
> and cov in the _krandinit function should be affected by the weights.
>

I think the minit='points' case can be implemented with np.random.choice.
As for the minit='random' case, I am not so sure. Maybe scale the
observation matrix according to weights before passing to _krandinit?
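
A rough sketch of both cases, under stated assumptions. The function names
init_points_weighted and weighted_moments are hypothetical and not part of
scipy. The first uses np.random.choice as suggested above. The second
follows the quoted idea of letting the weights affect the mean and
covariance, rather than scaling the observation matrix, and relies on the
aweights argument of np.cov, which needs NumPy 1.10 or later:

```python
import numpy as np


def init_points_weighted(data, k, weights, rng=None):
    """Pick k distinct observations as initial centroids.

    Each observation is drawn with probability proportional to its weight.
    rng may be a np.random.RandomState; the global np.random is the default.
    """
    rng = np.random if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    p = weights / weights.sum()          # choice() needs probabilities
    idx = rng.choice(len(data), size=k, replace=False, p=p)
    return np.asarray(data)[idx].copy()


def weighted_moments(data, weights):
    """Weighted mean and covariance of a 2-D observation matrix.

    These could stand in for the plain mean and cov that a random
    initialisation draws its centroids from.
    """
    data = np.asarray(data, dtype=float)
    mu = np.average(data, axis=0, weights=weights)
    cov = np.atleast_2d(np.cov(data, rowvar=False, aweights=weights))
    return mu, cov
```

With replace=False, np.random.choice needs at least k observations with
non-zero weight and raises ValueError otherwise.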


>
> Jiri
>

Consider making a pull request on GitHub?

Richard