[SciPy-Dev] GSoC Draft Proposal: Rewrite and improve cluster package in Cython

Tue Mar 18 05:03:30 EDT 2014

2014-03-18 5:29 GMT+08:00 Ralf Gommers <ralf.gommers at gmail.com>:

>
> On Mon, Mar 17, 2014 at 9:31 AM, Richard Tsai <richard9404 at gmail.com>wrote:
>
>> Hi,
>>
>> I looked at several ML packages and I found that ELKI has implemented a
>> optimized single linkage algorithm called SLINK[1][2]. And I also found a
>> similar algorithm called CLINK[3], which is for complete linkage. It seems
>> that these two algorithms are much faster and use less memory than the
>> naive algorithms we are using in cluster.hierarchy currently.
>>
>
> Looks like ELKI as a mix of licenses including BSD, but the default is
> AGPL. Did you check for this algorithm? Not entirely clear from the above
> if you planned to integrate or reimplement it, for the former it matters.
>

The SLINK java implementation in ELKI should be under AGPL but these two
algorithms were published decades ago and should be not patented.

>
>
>> I also read some IPython notebooks and StackOverflow posts recently and I
>> found that many people are discussing how to plot a heatmap of hierarchical
>> clustering. I think if we integrate it into cluster.hierarchy, it will be a
>> good complement to hierarchy.dendrogram.
>>
>
> Assuming that it doesn't take too much time to implement (plotting
> shouldn't be a focus), that sounds fine. There are some more functions in
> several packages that optionally use MPL.
>
>
>> Besides, I noticed that the cluster package is single-threaded currently.
>> I don't know if parallelization in scipy level rather than BLAS level is
>> proper,
>>
>
> That has been out of scope for scipy until now.
>
>
>>  but at least we can just make use of the BLAS library (if it supports)
>> to parallelize the kmeans algorithm.
>>
>
> Are you talking about the for-loop over observations in vq()? I don't see
> any linalg going on in kmeans.
>

We can expand the formula when calculating the distances then make use of
some BLAS functions. I've just written a demo of it:
https://gist.github.com/richardtsai/9614846 (written casually, a bit messy)
It runs about 16x faster than the original C version in a 100000x500
dataset with k = 30. (Built with a thread-enabled ATLAS)

In [30]: X.shape
Out[30]: (100000, 500)

In [31]: c.shape
Out[31]: (30, 500)

In [32]: %timeit _vq.vq(X, c)
1 loops, best of 3: 1.28 s per loop

In [33]: %timeit _vq_rewrite.vq(X, c)
10 loops, best of 3: 79.6 ms per loop

>
> Ralf
>
>
>
>>
>> [1]:
>> http://elki.dbs.ifi.lmu.de/releases/release0.6.0/doc/de/lmu/ifi/dbs/elki/algorithm/clustering/hierarchical/SLINK.html
>> [2]: http://www.cs.ucsb.edu/~veronika/MAE/SLINK_sibson.pdf
>> [3]: http://comjnl.oxfordjournals.org/content/20/4/364.abstract
>>
>> Regards,
>> Richard
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20140318/442e26eb/attachment.html>