[SciPy-Dev] GSoC Draft Proposal: Rewrite and improve cluster package in Cython

Tue Mar 18 17:38:33 EDT 2014

On Tue, Mar 18, 2014 at 10:03 AM, Richard Tsai <richard9404 at gmail.com>wrote:

> 2014-03-18 5:29 GMT+08:00 Ralf Gommers <ralf.gommers at gmail.com>:
>
>
>> On Mon, Mar 17, 2014 at 9:31 AM, Richard Tsai <richard9404 at gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I looked at several ML packages and I found that ELKI has implemented a
>>> optimized single linkage algorithm called SLINK[1][2]. And I also found a
>>> similar algorithm called CLINK[3], which is for complete linkage. It seems
>>> that these two algorithms are much faster and use less memory than the
>>> naive algorithms we are using in cluster.hierarchy currently.
>>>
>>
>> Looks like ELKI as a mix of licenses including BSD, but the default is
>> AGPL. Did you check for this algorithm? Not entirely clear from the above
>> if you planned to integrate or reimplement it, for the former it matters.
>>
>
> The SLINK java implementation in ELKI should be under AGPL but these two
> algorithms were published decades ago and should be not patented.
>
>
>>
>>
>>> I also read some IPython notebooks and StackOverflow posts recently and
>>> I found that many people are discussing how to plot a heatmap of
>>> hierarchical clustering. I think if we integrate it into cluster.hierarchy,
>>> it will be a good complement to hierarchy.dendrogram.
>>>
>>
>> Assuming that it doesn't take too much time to implement (plotting
>> shouldn't be a focus), that sounds fine. There are some more functions in
>> several packages that optionally use MPL.
>>
>>
>>> Besides, I noticed that the cluster package is single-threaded
>>> currently. I don't know if parallelization in scipy level rather than BLAS
>>> level is proper,
>>>
>>
>> That has been out of scope for scipy until now.
>>
>>
>>>  but at least we can just make use of the BLAS library (if it supports)
>>> to parallelize the kmeans algorithm.
>>>
>>
>> Are you talking about the for-loop over observations in vq()? I don't see
>> any linalg going on in kmeans.
>>
>
> We can expand the formula when calculating the distances then make use of
> some BLAS functions. I've just written a demo of it:
> https://gist.github.com/richardtsai/9614846 (written casually, a bit
> messy)
> It runs about 16x faster than the original C version in a 100000x500
> dataset with k = 30. (Built with a thread-enabled ATLAS)
>
> In [30]: X.shape
> Out[30]: (100000, 500)
>
> In [31]: c.shape
> Out[31]: (30, 500)
>
> In [32]: %timeit _vq.vq(X, c)
> 1 loops, best of 3: 1.28 s per loop
>
> In [33]: %timeit _vq_rewrite.vq(X, c)
> 10 loops, best of 3: 79.6 ms per loop
>

OK clear. That's a pretty decent speed-up:)

Ralf

>
>
>>
>> Ralf
>>
>>
>>
>>>
>>> [1]:
>>> http://elki.dbs.ifi.lmu.de/releases/release0.6.0/doc/de/lmu/ifi/dbs/elki/algorithm/clustering/hierarchical/SLINK.html
>>> [2]: http://www.cs.ucsb.edu/~veronika/MAE/SLINK_sibson.pdf
>>> [3]: http://comjnl.oxfordjournals.org/content/20/4/364.abstract
>>>
>>> Regards,
>>> Richard
>>>
>>>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20140318/99482d56/attachment.html>