[SciPy-Dev] GSoC Draft Proposal: Rewrite and improve cluster package in Cython

David Warde-Farley d.warde.farley at gmail.com
Fri Mar 14 10:29:52 EDT 2014


On the side of hierarchical clustering, I think it would be very
instructive to look at existing _software packages_ for doing
hierarchical clustering rather than just the research literature.

I think promoting the fact that this part of the library even exists
and showing people accustomed to other tools how to use it (e.g. with
IPython notebooks on the subject, demonstrating plots and analysis and
so on...) would make a good complement to what you've proposed.
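
As a rough idea of what the first cell of such a notebook might contain, here
is a minimal sketch using scipy.cluster.hierarchy. The synthetic two-blob
data, the variable names, and the choice of average linkage are assumptions
made purely for illustration; they are not part of the proposal.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Synthetic data (an assumption for illustration): two well-separated 2-D blobs.
rng = np.random.RandomState(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

# Condensed vector of pairwise Euclidean distances between observations.
distances = pdist(points, metric="euclidean")

# Agglomerative clustering with average linkage; Z has one row per merge.
Z = linkage(distances, method="average")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
```

A follow-up cell could pass Z to scipy.cluster.hierarchy.dendrogram (which
needs matplotlib) to show the plotting side.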

On Fri, Mar 14, 2014 at 9:17 AM, Richard Tsai <richard9404 at gmail.com> wrote:
> Hi David,
>
> Thanks for your advice! I'll improve my proposal and pay more attention to
> documentation. I agree that the vq module should be kept simple but
> high-performance, so I'll focus on optimizing it. I'll also read up on
> hierarchical clustering soon and look for potential improvements to it.
>
> Regards,
> Richard
>
>
> 2014-03-14 15:32 GMT+08:00 David Warde-Farley <d.warde.farley at gmail.com>:
>
>> Hi,
>>
>> FWIW, I think this is a pretty good proposal, but I worry that some of
>> it duplicates work that's already taken place in scikit-learn.
>>
>> I think that a high-performance vq module is an important thing to
>> have in SciPy itself (though Jake Vanderplas did some work on distance
>> computations in Cython for scikit-learn that should be leveraged if
>> possible, maybe Jake has thoughts on factoring it into a separate
>> package?) and to my knowledge, the hierarchy module is not duplicated
>> to a great extent in scikit-learn. I'd thus prioritize those two
>> things, *including* sprucing up their documentation (SciPy is a fairly
>> mature project, and one where documentation is, ideally, not an
>> afterthought).
>>
>> Things like mini-batch k-means and automatic determination of k are
>> interesting but more scikit-learn territory. I would leave these
>> things to the end, on an if-there's-time basis.
>>
>> Since that _vq_rewrite was written, Cython has introduced much cleaner
>> memoryviews. Definitely prefer those over the deprecated ndarray
>> syntax.
>>
>> On Thu, Mar 13, 2014 at 11:08 AM, Richard Tsai <richard9404 at gmail.com>
>> wrote:
>> > Hi all,
>> > I wrote a draft proposal for my GSoC project on the cluster package. I am
>> > posting it to the list hoping for advice. However, as Ralf said, cluster is
>> > not well maintained now, and I have not yet been able to find someone who
>> > knows about cluster analysis to mentor me. If you have any suggestions for
>> > my proposal, or are willing to mentor me, please let me know and I will be
>> > really grateful.
>> >
>> > Regards,
>> > Richard
>> >
>> > Proposal Title: SciPy: Rewrite and improve cluster package in Cython
>> >
>> > Proposal Abstract
>> >
>> > According to the roadmap to SciPy 1.0, the cluster package needs a Cython
>> > rewrite to make it more maintainable and efficient. Besides, there's room
>> > for improvement in the cluster.vq module: some useful features can be added
>> > and the performance can be improved when dealing with large datasets.
>> >
>> > Proposal Detailed Description/Timeline
>> >
>> > There's an experimental Cython implementation of the vq module in the source
>> > tree. However, it has not been maintained for about 2 years, it only
>> > supports single-precision datasets, and it is slower than the original
>> > implementation.
>> >
>> > I plan to start with some cleanup, then finish the double-precision
>> > support. After some optimization and tuning, it should be mature enough to
>> > replace the original implementation.
>> >
>> > After that, I'm going to implement a mini-batch optimization for the
>> > kmeans/kmeans2 functions based on a paper ("Web-Scale K-Means Clustering"),
>> > which should greatly improve the performance for large datasets. In
>> > addition, I think support for automatically determining the number of
>> > clusters via some method (e.g. the gap statistic) can be included in this
>> > module.
>> >
>> > As for the hierarchy module, it is rather full-featured now, but the Cython
>> > rewrite has not yet begun. I'll rewrite the high-level part in Cython first,
>> > since it is convenient to call the original underlying C functions from
>> > Cython code. After that, I'll gradually migrate the underlying part from C
>> > to Cython.
>> >
>> > My detailed timeline is as follows.
>> >
>> > Week 1: Clean up the existing experimental Cython version of vq (bugs,
>> > docs, etc.), and add unit tests and performance benchmarks for datasets of
>> > various sizes and distributions.
>> > Week 2: Finish the double-precision support in the Cython version of vq, and
>> > try to migrate some Python code to Cython to improve performance.
>> > Week 3: Do some performance profiling, continue to optimize the performance
>> > of vq, and try to replace the original C implementation with the new Cython
>> > implementation.
>> > Week 4: Implement the mini-batch K-means algorithm.
>> > Week 5: Add support for automatically determining the number of clusters.
>> > Week 6: Buffer time. Finish the work that is behind schedule, and try some
>> > potential optimizations.
>> > Week 7: Build a framework for the Cython implementation of the hierarchy
>> > module. The work should just be translating the wrapper functions in
>> > hierarchy_wrap.c into Cython, so there may be no performance gains by then.
>> > Weeks 8-9: Rewrite the underlying implementation of the hierarchy module in
>> > Cython. The major work is to translate hierarchy.c into Cython.
>> > Week 10: Optimize the Cython implementation of the hierarchy module, and
>> > replace the original implementation if possible.
>> > Remaining time (if any): Improve the documentation and add some sample code,
>> > especially for the hierarchy module.
>> >
>> > Code Sample
>> >
>> > My previous patches to SciPy can be found in
>> > https://github.com/scipy/scipy/pulls/richardtsai?state=closed
>> > I haven't submitted code to the cluster package but I'll probably make a
>> > related PR soon.
>> >
>> >
>> > _______________________________________________
>> > SciPy-Dev mailing list
>> > SciPy-Dev at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>> >
