[SciPy-Dev] GSoC Draft Proposal: Rewrite and improve cluster package in Cython

Fri Mar 14 09:17:04 EDT 2014

Hi David,

Thanks for your advice! I'll improve my proposal and pay more attention to
documentation. I agree that the vq module should be kept simple but
high-performance, so I'll focus on optimizing it. I'll also read up on
hierarchical clustering soon and look for potential improvements to the
hierarchy module.
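
For the vq benchmarks and unit tests, something like the following pure-NumPy
reference could serve as a correctness baseline for the Cython version. It is
only a rough sketch: `vq_reference` is a hypothetical helper name (not an
existing SciPy function), and it assumes 2-D float inputs and Euclidean
distance. It returns (codes, distances) in the same order as
scipy.cluster.vq.vq.

```python
import numpy as np


def vq_reference(obs, code_book):
    """Assign each observation to its nearest code (Euclidean distance).

    Hypothetical test helper, not part of scipy.cluster.vq.
    """
    obs = np.asarray(obs, dtype=np.float64)
    code_book = np.asarray(code_book, dtype=np.float64)
    # Differences between every observation and every code,
    # shape (n_obs, n_codes, n_features), built by broadcasting.
    diff = obs[:, np.newaxis, :] - code_book[np.newaxis, :, :]
    # Squared Euclidean distances, shape (n_obs, n_codes).
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)
    # Index of the nearest code for each observation.
    codes = sq_dist.argmin(axis=1)
    # Distance from each observation to its nearest code.
    dist = np.sqrt(sq_dist[np.arange(obs.shape[0]), codes])
    return codes, dist


obs = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
code_book = np.array([[0.0, 0.0], [4.0, 4.0]])
codes, dist = vq_reference(obs, code_book)
```

It builds the full (n_obs, n_codes, n_features) difference array, so it is
meant for small test inputs only, not as an optimized implementation.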

Regards,
Richard

2014-03-14 15:32 GMT+08:00 David Warde-Farley <d.warde.farley at gmail.com>:

> Hi,
>
> FWIW, I think this is a pretty good proposal, but I worry that some of
> it duplicates work that's already taken place in scikit-learn.
>
> I think that a high-performance vq module is an important thing to
> have in SciPy itself (though Jake Vanderplas did some work on distance
> computations in Cython for scikit-learn that should be leveraged if
> possible, maybe Jake has thoughts on factoring it into a separate
> package?) and to my knowledge, the hierarchy module is not duplicated
> to a great extent in scikit-learn. I'd thus prioritize those two
> things, *including* sprucing up their documentation (SciPy is a fairly
> mature project, and one where documentation is, ideally, not an
> afterthought).
>
> Things like mini-batch k-means and automatic determination of k are
> interesting but more scikit-learn territory. I would leave these
> things to the end, on an if-there's-time basis.
>
> Since that _vq_rewrite was written, Cython has introduced much cleaner
> memoryviews. Definitely prefer those over the deprecated ndarray
> syntax.
>
> On Thu, Mar 13, 2014 at 11:08 AM, Richard Tsai <richard9404 at gmail.com>
> wrote:
> > Hi all,
> > I wrote a draft proposal for my GSoC project on the cluster package, and
> > I'm posting it to the list for advice. As Ralf said, cluster is not well
> > maintained at the moment, and I have not yet been able to find someone
> > who knows cluster analysis to mentor me. If you have any suggestions for
> > my proposal, or are willing to mentor me, please let me know; I would be
> > really grateful.
> >
> > Regards,
> > Richard
> >
> > Proposal Title: SciPy: Rewrite and improve cluster package in Cython
> >
> > Proposal Abstract
> >
> > According to the roadmap to SciPy 1.0, the cluster package needs a Cython
> > rewrite to make it more maintainable and efficient. Besides, there is
> > room for improvement in the cluster.vq module: some useful features can
> > be added, and performance on large datasets can be improved.
> >
> > Proposal Detailed Description/Timeline
> >
> > There's an experimental Cython implementation of the vq module in the
> > source tree. However, it has not been maintained for about 2 years, it
> > only supports single-precision datasets, and it is slower than the
> > original implementation.
> >
> > I plan to start with some cleanup, then finish the double-precision
> > support. After some optimization and tuning it should be mature enough
> > to replace the original implementation.
> >
> > After that, I'm going to implement a mini-batch optimization for the
> > kmeans/kmeans2 functions based on a paper ("Web-Scale K-Means
> > Clustering"), which should greatly improve performance on large
> > datasets. In addition, I think support for automatically determining the
> > number of clusters via some method (e.g. the gap statistic) can be
> > included in this module.
> >
> > As for the hierarchy module, it is rather full-featured now, but the
> > Cython rewrite has not yet begun. I'll rewrite the high-level part in
> > Cython first, since it is convenient to call the original underlying C
> > functions from Cython code. Finally, I'll gradually migrate the
> > underlying part from C to Cython.
> >
> > My detailed timeline is as follows.
> >
> > Week 1: Clean up the existing experimental Cython version of vq (bugs,
> > docs, etc.), and add unit tests and performance benchmarks for datasets
> > of various sizes and distributions.
> > Week 2: Finish the double-precision support in the Cython version of vq,
> > and try to migrate some Python code to Cython to improve performance.
> > Week 3: Profile and continue to optimize the performance of vq, and try
> > to replace the original C implementation with the new Cython
> > implementation.
> > Week 4: Implement the mini-batch K-means algorithm.
> > Week 5: Add support for automatically determining the number of
> > clusters.
> > Week 6: Buffer time. Finish any work that is behind schedule, and try
> > some potential optimizations.
> > Week 7: Build a framework for the Cython implementation of the hierarchy
> > module. The work should just be translating the wrapper functions in
> > hierarchy_wrap.c into Cython, so there may be no performance gains at
> > that point.
> > Week 8-9: Rewrite the underlying implementation of the hierarchy module
> > in Cython. The major work is translating hierarchy.c into Cython.
> > Week 10: Optimize the Cython implementation of the hierarchy module, and
> > replace the original implementation if possible.
> > Remaining time (if any): Improve the documentation and add some sample
> > code, especially for the hierarchy module.
> >
> > Code Sample
> >
> > My previous patches to SciPy can be found at
> > https://github.com/scipy/scipy/pulls/richardtsai?state=closed
> > I haven't submitted code to the cluster package but I'll probably make a
> > related PR soon.
> >
> >
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> >