[SciPy-dev] Another GSoC idea

Wed Mar 25 13:21:33 EDT 2009

Hi David,

Thanks for your reply - I fell ill over the weekend and then fell  
behind on email (and other things :).

On 21-Mar-09, at 1:50 AM, David Cournapeau wrote:

> For scipy.cluster.vq, I already have something in Cython - just not
> put into scipy because the code is barely "research quality" (whatever
> that means :) ). But I think it would be less work to improve it than
> to start from scratch.

For sure - it's usually not a good idea to throw out code that works  
unless you have a very good reason! Do you think you'll ever get  
around to improving it?

> I think this would be a great addition. You are of course free to
> choose what you work on, but I like the idea of a basic set of
> recursive implementations of basic statistics and clustering
> algorithms. I have also myself an implementation of online EM for
> online estimation of GMM, based on the following preprint:
>
> http://www.citeulike.org/user/stibor/article/3245946

The idea of general "building blocks" for doing EM (and other things)
with probabilistic models in Python interests me very much, and
probably interests a lot of other people. However, it's an ambitious
undertaking in general, and even more so for a single GSoC project.
Part of the difficulty I see is that there's a lot of good existing
code that we wouldn't want to reinvent.

There's a lot of code in, for example, PyEM that would be of use, as
well as some of my own "research quality" machinations. There's also
the (often ignored) maxentropy module, which as far as I know doesn't
support hidden variables but would nonetheless have useful chunks.
(Personally, I had encountered maxent models under the moniker of
exponential family models, and had forgotten that the two are
equivalent until one day I looked at the maxentropy docs.)

Then there's PyMC, which as far as I can see has developed a *really*
well thought out object-oriented system for specifying probabilistic
graphical models. Of course, it's geared toward Bayesian inference via
MCMC. In the (relatively rare) case that the posterior is analytically
available, it shouldn't be all that difficult to graft on code for
computing it in closed form. The same goes for maximum likelihood
(hyper)parameter fitting via EM or gradient-based optimization.

Then of course there's code written in other languages, like Kevin
Murphy's Bayes Net toolbox for Matlab, which I recall you got
permission to port under a BSD license.

In summary, I think a general treatment of mixture models, etc. in
Python is a big task, and as such I'm not certain it'd be suitable for
a GSoC project. Having a really solid clustering module with a few
canned non-probabilistic algorithms, such as k-means (which
scipy.cluster already has) and k-medoids/centers, might be a more
manageable task in the short term.
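
To give a rough idea of the scale of a "canned" algorithm like
k-medoids, here is a minimal NumPy sketch. It uses the simple
alternating scheme (assign points to the nearest medoid, then re-pick
each medoid within its cluster), not PAM, and it builds the full
pairwise distance matrix, so it is only sensible for small data sets.
The function name, signature and the optional `init` argument are my
own inventions for illustration, not an existing scipy.cluster API.

```python
import numpy as np


def k_medoids(points, k, init=None, n_iter=100, seed=0):
    """Alternating k-medoids sketch; returns (medoid indices, labels)."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]

    # Full pairwise Euclidean distance matrix, shape (n, n).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    if init is None:
        # Hypothetical default: k distinct points chosen at random.
        rng = np.random.default_rng(seed)
        medoids = rng.choice(n, size=k, replace=False)
    else:
        medoids = np.asarray(init, dtype=int)

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)

        # Update step: within each cluster, pick the member with the
        # smallest total distance to the other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # empty cluster: keep the old medoid
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]

        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids did not change
        medoids = new_medoids

    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels


# Two small, well-separated groups of 2-D points.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
medoids, labels = k_medoids(data, k=2, init=[0, 1])
print(medoids, labels)
```

Like k-means, this only finds a local optimum and the result depends
on the initial medoids, so a real implementation would want restarts
or a smarter initialisation.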

David