[SciPy-dev] Another GSoC idea
David Warde-Farley
Dwf at cs.toronto.edu
Wed Mar 25 13:21:33 EDT 2009
Hi David,
Thanks for your reply - I fell ill over the weekend and then fell
behind on email (and other things :).
On 21-Mar-09, at 1:50 AM, David Cournapeau wrote:
> For scipy.cluster.vq, I already have something in Cython - just not
> put into scipy because the code is barely "research quality" (whatever
> that means :) ). But I think it would be less work to improve it than
> to start from scratch.
For sure - it's usually not a good idea to throw out code that works
unless you have a very good reason! Do you think you'll ever get
around to improving it?
> I think this would be a great addition. You are of course free to
> choose what you work on, but I like the idea of a basic set of
> recursives implementations of basic statistics and clustering
> algorithms. I have also myself an implementation of online EM for
> online estimation of GMM, based on the following preprint:
>
> http://www.citeulike.org/user/stibor/article/3245946
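To make "recursive" concrete: a recursive (online) statistic is updated one sample at a time without storing the data. Below is a minimal sketch of a running mean and variance using Welford's update. The class name RunningStats is hypothetical and is not part of scipy or of the code David mentions; it only illustrates the kind of building block being discussed.

```python
class RunningStats:
    """Hypothetical helper: running mean and variance via Welford's update."""

    def __init__(self):
        self.n = 0          # number of samples seen so far
        self.mean = 0.0     # current running mean
        self._m2 = 0.0      # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance; NaN with fewer than two samples."""
        if self.n < 2:
            return float("nan")
        return self._m2 / (self.n - 1)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Each update costs constant time and memory, which is what makes this style attractive for streaming data.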
The idea of general "building blocks" for doing EM (and other things)
with probabilistic models in Python interests me very much, and
probably interests a lot of other people. However, it's a somewhat
ambitious undertaking, especially for a GSoC. Part of the difficulty I
see is that there's a lot of good existing code that we wouldn't want
to reinvent.
There's a lot of code in PyEM, for example, that would be of use, as
would some of my own "research quality" machinations. There's also the
(often ignored) maxentropy module, which as far as I know doesn't
support hidden variables but would nonetheless have useful chunks.
(Personally, I had encountered maxent models under the moniker of
exponential family models, and had forgotten that the two are
equivalent until one day I looked at the maxentropy docs.)
Then there's PyMC, which as far as I can see has developed a *really*
well thought out object-oriented system for specifying probabilistic
graphical models. Of course, it's geared toward Bayesian inference via
MCMC. In the (relatively rare) case that the posterior is analytically
available it shouldn't be all that difficult to graft on code for
doing that. Likewise with maximum likelihood (hyper)parameter fitting
via EM or gradient-based optimization.
Then of course there's code written in other languages, like Kevin
Murphy's Bayes Net Toolbox for Matlab, which I recall you got
permission to port under a BSD license.
In summary, I think a general treatment of mixture models, etc. in
Python is a big task, and as such I'm not certain it'd be suitable for
a SoC. Having a really solid module with a few canned
non-probabilistic algorithms, such as k-means (which scipy.cluster.vq
already has) and k-medoids/k-centers, might be a more manageable task
in the short term.
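For a sense of scale, here is a rough sketch of what a canned k-medoids routine might look like, using the simple alternating (assign, then re-pick medoids) iteration rather than full PAM. The function name k_medoids and its signature are hypothetical, not an existing scipy API, and the sketch assumes the input points are distinct.

```python
import random


def k_medoids(points, k, dist, n_iter=100, seed=0):
    """Hypothetical sketch of k-medoids by alternating assignment and update.

    Returns the sorted medoid indices and, for each point, the index of
    its nearest medoid. Assumes the points are distinct.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to all other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep an empty cluster's medoid as-is
                continue
            best = min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    medoids = sorted(medoids)
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
medoids, labels = k_medoids(points, k=2, dist=lambda a, b: abs(a - b))
```

Because only pairwise distances are needed, the same routine works with any dissimilarity function, which is the main practical difference from k-means.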
David