[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Sun Mar 16 08:41:29 EDT 2008

Hi.

I'm going to describe the problems I see in the current version of
scikits.learn. After that I'll write what I want to improve during Google
Summer of Code. In my last letter I tried to enumerate some limitations of
other open-source frameworks such as PyML and Orange.

Let's start with scikits.learn.

First of all, there is a lack of documentation. I can find nothing besides
David Cournapeau's proposal for Google Summer of Code: nothing in the wiki
and nothing on the mailing list. There are a few examples in svm, of course,
but it is very hard to work from examples alone. I can't find parsers for
different data formats, only for datasets. As I understand it, datasets
don't support a sparse data format. There is no common structure in the ML
package. It has scattered modules such as svm, em and ann, but no unifying
idea.

If I misunderstand the current state of affairs, please correct me.

Well, now about what I want to change.

I am going to make the learn package suitable for text classification. I
also want to reproduce most of the PyML (pyml.sourceforge.net/)
functionality.

First of all, we need a sparse data format. I want to write parsers for a
number of common data formats.
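
As a rough illustration of what such a parser could look like, here is a
minimal sketch for the "label index:value ..." text layout read by SVMlight
and LIBSVM. The function names and the choice of a plain dict per row are
assumptions made for this example, not part of scikits.learn.

```python
def parse_sparse_line(line):
    """Parse one 'label index:value index:value ...' line.

    Returns (label, features), where features maps a feature index to its
    value; indices that do not appear are implicitly zero.
    """
    # Drop a trailing '# comment', if any, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def parse_sparse_text(text):
    """Parse many lines, skipping blank or comment-only ones.

    Returns (labels, rows), two lists of equal length.
    """
    labels, rows = [], []
    for line in text.splitlines():
        if not line.split("#", 1)[0].strip():
            continue
        label, features = parse_sparse_line(line)
        labels.append(label)
        rows.append(features)
    return labels, rows


# Hypothetical two-document sample, only to show the call.
sample = "+1 1:0.5 7:1.0\n-1 3:2.0  # second document\n"
labels, rows = parse_sparse_text(sample)
```

Rows kept this way store only the non-zero entries, which is what matters for
text data; they could later be packed into a scipy.sparse matrix.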

We need some preprocessing utilities, such as normalization and feature
selection algorithms.
This part should be common to the whole machine learning package.
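
To make this concrete, here is a small sketch of two such utilities on sparse
rows: L2 normalization and a simple document-frequency filter. Both are
stand-ins chosen for brevity, the names are hypothetical, and each row is
assumed to be a dict mapping a feature index to its value.

```python
import math


def l2_normalize(row):
    """Return a copy of the sparse row scaled to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row.values()))
    if norm == 0.0:
        # An all-zero row cannot be scaled; return it unchanged.
        return dict(row)
    return {i: v / norm for i, v in row.items()}


def select_by_document_frequency(rows, min_df):
    """Keep only features that are non-zero in at least min_df rows."""
    df = {}
    for row in rows:
        for i, v in row.items():
            if v != 0:
                df[i] = df.get(i, 0) + 1
    keep = {i for i, count in df.items() if count >= min_df}
    return [{i: v for i, v in row.items() if i in keep} for row in rows]


# Hypothetical toy rows, only to show the calls.
toy_rows = [{0: 1.0, 1: 2.0}, {0: 3.0}, {2: 1.0}]
filtered = select_by_document_frequency(toy_rows, min_df=2)
unit = l2_normalize({0: 3.0, 2: 4.0})
```

Because both functions take and return the same row representation, any
classifier in the package could share them.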

The package also needs a number of classifiers. There are at least two
state-of-the-art approaches to text classification and categorization: SVM
and Bayesian logistic regression. SVM has already been implemented in
scikits. There are a lot of implementations of logistic regression. I am
going to integrate one of them (http://www.stat.rutgers.edu/~madigan/BBR/)
into scikits.
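
For readers unfamiliar with the idea, the sketch below fits a logistic
regression with a zero-mean Gaussian prior on the weights by plain gradient
ascent on the log-posterior (a MAP estimate). It is not the BBR code and does
not use its optimizer; the function names, the step size and the dict-based
sparse rows are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, row):
    """Probability of the positive class for one sparse row."""
    return sigmoid(b + sum(w[i] * v for i, v in row.items()))


def train_logreg(rows, labels, n_features, prior_variance=1.0,
                 step=0.1, epochs=200):
    """MAP logistic regression; labels must be 0 or 1."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        # Gradient of the Gaussian log-prior pulls each weight toward zero.
        grad_w = [-wj / prior_variance for wj in w]
        grad_b = 0.0  # the bias is left without a prior
        for row, y in zip(rows, labels):
            err = y - predict_proba(w, b, row)
            for i, v in row.items():
                grad_w[i] += err * v
            grad_b += err
        w = [wj + step * g for wj, g in zip(w, grad_w)]
        b += step * grad_b
    return w, b


# Hypothetical toy data: feature 0 marks class 1, feature 1 marks class 0.
toy_rows = [{0: 1.0}, {0: 1.0, 2: 0.5}, {1: 1.0}, {1: 1.0, 2: 0.5}]
toy_labels = [1, 1, 0, 0]
w, b = train_logreg(toy_rows, toy_labels, n_features=3)
```

A smaller prior_variance shrinks the weights more strongly, which is the
regularizing effect that makes this family of models attractive for
high-dimensional text features.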

An interpretation module is needed, covering processing of results
(different quality metrics), visualization, and feature comparison.
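
As one example of the quality metrics meant here, precision, recall and F1
are commonly reported for text categorization. The helper below is a minimal
sketch with a hypothetical name; it assumes two equal-length sequences of
true and predicted labels.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Define each ratio as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels, only to show the call.
scores = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```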

There are common text collections (for instance
http://trec.nist.gov/data/reuters/reuters.html). I'll try to make working
with them as simple as possible.

Finally, it is very important to write (or generate) reference
documentation and a tutorial.

OK, that's all. I look forward to hearing your opinions. In particular, I
would like to hear from David Cournapeau, who is, as I understand, the
maintainer of the learn package.

-- 
Anton Slesarev