[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Matthieu Brucher matthieu.brucher at gmail.com
Sun Mar 16 08:58:28 EDT 2008


Hi,

I completely agree that there should be more documentation, but I still
don't see your point about the sparse data format. SciPy already provides
one (scipy.sparse), doesn't it?
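
To make the point concrete, here is a minimal sketch of loading sparse data
into a scipy.sparse matrix. It assumes the "label index:value" text layout
used by SVMlight/LIBSVM, with 1-based feature indices; the function name
parse_sparse_lines is hypothetical and not part of scikits.learn.

```python
import numpy as np
from scipy import sparse


def parse_sparse_lines(lines, n_features):
    """Parse 'label index:value ...' lines into (CSR matrix, label array).

    Assumption: feature indices in the text are 1-based.
    """
    data, rows, cols, labels = [], [], [], []
    n_rows = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        labels.append(float(parts[0]))
        for item in parts[1:]:
            idx, val = item.split(":")
            rows.append(n_rows)
            cols.append(int(idx) - 1)  # convert to 0-based column index
            data.append(float(val))
        n_rows += 1
    # Only the non-zero entries are stored.
    X = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_features))
    return X, np.array(labels)


sample = ["1 1:0.5 4:2.0", "-1 2:1.0", "1 3:3.0 4:1.5"]
X, y = parse_sparse_lines(sample, n_features=4)
```

Here X is a 3x4 CSR matrix holding five stored values, and y holds the three
labels.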

BBR will be welcomed by a lot of people. When you implement it, code
everything as generically as possible. For instance, feature selection (and
perhaps extraction as well) could be used by other algorithms (even SVMs), so
it should not be tied to one classifier. Feature comparison should also be
explained: is it in terms of classification results?
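
As a rough sketch of what "generic" could mean here: the selector below takes
the scoring function as an argument, so it does not depend on any particular
classifier. The names select_features and variance_score are hypothetical,
and per-feature variance is only a stand-in criterion, not the one BBR uses.

```python
import numpy as np


def variance_score(X, y):
    """Stand-in scoring function: per-feature variance (ignores y)."""
    return np.asarray(X, dtype=float).var(axis=0)


def select_features(X, y, score_func, k):
    """Return the sorted column indices of the k highest-scoring features."""
    scores = np.asarray(score_func(X, y), dtype=float)
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return np.sort(top)


X = np.array([[1.0, 0.0, 6.0],
              [1.0, 2.0, 1.0],
              [1.0, 4.0, 2.0]])
y = np.array([0, 1, 1])

kept = select_features(X, y, variance_score, k=2)
X_reduced = X[:, kept]  # can be passed to any classifier afterwards
```

Any other criterion (for example one based on classification results) can be
swapped in by passing a different score_func with the same signature.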

I'd like to add manifold learning tools (these can be thought of as feature
extraction tools, visualization tools, ...), which could benefit from your
approach and vice versa.

Matthieu

2008/3/16, Anton Slesarev <slesarev.anton at gmail.com>:
>
> Hi.
>
> I'm going to describe the problems I see in the current version of
> scikits.learn. After that I'll write what I want to improve during Google
> Summer of Code. In my last letter I tried to enumerate some limitations of
> other open-source frameworks such as PyML and Orange.
>
> Let's start with scikits.learn.
>
> First of all, there is a lack of documentation. I can find nothing besides
> David Cournapeau's proposal for Google Summer of Code: nothing in the wiki
> and nothing on the mailing list. There are a few examples in svm, of course,
> but it is very hard to work from examples alone. I can't find parsers for
> different data formats, only for datasets. As I understand it, datasets
> don't support a sparse data format. There is no common structure in the ML
> package. It has scattered modules such as svm, em, and ann, but no unifying
> idea.
>
> If I have misunderstood the current state of affairs, please correct me.
>
> Well, now about what I want to change.
>
> I am going to make the learn package suitable for text classification. I
> also want to reproduce most of PyML's (pyml.sourceforge.net/) functionality.
>
> First of all, we need a sparse data format. I want to write parsers for a
> number of common data formats.
>
> We need some preprocessing utilities, such as normalization and feature
> selection algorithms. This part should be common to the whole machine
> learning package.
>
> The package also needs a number of classifiers. There are at least two
> state-of-the-art approaches to text classification and categorization: SVM
> and Bayesian logistic regression. SVM has already been implemented in
> scikits. There are a lot of implementations of logistic regression. I am
> going to integrate one of them (http://www.stat.rutgers.edu/~madigan/BBR/)
> into scikits.
>
> An interpretation module is needed, which covers processing of results
> (different quality metrics), visualization, and feature comparison.
>
> There are common text collections (for instance
> http://trec.nist.gov/data/reuters/reuters.html). I'll try to make working
> with them as simple as possible.
>
> Finally, it is very important to write (or generate) reference
> documentation and a tutorial.
>
> OK, that's all. I look forward to hearing your opinions. In particular, I
> would like to hear from David Cournapeau, who is, as I understand, the
> maintainer of the learn package.
>
>
>
> --
> Anton Slesarev
>


-- 
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher