[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Carlos da Silva Santos carlos.s.santos at gmail.com
Wed Mar 19 07:39:27 EDT 2008


On Tue, Mar 18, 2008 at 12:05 PM, Anton Slesarev
<slesarev.anton at gmail.com> wrote:
>
> It is true that SciPy supports sparse matrices. But what I mean is that
> learn should support sparse formats in its parsers. To train an SVM it
> should be enough to use syntax like:
>
> data = LoadSparseData('data.file')
> s = svm()
> s.train(data)
>
> I hope to implement that kind of syntax. The user shouldn't need to know
> what exactly the data is; he should just be able to use it.
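
Something along those lines could probably be built on top of
scipy.sparse already. Just to illustrate, here is a rough, untested
sketch (the function name and the svmlight-like "label index:value ..."
file format are only my assumptions about what you have in mind):

import numpy as np
from scipy import sparse

def load_sparse_data(path):
    # Parse svmlight-style lines "label index:value index:value ..."
    # into a CSR feature matrix plus a dense label vector.
    rows, cols, vals, labels = [], [], [], []
    for row, line in enumerate(open(path)):
        fields = line.split()
        labels.append(float(fields[0]))
        for item in fields[1:]:
            col, val = item.split(':')
            rows.append(row)
            cols.append(int(col))
            vals.append(float(val))
    X = sparse.coo_matrix((vals, (rows, cols))).tocsr()
    return X, np.array(labels)

A LoadSparseData wrapper like the one in your example could then just
bundle X and the labels into a single object, so the user never has to
care whether the matrix underneath is dense or sparse.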

I have no first hand experience with it, but apparently the Shogun
toolbox provides some support for sparse data. From their homepage:

http://www.shogun-toolbox.org/
"The machine learning toolbox's focus is on large scale kernel methods
and especially on Support Vector Machines (SVM) [1]. It provides a
generic SVM object interfacing to several different SVM
implementations, among them the state-of-the-art LibSVM [2], SVMLight
[3], SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a
variety of kernels.
...
The input feature-objects can be dense, sparse or strings and of type
int/short/double/char and can be converted into different feature
types. Chains of preprocessors (e.g. subtracting the mean) can be
attached to each feature object allowing for on-the-fly
pre-processing.

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave
and Python."
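
Shogun's idea of chaining preprocessors also maps quite naturally onto
scipy: the normalization utilities mentioned in your proposal could
operate directly on sparse matrices. As a rough, untested sketch of one
such preprocessor (row-wise L2 normalization; the function name is just
a placeholder):

import numpy as np

def l2_normalize_rows(X):
    # Scale each row of a CSR matrix to unit L2 norm, leaving
    # all-zero rows untouched.
    X = X.tocsr().copy()
    norms = np.sqrt(X.multiply(X).sum(axis=1)).A.ravel()
    norms[norms == 0.0] = 1.0
    # indptr encodes how many values are stored per row, so every
    # stored value can be divided by its own row's norm.
    X.data /= np.repeat(norms, np.diff(X.indptr))
    return X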

I hope this helps.

Carlos


>
>
> >
> >
> > BBR will be welcomed by a lot of people; when you implement it, code
> > everything as generically as possible. For instance, feature selection
> > (perhaps extraction as well) could be used by other algorithms (even
> > SVMs), so it should be as generic as possible (feature comparison should
> > be explained: is it in terms of classification results?).
>
> Yes, that's a good suggestion. But I think the main part of BBR is the
> classifier; feature selection can be implemented in a separate module.
> There is no need to integrate their implementation of the feature
> selection algorithm into scikits.
>
> >
> >
> > I'd like to add manifold learning tools (these can be thought of as
> > feature extraction tools, visualization, ...) which could benefit from
> > your approach and vice versa.
> >
> That's great!
>
>
> >
> > Matthieu
> >
> >
> > 2008/3/16, Anton Slesarev <slesarev.anton at gmail.com>:
> > >
> > >
> > >
> > > Hi.
> > >
> > > I'm going to describe the problems I see in the current version of
> > > scikits.learn. After that I'll write about what I want to improve during
> > > Google Summer of Code. In my last letter I tried to enumerate some
> > > limitations of other open-source frameworks such as PyML and Orange.
> > >
> > > Let's start with scikits.learn.
> > >
> > > First of all, there is a lack of documentation. I can find nothing
> > > besides David Cournapeau's Google Summer of Code proposal: nothing in
> > > the wiki and nothing on the mailing list. There are a few examples in
> > > svm, of course, but it is very hard to work from examples alone. I can't
> > > find parsers for different data formats, only for datasets, and as I
> > > understand it the datasets don't support a sparse data format. There is
> > > no common structure in the ML package; it has scattered modules such as
> > > svm, em and ann, but no unifying idea.
> > >
> > > If I am mistaken about the current state of affairs, please correct me.
> > >
> > > Well, now about what I want to change.
> > >
> > > I am going to make the learn package suitable for text classification.
> > > I also want to replicate most of the functionality of PyML
> > > (pyml.sourceforge.net/).
> > >
> > > First of all we need a sparse data format. I want to write parsers for
> > > a number of common data formats.
> > >
> > > We need some preprocessing utilities, such as normalization and feature
> > > selection algorithms. This part should be common to the whole machine
> > > learning package.
> > >
> > > The package also needs a number of classifiers. There are at least two
> > > state-of-the-art approaches in text classification and categorization:
> > > SVM and Bayesian logistic regression. SVM has already been implemented
> > > in scikits. There are a lot of implementations of logistic regression;
> > > I am going to integrate one of them
> > > (http://www.stat.rutgers.edu/~madigan/BBR/) into scikits.
> > >
> > > An interpretation module is also needed, consisting of result
> > > processing (different quality metrics), visualization and feature
> > > comparison.
> > >
> > > There are common text collections (for instance
> > > http://trec.nist.gov/data/reuters/reuters.html). I'll try to make
> > > working with them absolutely simple.
> > >
> > > Finally, it is very important to write (or generate) reference
> > > documentation and a tutorial.
> > >
> > > OK, that's all. I look forward to hearing your opinions, particularly
> > > an answer from David Cournapeau, who is, as I understand, the maintainer
> > > of the learn package.
> > >
> > >
> > >
> > > --
> > > Anton Slesarev
> > >
> > >
> > >
> >
> >
> >
> > --
> > French PhD student
> > Website : http://matthieu-brucher.developpez.com/
> > Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
> > LinkedIn : http://www.linkedin.com/in/matthieubrucher
> >
> >
>
>
>
> --
> Anton Slesarev
>
>


