[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Bruce Southey bsouthey at gmail.com
Mon Mar 24 10:49:14 EDT 2008


Hi,
The main problem I have with using sparse input formats is that it
tends to ignore the complete picture. Typically the algorithms are not
implemented to take advantage of sparse matrices and the associated
techniques, so the internals and outputs are not stored in a sparse
format. This means that the only gain is the apparent ease of input,
because any storage advantage is lost if the input needs to be
converted to a dense format (especially if both copies are kept in
memory).
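The dense-conversion cost is easy to see with scipy itself; a minimal
sketch (the matrix size and density are illustrative, not from this
thread):

```python
import numpy as np
from scipy import sparse

# A 10000 x 1000 matrix with 0.1% nonzero entries, kept in CSR format
# versus converted to dense with .toarray(). Only the three CSR arrays
# (data, indices, indptr) are counted for the sparse size.
X = sparse.random(10000, 1000, density=0.001, format="csr", dtype=np.float64)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes      # 10000 * 1000 * 8 bytes for float64
print(sparse_bytes, dense_bytes)      # the dense copy dwarfs the CSR storage
```

Keeping both copies alive at once, as the paragraph above warns, costs
the sum of the two figures.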

Using record arrays together with masked and sparse arrays would
probably address many concerns. Record arrays would allow labels like
'target' without forcing any order on the data storage, masked arrays
would allow for missing values in the input, and sparse arrays would
potentially provide storage and algorithmic advantages.
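The record-array-plus-masked-array part of that idea can be sketched in
a few lines; the field names and values here are invented for
illustration:

```python
import numpy as np

# A structured (record) array with a named 'target' field, wrapped in a
# masked array so individual observations can be marked as missing.
data = np.ma.array(
    [(1.0, 5.5, 0), (2.0, 6.1, 1), (3.0, 4.2, 0)],
    dtype=[("weight", "f8"), ("height", "f8"), ("target", "i4")],
)
data["height"][2] = np.ma.masked      # mark the third height as missing

labels = data["target"]               # fields accessed by name, not position
mean_height = data["height"].mean()   # computed over unmasked values only
print(labels, mean_height)
```

Nothing in the dtype fixes a column order for the user: code refers to
'target' by name, which is the point of the record-array suggestion.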

Regards
Bruce

Anton Slesarev wrote:
>
>
> On Mon, Mar 24, 2008 at 3:55 PM, Nathan Bell <wnbell at gmail.com> wrote:
> >
> > On Mon, Mar 24, 2008 at 5:41 AM, David Cournapeau
> > <david at ar.media.kyoto-u.ac.jp> wrote:
> > >  > I mean that I have 1 million text pages with 150 thousand
> > >  > different words (features), but each document has only a small
> > >  > part of all 150 thousand words. And if I use a simple matrix it
> > >  > will be huge. But if I use a sparse format such as the libsvm
> > >  > data format, then the input file will be much smaller. I don't
> > >  > know how to do it with scikits now. I know how to do it with
> > >  > libsvm and many other tools, but I want to make scipy
> > >  > appropriate for this task. And I want to make a tutorial in
> > >  > which one paragraph will be about "Sparse data".
> > >
> > >  I understand sparse; I don't understand why you cannot use the
> > >  existing scipy implementation :)
> >
> > Anton, can you describe libsvm's sparse format?  I think it's highly
> > likely that scipy.sparse supports the functionality you need.
> >
> > Currently you can load a sparse matrix from disk using MatrixMarket
> > format (scipy.io.mmread)  or  MATLAB format (scipy.io.loadmat).  Both
> > of these functions should be fast enough for your 150K by 1M example.
> >
> > FWIW the MATLAB files will generally be smaller and load faster.
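As a quick check of the MatrixMarket route Nathan mentions, a round
trip through scipy.io.mmwrite/mmread can be sketched as follows (an
in-memory buffer stands in for a file on disk):

```python
import io
from scipy.io import mmread, mmwrite
from scipy.sparse import csr_matrix

# Write a small sparse matrix in MatrixMarket text format and read it
# back; the two should agree entry for entry.
A = csr_matrix([[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]])
buf = io.BytesIO()
mmwrite(buf, A)
buf.seek(0)
B = mmread(buf)       # mmread returns a COO matrix
print(B.shape, B.nnz)
```

For the 1M x 150K case the only change is passing a filename instead of
a buffer.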
> >
> > --
> > Nathan Bell wnbell at gmail.com
> > http://graphics.cs.uiuc.edu/~wnbell/
> >
> >
> >
> > _______________________________________________
> > Scipy-dev mailing list
> > Scipy-dev at scipy.org
> > http://projects.scipy.org/mailman/listinfo/scipy-dev
> >
>
> libsvm format:
>
>
>
> "libsvm uses the so called "sparse" format where zero values do not
> need to be stored. Hence a data with attributes 1 0 2 0
> is represented as 1:1 3:2"
>
>
> I understand that it is possible to use scipy.sparse and something
> else, but what if I need to do feature selection or some specific
> normalization? I think that we can integrate this procedure (with
> scipy.sparse and reading of huge files) into the dataset class in
> scikits.learn.
>
> -- 
> Anton Slesarev
> ------------------------------------------------------------------------
>
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-dev
>   




More information about the SciPy-Dev mailing list