[SciPy-dev] feedback on scipy.sparse

Wed Dec 12 03:28:45 EST 2007

Hi Nathan

On Wed, Dec 12, 2007 at 12:59:53AM -0600, Nathan Bell wrote:
> I wanted to gather some feedback on the state of scipy.sparse and
> possible changes and improvements.
> 
> 
> ===== Constructors =====
> 
>   Here are the current constructors for the various sparse classes:
> 
>   csr_matrix and csc_matrix
>     def __init__(self, arg1, dims=None, dtype=None, copy=False):
> 
>   dok_matrix and lil_matrix
>     def __init__(self, A=None, shape=None, dtype=None, copy=False):
> 
>   coo_matrix
>     def __init__(self, arg1, dims=None, dtype=None):
> 
>   Empty matrices can now be constructed with xxx_matrix( (M,N) ) for
> all formats.
> 
>  1) Should we prefer 'dims' over 'shape' or vice versa?  IMO 'shape'
> is arguably more natural since all the types have a .shape attribute

+1

>  3) When the user defines the dim (or shape) argument but the data
> violates these bounds, what should happen?  IMO this merits an
> exception, as opposed to expanding the dimensions to accommodate the
> data.

I agree -- silently modifying the shape to allow such assignments is a
recipe for disaster.
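
To make the behaviour concrete, here is a minimal sketch of the kind of check I have in mind. It is plain Python, not the actual scipy.sparse code, and `check_bounds` is a hypothetical helper name used only for illustration:

```python
def check_bounds(rows, cols, shape):
    """Raise ValueError if any (row, col) index falls outside `shape`.

    Hypothetical helper: a constructor could call something like this
    before storing the data, instead of growing the matrix to fit.
    """
    n_rows, n_cols = shape
    for i, j in zip(rows, cols):
        if not (0 <= i < n_rows and 0 <= j < n_cols):
            raise ValueError(
                "index (%d, %d) is out of bounds for shape %r" % (i, j, shape)
            )


# Indices that fit inside a 2x3 matrix pass silently.
check_bounds([0, 1], [0, 2], (2, 3))

# An index outside the declared shape is rejected.
try:
    check_bounds([0, 5], [0, 0], (2, 3))
except ValueError as exc:
    print("rejected:", exc)
```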

> sparse.py currently weighs in at nearly 3000 lines and will continue
> growing.  I propose that we move the functions (e.g. spidentity(),
> spdiags(), spkron(), etc. ) to a separate file.  Any comments or
> proposals for the name of this file?  Would it be prudent to move the
> classes into separate files also?

I'd like to see the separate classes moving into their own files.

Eye, diags etc. make use of specific properties of each array type, so
I wonder whether those operations shouldn't be implemented as static
class methods?
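
Roughly what I mean, as a sketch only: `SketchCSR` below is a made-up stand-in class holding the three CSR arrays as plain lists, not the real csr_matrix, and `eye` is a hypothetical method name.

```python
class SketchCSR:
    """Toy container for the three CSR arrays (illustration only)."""

    def __init__(self, indptr, indices, data, shape):
        self.indptr = indptr
        self.indices = indices
        self.data = data
        self.shape = shape

    @classmethod
    def eye(cls, n):
        # The identity has exactly one entry per row, on the diagonal,
        # so the CSR arrays can be written down directly without
        # building another format first.
        indptr = list(range(n + 1))
        indices = list(range(n))
        data = [1.0] * n
        return cls(indptr, indices, data, (n, n))


m = SketchCSR.eye(3)
print(m.indptr, m.indices, m.data)
```

Each format's class would then carry its own constructor that knows about that format's layout.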

> Also, these functions always return a specific sparse format.  For
> example spidentity() always returns a csc_matrix, spkron() always
> returns a coo_matrix, etc.  Currently, a user who wanted the identity
> matrix in CSR format would have to do a CSC->CSR conversion on the
> result of spidentity().  This is somewhat wasteful since the
> spidentity() could easily have generated the CSR format instead.  It
> would be better to allow the user to specify the desired return type
> in the function call.  For example,
>    spidentity(n, dtype='d',format='csr')
> instead of
>    spidentity(n, dtype='d').tocsr()
> Sometimes a given function has a very natural return type.  For
> instance, when we have a dia_matrix() implementation (I'm working on
> one) then spdiags() would naturally use this format.  If the user
> specified another type,  spdiags( ..., format='csr') then spdiags()
> would, at worst, create the matrix in DIA format first and then
> convert to CSR (with dia_matrix.tocsr() ).  I like this approach
> because it allows the implementation to be clever when cleverness is
> possible, but also doesn't place an undue burden on the programmer
> when implementing a new method.  Furthermore, it shields the user from
> internal implementation changes that might change the default return
> format.

This sounds like the right approach: in my mind, any operation should
take the least amount of time possible.  That is, if a function needs
to convert a sparse array to csr internally, it should not bother
converting it back to the original type when returning.  A user should
be able to string together a whole bunch of operations and do a single
conversion to the required array type only at the end.

> I propose the following policy.  Functions get a new parameter
> 'format' which defaults to None.  The default implies that the
> function will return the matrix in whatever format is most natural
> (and subject to change).  For example:
>    spidentity(n, dtype='d',format=None)
> might return a dia_matrix(), or a special identity matrix format in
> the future.  At a minimum, the valid values of 'format' will include
> the three-letter abbreviations of the currently supported sparse
> matrix types (i.e. 'csr', 'csc', 'coo', 'lil', etc).  Comments?

Sounds good!
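
For the record, this is how I read the proposed policy. It is a sketch in plain Python with the matrices represented as tuples of lists; `sketch_identity`, `_BUILDERS` and `_NATURAL` are names I made up, and picking 'coo' as the natural format is just an assumption for the example.

```python
def _identity_coo(n):
    # (format, row indices, column indices, values)
    return ("coo", list(range(n)), list(range(n)), [1.0] * n)


def _identity_csr(n):
    # (format, indptr, column indices, values)
    return ("csr", list(range(n + 1)), list(range(n)), [1.0] * n)


# Formats that can be built directly, without an intermediate conversion.
_BUILDERS = {"coo": _identity_coo, "csr": _identity_csr}

# The "most natural" format; an implementation detail that may change.
_NATURAL = "coo"


def sketch_identity(n, format=None):
    """Return an n x n identity in the requested format (sketch only)."""
    if format is None:
        format = _NATURAL
    try:
        builder = _BUILDERS[format]
    except KeyError:
        raise ValueError("unknown sparse format %r" % (format,))
    return builder(n)


print(sketch_identity(3))                # whatever is most natural
print(sketch_identity(3, format="csr"))  # built directly as CSR
```

Callers that need a particular format ask for it explicitly; callers that pass nothing make no assumption about what comes back.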

> Also, feel free to respond with any other comments related to
> scipy.sparse

At the moment, IIRC, functionality for the different kinds of sparse
arrays is located in the same classes, separated by if statements.  I
would like to see the different classes pulled completely apart, so
that the only overlap is in common functionality.

I'd also like to discuss the in-place memory assignment policy.  When
do we copy on write, and when do we return views?  For example, taking
a slice out of a lil_matrix returns a new sparse array.  It is
*possible* to create a view, but it gets a bit tricky.  If each array
had an "origin" property, such views could be trivially constructed,
but it still does not cater for slices like x[::2].
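
A rough sketch of the "origin" idea, using a toy dictionary-of-keys class rather than lil_matrix. `SketchDOK`, its `origin` attribute and its `view` method are all hypothetical, and the view only handles contiguous blocks, so strided slices remain out of reach:

```python
class SketchDOK:
    """Toy dictionary-of-keys matrix with an origin offset (illustration)."""

    def __init__(self, shape, data=None, origin=(0, 0)):
        self.shape = shape
        # A view shares the parent's dictionary instead of copying it.
        self.data = {} if data is None else data
        self.origin = origin

    def _key(self, i, j):
        if not (0 <= i < self.shape[0] and 0 <= j < self.shape[1]):
            raise IndexError("index (%d, %d) out of range" % (i, j))
        # Translate view-local indices into the parent's coordinates.
        return (i + self.origin[0], j + self.origin[1])

    def __getitem__(self, ij):
        i, j = ij
        return self.data.get(self._key(i, j), 0.0)

    def __setitem__(self, ij, value):
        i, j = ij
        self.data[self._key(i, j)] = value

    def view(self, row_start, row_stop, col_start, col_stop):
        """Return a view of a contiguous block; no data is copied.

        The requested block is not validated against the parent's shape.
        """
        shape = (row_stop - row_start, col_stop - col_start)
        origin = (self.origin[0] + row_start, self.origin[1] + col_start)
        return SketchDOK(shape, self.data, origin)


a = SketchDOK((4, 4))
v = a.view(1, 3, 1, 3)   # rows 1..2, columns 1..2 of `a`
v[0, 0] = 5.0            # writes through to a[1, 1]
print(a[1, 1])
```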

Regards
Stéfan