[SciPy-dev] feedback on scipy.sparse

Wed Dec 12 01:59:53 EST 2007

I wanted to gather some feedback on the state of scipy.sparse and
possible changes and improvements.

===== Constructors =====

  Here are the current constructors for the various sparse classes:

  csr_matrix and csc_matrix
    def __init__(self, arg1, dims=None, dtype=None, copy=False):

  dok_matrix and lil_matrix
    def __init__(self, A=None, shape=None, dtype=None, copy=False):

  coo_matrix
    def __init__(self, arg1, dims=None, dtype=None):

  Empty matrices can now be constructed with xxx_matrix( (M,N) ) for
all formats.

 1) Should we prefer 'dims' over 'shape' or vice versa?  IMO 'shape'
is arguably more natural since all the types have a .shape attribute
 2) It would be nice if xxx_matrix( A ) always worked when A is a
sparse or dense matrix.  Does anyone object to this?  The
functionality is already present (though the various .toxxx() methods)
 3) When the user defines the dim (or shape) argument but the data
violates these bounds, what should happen?  IMO this merits an
exception, as opposed to expanding the dimensions to accommodate the
data.

===== sparse.py and sparse functions =====

sparse.py currently weighs in at nearly 3000 lines and will continue
growing.  I propose that we move the functions (e.g. spidentity(),
spdiags(), spkron(), etc. ) to a separate file.  Any comments or
proposals for the name of this file?  Would it be prudent to move the
classes into separate files also?

Also, these functions always return a specific sparse format.  For
example spidentity() always returns a csc_matrix, spkron() always
returns a coo_matrix, etc.  Currently, a user who wanted the identity
matrix in CSR format would have to do a CSC->CSR conversion on the
result of spidentity().  This is somewhat wasteful since the
spidentity() could easily have generated the CSR format instead.  It
would be better to allow the user to specify the desired return type
in the function call.  For example,
   spidentity(n, dtype='d',format='csr')
instead of
   spidentity(n, dtype='d').tocsr()
Sometimes a given function has a very natural return type.  For
instance, when we have a dia_matrix() implementation (I'm working on
one) then spdiags() would naturally use this format.  If the user
specified another type,  spdiags( ..., format='csr') then spdiags()
would, at worst, create the matrix in DIA format first and then
convert to CSR (with dia_matrix.tocsr() ).  I like this approach
because it allows the implementation to be clever when cleverness is
possible, but also doesn't place an undue burden on the programmer
when implementing a new method.  Furthermore, it shields the user from
internal implementation changes that might change the default return
format.

I propose the following policy.  Functions get a new parameter
'format' which defaults to None.  The default implies that the
function will return the matrix in whatever format is most natural
(and subject to change).  For example:
   spidentity(n, dtype='d',format=None)
might return a dia_matrix(), or a special identity matrix format in
the future.  At a minimum, the valid values of 'format' will include
the three-letter abbreviations of the currently supported sparse
matrix types (i.e. 'csr', 'csc', 'coo', 'lil', etc).  Comments?

Also, feel free to respond with any other comments related to scipy.sparse

-- 
Nathan Bell wnbell at gmail.com