[SciPy-dev] feedback on scipy.sparse
Stefan van der Walt
stefan at sun.ac.za
Wed Dec 12 03:28:45 EST 2007
Hi Nathan
On Wed, Dec 12, 2007 at 12:59:53AM -0600, Nathan Bell wrote:
> I wanted to gather some feedback on the state of scipy.sparse and
> possible changes and improvements.
>
>
> ===== Constructors =====
>
> Here are the current constructors for the various sparse classes:
>
> csr_matrix and csc_matrix
> def __init__(self, arg1, dims=None, dtype=None, copy=False):
>
> dok_matrix and lil_matrix
> def __init__(self, A=None, shape=None, dtype=None, copy=False):
>
> coo_matrix
> def __init__(self, arg1, dims=None, dtype=None):
>
> Empty matrices can now be constructed with xxx_matrix( (M,N) ) for
> all formats.
>
> 1) Should we prefer 'dims' over 'shape' or vice versa? IMO 'shape'
> is arguably more natural since all the types have a .shape attribute
+1
> 3) When the user defines the dim (or shape) argument but the data
> violates these bounds, what should happen? IMO this merits an
> exception, as opposed to expanding the dimensions to accommodate the
> data.
I agree -- silently modifying the shape to allow such assignments is a
recipe for disaster.
> sparse.py currently weighs in at nearly 3000 lines and will continue
> growing. I propose that we move the functions (e.g. spidentity(),
> spdiags(), spkron(), etc. ) to a separate file. Any comments or
> proposals for the name of this file? Would it be prudent to move the
> classes into separate files also?
I'd like to see the separate classes moving into their own files.
Eye, diags etc. make use of specific properties of each array type, so
I wonder whether those operations shouldn't be implemented as static
class methods?
> Also, these functions always return a specific sparse format. For
> example spidentity() always returns a csc_matrix, spkron() always
> returns a coo_matrix, etc. Currently, a user who wanted the identity
> matrix in CSR format would have to do a CSC->CSR conversion on the
> result of spidentity(). This is somewhat wasteful since the
> spidentity() could easily have generated the CSR format instead. It
> would be better to allow the user to specify the desired return type
> in the function call. For example,
> spidentity(n, dtype='d',format='csr')
> instead of
> spidentity(n, dtype='d').tocsr()
> Sometimes a given function has a very natural return type. For
> instance, when we have a dia_matrix() implementation (I'm working on
> one) then spdiags() would naturally use this format. If the user
> specified another type, spdiags( ..., format='csr') then spdiags()
> would, at worst, create the matrix in DIA format first and then
> convert to CSR (with dia_matrix.tocsr() ). I like this approach
> because it allows the implementation to be clever when cleverness is
> possible, but also doesn't place an undue burden on the programmer
> when implementing a new method. Furthermore, it shields the user from
> internal implementation changes that might change the default return
> format.
This sounds like the right approach: in my mind, any operation should
take the least amount of time possible. I.e. if a function needs to
convert a sparse array to csr internally, then don't bother converting
it back to the original type when returning. A user should string
together a whole bunch of operations, and only at the end do a single
conversion to the required array type.
> I propose the following policy. Functions get a new parameter
> 'format' which defaults to None. The default implies that the
> function will return the matrix in whatever format is most natural
> (and subject to change). For example:
> spidentity(n, dtype='d',format=None)
> might return a dia_matrix(), or a special identity matrix format in
> the future. At a minimum, the valid values of 'format' will include
> the three-letter abbreviations of the currently supported sparse
> matrix types (i.e. 'csr', 'csc', 'coo', 'lil', etc). Comments?
Sounds good!
> Also, feel free to respond with any other comments related to
> scipy.sparse
At the moment, IIRC, functionality for different kinds of sparse
arrays are located in the same classes, separated with if's. I would
like to see the different classes pulled completely apart, so the only
overlap is in common functionality.
I'd also like to discuss the in-place memory assignment policy. When
do we copy on write, and when do we return views? For example, taking
a slice out of a lil_matrix returns a new sparse array. It is
*possible* to create a view, but it gets a bit tricky. If each array
had an "origin" property, such views could be trivially constructed,
but it still does not cater for slices like x[::2].
Regards
Stéfan
More information about the SciPy-Dev
mailing list