[SciPy-dev] sparse comments and questions

Sun Jul 8 06:51:40 EDT 2007

On 7/8/07, Nathan Bell <wnbell at gmail.com> wrote:
> Currently the sparse matrix storage formats (e.g.  sparse.csr_matrix )
> only support integer indices ( numpy.intc ) and floating point data
> (float32, float64, complex64, and complex128).  While this addresses
> the common usage, it would be nice to support other index and data
> formats also.  For example, when using a csr_matrix to represent a
> graph the dtype is mostly irrelevant, so smaller types like uint8
> would be desirable.  Likewise, on a 64-bit machine, one might like to
> use 32-bit indices (instead of the default C-int size) to economize on
> space.
>
> The sparsetools routines are fully templated, so it's a simple matter
> to support the other types there.  However, I don't know whether other
> subpackages that use sparse matrices would require changes.  Any
> thoughts?

Yes, good idea. I don't imagine preserving compatibility with existing
packages would be difficult.

> I have a few other questions and comments regarding the sparse package.
>
> 1) Does anyone depend on the .ftype attribute of the sparse formats?
> Can it be eliminated?  If not, can it be trapped by __getattr__() so
> that it doesn't become unsynchronized from data.dtype?

As far as I'm concerned it can be eliminated. But it's been around for
years (before I started working on sparse) and I have no idea what it
does. Travis?

> 3) Should we allow modifications to CSR and CSC matrices that force
> O(N) updates/reallocations.  For instance, adding a new value into a
> CSR matrix is essentially as expensive as building a new one from
> scratch.  Could we instead prohibit such modifications and raise an
> exception informing the user that lil_matrix is the proper format for
> such operations?  Note that changing an existing matrix value is not
> so costly (typically O(1)), so not all calls to __setitem__ would be
> banned.

Good point. There's an argument to be made for protecting users from
themselves. Might there be a need for the occasional O(N) tweak to an
existing CSR/C matrix? I don't know. I suggest we spin off this
functionality from __setitem__ into a separate module-level function
called modify_cs_matrix_for_masochists().

> 4) lil_matrix now supports more sophisticated slicing, but the
> implementation is pretty hairy and not especially fast.  Is anyone
> looking at this?  Does the slicing work correctly?  I fixed a bug
> related to negative indices, but I'm not completely confident in the
> robustness of my fix.

The implementation seems cleaner now as a result of your work. To have
more confidence in the slicing we also want more unit tests ...

> 5) Scipy's spdiags() works differently than the MATLAB version.
> Should it be changed to coincide?  If we later support a sparse
> diagonal format, should the output of spdiags change to dia_matrix
> instead of the present csc_matrix?

Yes, I think so.

> 6) When working with sparse matrices, do people expect the result to
> be of the same type?  For example, it's far more efficient to make
> transpose(csr_matrix) output a csc_matrix (as is currently done in
> SciPy) since that's a O(1) operation (as opposed to an O(n) cost
> otherwise).  So in general, should a method to return the result in
> most efficient output format, relying on the user to convert it to
> another format if so desired?  I.e. if you need a csr_matrix at a
> given point, then you should call .tocsr() to be sure.  If it's
> already a csr_matix, then no conversion will be done.

Yes, definitely.

> 7)  Where do functions that operate on sparse matrices belong?  For
> example, suppose I wrote a function to extract the lower diagonal
> entries of a matrix.  Should it be added to sparse.py (like spdiags)
> or somewhere else?

Yes, sparse.py is a good place for it.

-- Ed