[SciPy-dev] Sparse data structures.

Tue Apr 8 00:54:05 EDT 2008

On Mon, Apr 7, 2008 at 2:15 PM, Viral Shah
<vshah at interactivesupercomputing.com> wrote:

>  Offering 5 choices to the naive programmer who doesn't understand the
>  details about sparse matrix implementations may make Scipy look
>  daunting. I am wondering whats the best way to address this ? I

Actually, there are 7 choices now :)
http://projects.scipy.org/scipy/scipy/browser/trunk/scipy/sparse

I've made an attempt to explain the rationale for each format in the
docstrings of each class.  Ideally we'd have a nice wiki page to
describe the formats in greater detail.  I don't have a great deal of
time at my disposal currently, but I would contribute to such a page
when I have time.

>  summarize my understanding of the different formats below. I'd love to
>  have a conversation with someone on how to make this useful in a
>  coherent way (one default data structure, for example), or at the very
>  least, update the documentation with some information to guide someone.

Most of the formats are directly related to those in SPARSKIT:
http://www-users.cs.umn.edu/~saad/software/SPARSKIT/sparskit.html

You may find the SPARSKIT documentation helpful:
http://www-users.cs.umn.edu/~saad/software/SPARSKIT/paper.ps

I don't know of a "one size fits all" sparse format, so having a
default would be dangerous.  IMO MATLAB's use of CSC by default *is*
simple, but that simplicity comes at the cost of completely miserable
performance in many situations.

>  For row/col indexing, CSR and CSC formats can be decent - especially
>  if you are extracting columns from CSC and rows from CSR. Random
>  access is decent, but not the greatest. These formats are also good
>  for matrix-vector multiplication, the workhorse of many iterative
>  solvers - but one could use a library like OSKI for that. CSR/CSC are
>  terrible for assembly and assignment. They can be decent for
>  concatenation along the preferred dimension.
>
>  A dictionary representation is great for random access, assembly, and
>  concatenation, but not very efficient for submatrix indexing and
>  assignment. Co-ordinate representation can be good for operations,
>  given that the right rule for combining duplicates is used. The good
>  thing about both these representations is that they can be extended to
>  N-d sparse arrays.
>
>  The list of lists approach lies somewhere in the middle. I am assuming
>  these lists are kept sorted. The main drawback is the excessive memory
>  usage to store all the pointers. I haven't yet looked at the code, and
>  I could be wrong.

You assessment looks correct to me.  Currently LIL and DOK are a
little too slow for constructing anything large-scale IMO (despite
their asymptotic complexity).

I highly recommend that you install the development versions of NumPy
and SciPy.  The sparse module has seen a number of improvements since
the last release.

-- 
Nathan Bell wnbell at gmail.com
http://graphics.cs.uiuc.edu/~wnbell/