[SciPy-dev] Sparse data structures.
Nathan Bell
wnbell at gmail.com
Tue Apr 8 00:54:05 EDT 2008
On Mon, Apr 7, 2008 at 2:15 PM, Viral Shah
<vshah at interactivesupercomputing.com> wrote:
> Offering 5 choices to the naive programmer who doesn't understand the
> details about sparse matrix implementations may make Scipy look
> daunting. I am wondering whats the best way to address this ? I
Actually, there are 7 choices now :)
http://projects.scipy.org/scipy/scipy/browser/trunk/scipy/sparse
I've made an attempt to explain the rationale for each format in the
docstrings of each class. Ideally we'd have a nice wiki page to
describe the formats in greater detail. I don't have a great deal of
time at my disposal currently, but I would contribute to such a page
when I have time.
> summarize my understanding of the different formats below. I'd love to
> have a conversation with someone on how to make this useful in a
> coherent way (one default data structure, for example), or at the very
> least, update the documentation with some information to guide someone.
Most of the formats are directly related to those in SPARSKIT:
http://www-users.cs.umn.edu/~saad/software/SPARSKIT/sparskit.html
You may find the SPARSKIT documentation helpful:
http://www-users.cs.umn.edu/~saad/software/SPARSKIT/paper.ps
I don't know of a "one size fits all" sparse format, so having a
default would be dangerous. IMO MATLAB's use of CSC by default *is*
simple, but that simplicity comes at the cost of completely miserable
performance in many situations.
> For row/col indexing, CSR and CSC formats can be decent - especially
> if you are extracting columns from CSC and rows from CSR. Random
> access is decent, but not the greatest. These formats are also good
> for matrix-vector multiplication, the workhorse of many iterative
> solvers - but one could use a library like OSKI for that. CSR/CSC are
> terrible for assembly and assignment. They can be decent for
> concatenation along the preferred dimension.
>
> A dictionary representation is great for random access, assembly, and
> concatenation, but not very efficient for submatrix indexing and
> assignment. Co-ordinate representation can be good for operations,
> given that the right rule for combining duplicates is used. The good
> thing about both these representations is that they can be extended to
> N-d sparse arrays.
>
> The list of lists approach lies somewhere in the middle. I am assuming
> these lists are kept sorted. The main drawback is the excessive memory
> usage to store all the pointers. I haven't yet looked at the code, and
> I could be wrong.
You assessment looks correct to me. Currently LIL and DOK are a
little too slow for constructing anything large-scale IMO (despite
their asymptotic complexity).
I highly recommend that you install the development versions of NumPy
and SciPy. The sparse module has seen a number of improvements since
the last release.
--
Nathan Bell wnbell at gmail.com
http://graphics.cs.uiuc.edu/~wnbell/
More information about the SciPy-Dev
mailing list