[SciPy-Dev] sparse vectors / matrices / tensors

David Cournapeau cournape at gmail.com
Tue Sep 20 18:23:44 EDT 2011


On Tue, Sep 20, 2011 at 6:12 PM, Yannick Versley <yversley at gmail.com> wrote:
> David,
>
> I think the most important tradeoffs are about
>
> - writable vs. read-only; where I assume that writes are batched (so you can
> put
> the contents of writes somewhere and transition it to a compact read-only
> form
> at the next read access)
>
> - only access in lexical order vs. multidimensional access; my own
> implementation
> assumes always access in lexical order (or enumeration in that order), but
> other people (Dan Goodman) have written about adding auxiliary indices to
> facilitate
> column-wise access.
>
> My guess is that there will be something that works well for a commonly
> encountered
> subset of those requirements, but nothing that fits in a one-size-fits-all
> fashion
> if one also considers multidimensional access. (Keeping both the matrix and
> its
> transpose enough is sometimes ok, but fails when you want to modify the
> values).
>>
>> > - scipy.sparse is limited to matrices and has no vectors or order-k
>> > tensors
>> > - LIL and DOK are not really efficient or convenient data structures to
>> > create
>> >   sparse matrices (my own library basically keeps a list of unordered
>> > COO
>> > items
>> >   and compacts/sorts them when the matrix is actually used as a matrix)
>>
>> I think those statementds are pretty uncontroversial. Implementing a
>> sparse array (not limited to 1/2 dimensions) is a non trivial
>> endeavour because the efficient 2d representations (csr/csc) do not
>> generalize well.
>
> Basically, you have dense access where you compute the offset
> of the data from the address itself, and then you have COO where
> you store (address,data) tuples (either separately or together).
> CSR uses a mixture where you use a computed offset for a prefix of
> the address and then store (address,data) tuples for a suffix of the
> address. You could generalize this by always use a computed offset
> for the first dimension of the address and then use (address,data)
> for the rest (which gives you diminishing returns in terms of space
> saved); or you could use some multilevel scheme where you first use
> a computed offset and then look up N-2 (address,offset) pairs (with
> the range end being specified by the offset in the following tuple)
> followed by one last group of (address,data) tuples.
>>
>> I started looking into the literature about those issues, and found a
>> few interesting data structures, such as the UB-tree
>> (http://www.scholarpedia.org/article/B-tree_and_UB-tree). I am
>>
>> actually currently looking into implementing this for a basic sparse
>> tensor to get an idea of its efficiency (memory and speed-wise).
>
> Neat idea. Space-filling curves should give a decent tradeoff for
> multidimensional access. (Using B-trees vs batching up writes is
> yet another case of "it depends" - which would suggest a common
> interface rather than a single common data structure for all sparse
> matrices).

The problem of just defining a common interface is that
implementations will likely only implement some disjoint subsets of
each other (i.e. the current situation), unless the interface is
pretty minimalistic. I don't see such an interface to be very useful
while being data representation independent.

The thing I really like with the UB-tree is that the storage layer is
dimension independent, and that it supports all the basic operations
we need (fast insertion, deletion and random access, and
memory-efficient iteration).

cheers,

David



More information about the SciPy-Dev mailing list