[SciPy-Dev] RFC: sparse DOK array

Nathaniel Smith njs at pobox.com
Wed Mar 30 01:28:55 EDT 2016


On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski
<evgeny.burovskiy at gmail.com> wrote:
> Hi,
>
> TL;DR: Here's a draft implementation of a 'sparse array', suitable for
> incremental assembly. Comments welcome!
> https://github.com/ev-br/sparr
>
> Scipy sparse matrices are generally great. One area where they are not
> so great is in the formats suitable for incremental assembly (esp. the
> LIL and DOK matrices). This has been discussed for a while; see, for instance,
> https://github.com/scipy/scipy/pull/2908#discussion_r16955546
>
> I spent some time to see how hard it is to build a dictionary-of-keys
> data structure without Python overhead. Turns out, not very hard.
>
> Also, when constructing a sparse data structure now, it seems to make sense
> to make it a sparse *array* rather than *matrix*, so that it supports both
> elementwise and matrix multiplication.
>
> A draft implementation is available at https://github.com/ev-br/sparr and
> I'd be grateful for any comments, large or small.
>
> Some usage examples are in the README file which Github shows at the
> front page above.
>
> First and foremost, I'd like to gauge interest in the community ;-).
> Does it actually make sense? Would you use such a data structure? What is
> missing in the current version?

This is really cool! The continuing importance of scipy.sparse is
basically the only reason np.matrix hasn't been deprecated yet.

I've noodled around a little with designs for similar things at
various points, so here are a few general comments based on that, which
you can take or leave :-)

- For general operations on n-dimensional sparse arrays, COO format
seems like the obvious thing to use. The strategy I've envisioned is
storing the data as N arrays of indices + 1 array of values (or
possibly you'd want to pack these together into a single array of
(struct containing N+1 values)), plus a bit of metadata indicating the
current sort order -- so sort_order=(0, 1) would mean that the items
are sorted lexicographically by row and then by column (so like CSC),
and sort_order=(1, 0) would mean the opposite. Then operations like
sum(axis=1) would be "sort by axis 1 (if not already there); then sum
over each contiguous chunk of the data"; sum(axis=(0, 2)) would be
"sort to make sure axes 0 and 2 are in the front (but I don't care in
which order)", etc. It may be useful to use a stable sort, so that if
you have sort_order=(0, 1, 2) and then sort by axis 2, you get
sort_order=(2, 0, 1). For binary operations, you ensure that the two
arrays have the same sort order (possibly re-sorting the smaller one),
and then do a parallel iteration over both of them finding matching
indices.
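To make the sort-then-reduce idea concrete, here's a rough numpy sketch of
sum-over-one-axis in this scheme (function and variable names are mine, not
from sparr; assumes a nonempty array):

```python
import numpy as np

def coo_sum_axis(indices, values, axis):
    """Sum a COO array over one axis: sort so the kept axes are the
    primary keys, then reduce each contiguous run of identical indices.

    indices : list of 1-d integer arrays, one per dimension
    values  : 1-d array of the corresponding values
    """
    ndim = len(indices)
    kept = [ax for ax in range(ndim) if ax != axis]
    # np.lexsort treats its *last* key as most significant, so list the
    # kept axes last (in reverse) to get sort_order == tuple(kept)
    order = np.lexsort([indices[axis]] + [indices[ax] for ax in reversed(kept)])
    idx = np.stack([indices[ax][order] for ax in kept])
    vals = values[order]
    # starts of the contiguous runs sharing the same kept-axis indices
    change = np.any(idx[:, 1:] != idx[:, :-1], axis=0)
    starts = np.concatenate(([0], np.nonzero(change)[0] + 1))
    return idx[:, starts], np.add.reduceat(vals, starts)
```

The same sort step (np.lexsort, which is stable) is what would maintain the
sort_order metadata described above.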

- Instead of proliferating data types for all kinds of different
storage formats, I'd really prefer to see a single data type that
"just works". So for example, instead of having one storage format
that's optimized for building matrices and one that's optimized for
operations on matrices, you could have a single object which has:

  - COO format representation
  - a DOK of "pending updates"

and writes would go into the DOK part, and at the beginning of each
non-trivial operation you'd call self._get_coo_arrays() or whatever,
which would do

  if self._pending_updates:
      # do a single batch reallocation and incorporate all the pending updates
  return self._real_coo_arrays

There are a lot of potentially viable variants on this using different
data structures for the various pieces, different detailed
reallocation strategies, etc., but you get the idea.
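One rough shape such a hybrid object could take (all names here are
hypothetical, and the flush strategy is the simplest possible one -- rebuild
the COO arrays in a single batch):

```python
import numpy as np

class HybridSparse:
    """Sketch of a COO array with a DOK overlay of pending updates."""

    def __init__(self, shape):
        self.shape = shape
        self._coo_indices = np.empty((len(shape), 0), dtype=np.intp)
        self._coo_values = np.empty(0)
        self._pending = {}                    # DOK part: {(i, j, ...): value}

    def __setitem__(self, key, value):
        self._pending[key] = value            # writes only touch the DOK side

    def _get_coo_arrays(self):
        if self._pending:
            # drop COO entries superseded by a pending write
            keep = np.array([tuple(k) not in self._pending
                             for k in self._coo_indices.T], dtype=bool)
            keys = sorted(self._pending)
            new_idx = np.array(keys, dtype=np.intp).T
            new_vals = np.array([self._pending[k] for k in keys])
            # single batch reallocation incorporating all pending updates
            self._coo_indices = np.concatenate(
                [self._coo_indices[:, keep], new_idx], axis=1)
            self._coo_values = np.concatenate(
                [self._coo_values[keep], new_vals])
            self._pending.clear()
        return self._coo_indices, self._coo_values
```

Every non-trivial operation would start by calling _get_coo_arrays(), so the
amortized cost of assembly is one sort/merge per batch of writes rather than
a reallocation per element.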

For interface with external C/C++/Fortran code that wants CSR or CSC,
I'd just have get_csr() and get_csc() methods that return a tuple of
(indptr, indices, data), and not even try to implement any actual
operations on this representation.
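Assuming the internal COO arrays are sorted by (row, col) and deduplicated,
get_csr() reduces to building the indptr array (a hypothetical sketch, not
sparr's actual code):

```python
import numpy as np

def coo_to_csr(row, col, data, n_rows):
    """Build the (indptr, indices, data) CSR triple from sorted COO arrays."""
    indptr = np.zeros(n_rows + 1, dtype=np.intp)
    np.add.at(indptr, row + 1, 1)   # histogram: entries per row, shifted by one
    np.cumsum(indptr, out=indptr)   # running sum turns counts into offsets
    return indptr, col, data
```

get_csc() would be the same with the roles of row and col swapped (after
re-sorting to column-major order).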

> * Data types and casting rules. For now, I basically piggy-back on
> numpy's rules.
> There are several slightly different ones (numba has one?), and there might be
> an opportunity to simplify the rules. OTOH, inventing one more subtly different
> set of rules might be a bad idea.

It would be really nice if you could just use numpy's own dtype
objects and ufuncs instead of implementing your own. I don't know how
viable that is right now, but I think it's worth trying, and if you
can't then I'd like to know what part broke :-). Happy to help if you
need some tips / further explanation.
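For what it's worth, if the values live in a plain ndarray, a fair amount
does come for free -- promotion via np.result_type, and elementwise math on
index-aligned value arrays via the ufunc machinery itself:

```python
import numpy as np

# dtype promotion: just ask numpy instead of re-deriving the rules
assert np.result_type(np.int64, np.float32) == np.float64

# a zero-preserving binary ufunc applied to two index-aligned COO value
# arrays is simply a ufunc call on the values; numpy handles the casting
vals_a = np.array([1, 2, 3], dtype=np.int64)
vals_b = np.array([0.5, 0.5, 0.5], dtype=np.float32)
out = np.multiply(vals_a, vals_b)
assert out.dtype == np.float64
```

The sparse-specific work is then only the index alignment, not the
arithmetic or the casting rules.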

-n


-- 
Nathaniel J. Smith -- https://vorpus.org


