[SciPy-Dev] RFC: sparse DOK array

Evgeni Burovski evgeny.burovskiy at gmail.com
Thu Apr 14 16:14:00 EDT 2016


Hi Nathaniel,

> - For general operations on n-dimensional sparse arrays, COO format
> seems like the obvious thing to use. The strategy I've envisioned is
> storing the data as N arrays of indices + 1 array of values (or
> possibly you'd want to pack these together into a single array of
> (struct containing N+1 values)), plus a bit of metadata indicating the
> current sort order -- so sort_order=(0, 1) would mean that the items
> are sorted lexicographically by row and then by column (so like CSR),
> and sort_order=(1, 0) would mean the opposite. Then operations like
> sum(axis=1) would be "sort by axis 1 (if not already there); then sum
> over each contiguous chunk of the data"; sum(axis=(0, 2)) would be
> "sort to make sure axes 0 and 2 are in the front (but I don't care in
> which order)", etc. It may be useful to use a stable sort, so that if
> you have sort_order=(0, 1, 2) and then sort by axis 2, you get
> sort_order=(2, 0, 1). For binary operations, you ensure that the two
> arrays have the same sort order (possibly re-sorting the smaller one),
> and then do a parallel iteration over both of them finding matching
> indices.
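
For concreteness, here is how I read the reduction part, as a toy
numpy sketch (all names are mine and purely illustrative):

    import numpy as np

    def sum_along(coords, values, axis, shape):
        """Sum an n-D COO array over `axis`: sort so that the kept axes
        lead, then segment-sum over runs of identical kept coordinates."""
        kept = [a for a in range(len(shape)) if a != axis]
        # np.lexsort treats its *last* key as primary, hence the reversal
        order = np.lexsort(tuple(coords[a] for a in reversed(kept)))
        keys = np.stack([coords[a][order] for a in kept])
        vals = values[order]
        # starts of the contiguous chunks sharing all kept coordinates
        changed = np.any(keys[:, 1:] != keys[:, :-1], axis=0)
        starts = np.concatenate(([0], np.nonzero(changed)[0] + 1))
        return keys[:, starts], np.add.reduceat(vals, starts)

    # 2x3 array with entries (0,1)=2, (1,0)=3, (1,2)=4; sum over axis 1
    coords = [np.array([0, 1, 1]), np.array([1, 0, 2])]
    vals = np.array([2., 3., 4.])
    print(sum_along(coords, vals, axis=1, shape=(2, 3)))
    # -> (array([[0, 1]]), array([2., 7.]))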

> - Instead of proliferating data types for all kinds of different
> storage formats, I'd really prefer to see a single data type that
> "just works". So for example, instead of having one storage format
> that's optimized for building matrices and one that's optimized for
> operations on matrices, you could have a single object which has:
>
>   - COO format representation
>   - a DOK of "pending updates"
>
> and writes would go into the DOK part, and at the beginning of each
> non-trivial operation you'd call self._get_coo_arrays() or whatever,
> which would do
>
>   if self._pending_updates:
>       # do a single batch reallocation and incorporate all the pending updates
>   return self._real_coo_arrays
>
> There are a lot of potentially viable variants on this using different
> data structures for the various pieces, different detailed
> reallocation strategies, etc., but you get the idea.


Thanks for sharing your perspective!

I really like the idea of the DOK buffer of pending updates.
I'll note that the COO structure you describe looks similar to CJ's
SpArray, which he mentioned in a parallel thread, and my MapArray
could serve as the buffer. Tacking them together seems easy, so one
could fairly easily test-drive what you describe -- something like
the sketch below.
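
A minimal sketch of gluing the two together (names here are
illustrative, not taken from the actual SpArray or MapArray code):

    import numpy as np

    class HybridSparse(object):
        """COO storage plus a DOK buffer of pending updates.

        A sketch only: duplicate handling and dtype logic are omitted.
        """
        def __init__(self, shape):
            self.shape = shape
            self._coords = np.empty((len(shape), 0), dtype=np.intp)
            self._values = np.empty(0)
            self._pending = {}              # index tuple -> value

        def __setitem__(self, index, value):
            self._pending[index] = value    # cheap write, no reallocation

        def _get_coo_arrays(self):
            if self._pending:
                # one batch reallocation incorporating all pending updates
                idx = np.array(list(self._pending), dtype=np.intp).T
                val = np.array(list(self._pending.values()))
                self._coords = np.hstack([self._coords, idx])
                self._values = np.concatenate([self._values, val])
                self._pending.clear()
            return self._coords, self._values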

Looking a bit further, though, I doubt that a one-size-fits-all data
format is realistic. For one, there is a balance to strike between an
optimized 2D structure (e.g., PySparse's ll_mat) and a general n-D
structure, and there will always be at least one application which
requires one or the other. (My crude experiments so far show about a
factor of 2--5 difference in CPU time and/or memory footprint,
depending on the tradeoffs.)

Ultimately then, I think it makes sense to consider three separate
layers: a python frontend, a low-level data structure, and a middle
layer which transforms, e.g., the logic required by python's
__setitem__ into iterators over the low-level data structure.
For shaping up this middle layer, a std::map is fine, especially
because it's already implemented :-). A toy rendering of the layering
is below.
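
Here is that toy version in pure python, with a dict standing in for
the std::map backend (all names illustrative):

    class MapBackend(object):
        """Low-level layer: an ordered map from index tuples to values."""
        def __init__(self):
            self._map = {}
        def set(self, idx, value):
            self._map[idx] = value
        def items(self):
            # iterate in lexicographic index order, as std::map would
            return iter(sorted(self._map.items()))

    class Frontend(object):
        """Python layer; __setitem__ goes through the middle layer."""
        def __init__(self, shape):
            self.shape = shape
            self._backend = MapBackend()
        def __setitem__(self, index, value):
            # middle layer: canonicalize the index, then hand it down
            idx = []
            for i, n in zip(index, self.shape):
                if i < 0:
                    i += n
                if not 0 <= i < n:
                    raise IndexError("index out of bounds")
                idx.append(i)
            self._backend.set(tuple(idx), float(value))

    a = Frontend((3, 4))
    a[-1, 0] = 5.0
    print(list(a._backend.items()))   # [((2, 0), 5.0)]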


> For interface with external C/C++/Fortran code that wants CSR or CSC,
> I'd just have get_csr() and get_csc() methods that return a tuple of
> (indptr, indices, data), and not even try to implement any actual
> operations on this representation.
>
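
For reference, that conversion is only a few lines of numpy (a
sketch; it assumes the triples contain no duplicate entries):

    import numpy as np

    def coo_to_csr(rows, cols, data, n_rows):
        """Return an (indptr, indices, data) tuple from COO triples."""
        order = np.lexsort((cols, rows))       # row-major sort
        rows, cols, data = rows[order], cols[order], data[order]
        counts = np.bincount(rows, minlength=n_rows)
        indptr = np.concatenate(([0], np.cumsum(counts)))
        return indptr, cols, data

    rows, cols = np.array([1, 0, 1]), np.array([0, 2, 2])
    data = np.array([3., 2., 4.])
    print(coo_to_csr(rows, cols, data, n_rows=2))
    # -> (array([0, 1, 3]), array([2, 0, 2]), array([2., 3., 4.]))
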
>> * Data types and casting rules. For now, I basically piggy-back on
>> numpy's rules.
>> There are several slightly different ones (numba has one?), and there might be
>> an opportunity to simplify the rules. OTOH, inventing one more subtly different
>> set of rules might be a bad idea.
>
> It would be really nice if you could just use numpy's own dtype
> objects and ufuncs instead of implementing your own. I don't know how
> viable that is right now, but I think it's worth trying, and if you
> can't then I'd like to know what part broke :-). Happy to help if you
> need some tips / further explanation.

For ufuncs, it's simple: it does not work ATM :-( . I tried simple
combinations of __array__, __array_prepare__ and __array_finalize__,
and they fail for one reason or another. Some details are available in
https://github.com/ev-br/sparr/issues/35
In a nutshell, these could probably be made to work with some rather
simple modifications to __array_prepare__/__array_wrap__, but I guess
that's not very likely to happen given that __numpy_ufunc__ is
supposed to come to town, which I'm looking forward to :-).
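
For reference, the hooks in question look roughly like this on a toy
class; note that it only "works" by densifying through __array__,
which is of course part of the problem:

    import numpy as np

    class Toy(object):
        """A toy class wired into the __array__/__array_wrap__ protocol."""
        __array_priority__ = 10.0    # ask numpy to prefer our wrap method

        def __init__(self, data):
            self._data = np.asarray(data)

        def __array__(self, dtype=None):
            # ufuncs densify the object through this hook...
            a = self._data
            return a if dtype is None else a.astype(dtype)

        def __array_wrap__(self, result, *args):
            # ...and hand the raw ndarray back here for re-wrapping
            return Toy(result)

    print(type(np.sin(Toy([0.0, 1.0]))))   # <class '__main__.Toy'>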

For dtypes, I currently just dispatch on numpy typecodes. Do you mean
that, or do you mean actual PyArray_Descr objects? Those I'm not
using, and I do not see how to use them, likely because I simply do
not understand what they actually are :-). In that department, I'd
really appreciate an explanation!
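
For what it's worth, my current understanding is that the dtype
objects visible from python wrap exactly those PyArray_Descr structs,
and that they already carry the casting rules I dispatch on by
typecode:

    import numpy as np

    dt = np.dtype('f8')                    # wraps a PyArray_Descr
    print(dt.char, dt.kind, dt.itemsize)   # 'd', 'f', 8
    print(np.promote_types('f4', 'i8'))    # float64: numpy's own rules
    print(np.can_cast('i8', 'f4'))         # False under "safe" casting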

Cheers,

Evgeni


