[SciPy-Dev] RFC: sparse DOK array

Evgeni Burovski evgeny.burovskiy at gmail.com
Mon Mar 28 06:29:26 EDT 2016


Hi,

TL;DR: Here's a draft implementation of a 'sparse array', suitable for
incremental
assembly. Comments welcome!
https://github.com/ev-br/sparr

Scipy sparse matrices are generally great. One area where they are not
very great
is the formats suitable for incremental assembly (esp LIL and DOK
matrices). This
has been discussed for a while, see, for instance,
https://github.com/scipy/scipy/pull/2908#discussion_r16955546

I spent some time to see how hard is it to build a dictionary-of-keys
data structure without Python overhead. Turns out, not very hard.

Also, when constructing a sparse data structure now, it seems to make sense
to make it a sparse *array* rather than *matrix*, so that it supports both
elementwise and matrix multiplication.

A draft implementation is available at https://github.com/ev-br/sparr and
I'd be grateful for any comments, large or small.

Some usage examples are in the README file which Github shows at the
front page above.

First and foremost, I'd like to gauge interest in the community ;-).
Does it actually make sense? Would you use such a data structure? What is
missing in the current version?

Short to medium term, some issues I see are:

* 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types.
I wonder if this is purely backwards compatibility. Naively, it seems to me
that a new class could just always use 64-bit indices, but this might
be too naive?

* Data types and casting rules. For now, I basically piggy-back on
numpy's rules.
There are several slightly different ones (numba has one?), and there might be
an opportunity to simplify the rules. OTOH, inventing one more subtly different
set of rules might be a bad idea.

* "Object" dtype. So far, there isn't one. I wonder if it's needed or having
only numeric types would be enough.

* Interoperation with numpy arrays and other sparse matrices. I guess
__numpy_ufunc__ would *the* solution here, when available.
For now, I do something simple based on special-casing and __array_priority__.
Sparse matrices almost work, but there are glitches.

This is just a non-exhaustive list of things I'm thinking about. If there's
something else (and I'm sure there is plenty!) I'm all ears --- either
in this email thread or in the issue tracker.
If someone feels like commenting on implementation details, sure, none of those
is cast in stone and plumbing is definitely in a bit of a flux.

Cheers,

Evgeni



More information about the SciPy-Dev mailing list