[SciPy-Dev] RFC: sparse DOK array

Mon Mar 28 12:13:25 EDT 2016

On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski <evgeny.burovskiy at gmail.com> wrote:

> First and foremost, I'd like to gauge interest in the community ;-).
> Does it actually make sense? Would you use such a data structure? What is
> missing in the current version?
>

This looks awesome, and makes complete sense to me! In particular, xarray
could really use an n-dimensional sparse structure.

A few other small things I'd like to see:
- Support for slicing, even if it's expensive.
- A strict way to set the shape without automatic expansion, if desired
(e.g., if shape is provided in the constructor).
- Default to the dtype of the fill_value. NumPy does this for np.full.
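
To illustrate the last point, here is a small sketch of how np.full picks its
dtype from the fill value when no dtype is given. A sparse constructor could
plausibly mirror this; nothing here is taken from the proposed class.

```python
import numpy as np

# With no explicit dtype, np.full infers it from the fill value.
print(np.full(3, 0.5).dtype)    # float64
print(np.full(3, True).dtype)   # bool

# An explicit dtype still overrides the inferred one.
print(np.full(3, 0.5, dtype=np.float32).dtype)  # float32
```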

> Short to medium term, some issues I see are:
>
> * 32-bit vs 64-bit indices. Scipy sparse matrices switch between index
> types. I wonder if this is purely backwards compatibility. Naively, it
> seems to me that a new class could just always use 64-bit indices, but
> this might be too naive?
>
> * Data types and casting rules. For now, I basically piggy-back on
> numpy's rules. There are several slightly different ones (numba has
> one?), and there might be an opportunity to simplify the rules. OTOH,
> inventing one more subtly different set of rules might be a bad idea.
>

Yes, please follow NumPy.
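
As a rough sketch of what "following NumPy" could mean in practice, the
result dtype of a mixed operation can be looked up with np.result_type
instead of being encoded in a separate table. How the sparse class would
actually call this is an assumption on my part.

```python
import numpy as np

# NumPy's promotion rules for a few mixed-dtype pairs.
print(np.result_type(np.int32, np.float32))      # float64
print(np.result_type(np.int8, np.uint8))         # int16
print(np.result_type(np.float32, np.complex64))  # complex64
```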

> * "Object" dtype. So far, there isn't one. I wonder if it's needed or having
> only numeric types would be enough.
>

This would be marginally useful -- eventually someone is going to want to
store some strings in a sparse array, and NumPy doesn't handle this very
well. Thus pandas, h5py and xarray all end up using dtype=object for
variable length strings. (pandas/xarray even take the monstrous approach of
using np.nan as a sentinel missing value.)
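
A short sketch of the problem, using only NumPy: fixed-width string arrays
silently truncate longer values on assignment, which is why dtype=object
(with np.nan marking missing entries) ends up being the workaround.

```python
import numpy as np

# Fixed-width unicode array: the width is set by the longest initial string.
fixed = np.array(["ab", "cd"])
fixed[0] = "longer"          # does not fit in two characters
print(fixed.dtype, fixed[0])  # the stored value is truncated to "lo"

# Object array: variable-length strings, with np.nan as the missing marker.
variable = np.array(["ab", np.nan, "a much longer string"], dtype=object)
missing = [isinstance(v, float) and np.isnan(v) for v in variable]
print(missing)  # [False, True, False]
```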

> * Interoperation with numpy arrays and other sparse matrices. I guess
> __numpy_ufunc__ would be *the* solution here, when available.
> For now, I do something simple based on special-casing and
> __array_priority__.
> Sparse matrices almost work, but there are glitches.
>

Yes, __array_priority__ is about the best we can do now.
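
For reference, a minimal sketch of what __array_priority__ buys you. The
FilledArray class below is a hypothetical stand-in, not the class from the
RFC: because its priority is higher than ndarray's, `ndarray + FilledArray`
defers to FilledArray.__radd__ instead of letting NumPy handle the operand
itself.

```python
import numpy as np


class FilledArray:
    """Hypothetical stand-in for a sparse array; not the RFC's class."""

    # Higher than ndarray's default priority, so ndarray operators defer to us.
    __array_priority__ = 1000

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return FilledArray(self.value + np.asarray(other))

    # Addition is commutative here, so the reflected version is the same.
    __radd__ = __add__


dense = np.arange(3)
result = dense + FilledArray(10)
print(type(result).__name__, result.value)
```

This only covers the Python operators; a direct call such as
np.add(dense, FilledArray(10)) does not go through this deferral path.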

You could actually use a mix of __array_prepare__ and __array_wrap__ to
make (non-generalized) ufuncs work, e.g., for functions like np.sin:

- In __array_prepare__, return the non-fill values of the array
concatenated with the fill value.
- In __array_wrap__, reshape all but the last element to build a new sparse
array, using the last element for the new fill value.

This would be a neat trick and get you most of what you could hope for from
__numpy_ufunc__.
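
The packing idea can be sketched without the hooks themselves, as a plain
function. The {index: value} dict and the apply_elementwise name are
assumptions for illustration only, not the RFC's API, and the sketch handles
only one-input ufuncs.

```python
import numpy as np


def apply_elementwise(ufunc, data, fill_value):
    """Apply a one-input ufunc to a sparse array stored as {index: value}.

    `data` and `fill_value` are a hypothetical representation.
    """
    keys = list(data)
    # Pack the stored values, followed by the fill value, into one 1-D array.
    packed = np.array([data[k] for k in keys] + [fill_value])
    out = ufunc(packed)
    # All but the last element are the new stored values;
    # the last element is the new fill value.
    new_data = dict(zip(keys, out[:-1].tolist()))
    return new_data, out[-1].item()


data = {(0, 1): 0.0, (2, 3): np.pi / 2}
new_data, new_fill = apply_elementwise(np.cos, data, fill_value=0.0)
print(new_data, new_fill)
```

Using np.cos rather than np.sin here shows why the fill value has to travel
along: cos(0) is 1, so the result has a different fill value than the input.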