[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

David Cournapeau cournape at gmail.com
Mon May 14 16:54:09 EDT 2012


On Mon, May 14, 2012 at 5:31 PM, mark florisson
<markflorisson88 at gmail.com> wrote:

> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no>
> wrote:
> > On 05/11/2012 03:37 PM, mark florisson wrote:
> >>
> >> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no>
> >> wrote:
> >>>
> >>> (NumPy devs: I know, I get too many ideas. But this time I *really*
> >>> believe in it, I think this is going to be *huge*. And if Mark F.
> >>> likes it, it's not going to be without manpower; and as his mentor
> >>> I'd pitch in too here and there.)
> >>>
> >>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly
> >>> don't
> >>> want to micro-manage your GSoC, just have your take.)
> >>>
> >>> Travis, thank you very much for those good words in the "NA-mask
> >>> interactions..." thread. It put most of my concerns away. If anybody
> >>> is leaning towards opaqueness because of its OOP purity, I want to
> >>> refer to C++ and its walled garden of ideological purity -- it has,
> >>> what, 3-4 different OOP array libraries, none of which is able to
> >>> out-compete the others. Meanwhile the rest of the world happily
> >>> cooperates using pointers, strides, CSR and CSC.
> >>>
> >>> Now, there are limits to what you can do with strides and pointers.
> >>> No one's denying the need for more. In my mind that's an API where
> >>> you can do fetch_block and put_block of cache-sized, N-dimensional
> >>> blocks on an array; but it might be something slightly different.
> >>>
> >>> Here's what I'm asking: DO NOT simply keep extending ndarray and the
> >>> NumPy C API to deal with this issue.
> >>>
> >>> What we need is duck-typing/polymorphism at the C level. If you keep
> >>> extending ndarray and the NumPy C API, what we'll have is a
> >>> one-to-many relationship: one provider of array technology, multiple
> >>> consumers (with hooks, I'm sure, but all implementations of the hook
> >>> concept in the NumPy world I've seen so far are a total disaster!).
> >>>
> >>> What I think we need instead is something like PEP 3118 for the
> >>> "abstract" array that is only available block-wise with getters and
> >>> setters. On the Cython list we've decided that what we want for CEP
> >>> 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our
> >>> own fields; we could create CEP 1001 to solve this issue and make
> >>> any Python object an exporter of "block-getter/setter-arrays"
> >>> (better name needed).
> >>>
> >>> What would be exported is (of course) a simple vtable:
> >>>
> >>> typedef struct {
> >>>     int (*get_block)(void *ctx, ssize_t *upper_left,
> >>>                      ssize_t *lower_right, ...);
> >>>     ...
> >>> } block_getter_setter_array_vtable;
> >>>
> >>> Let's please discuss the details *after* the fundamentals. But the
> >>> reason I put void* there instead of PyObject* is that I hope this
> >>> could be used beyond the Python world (say, Python<->Julia); the
> >>> void* would be handed to you at the time you receive the vtable
> >>> (however we handle that).
> >>
> >>
> >> I suppose it would also be useful to have some way of predicting the
> >> output format polymorphically for the caller. E.g. dense *
> >> block_diagonal results in block diagonal, but dense + block_diagonal
> >> results in dense, etc. It might be useful for the caller to know
> >> whether it needs to allocate a sparse, dense or block-structured
> >> array. Or maybe the polymorphic function could even do the allocation.
> >> This needs to happen recursively of course, to avoid intermediate
> >> temporaries. The compiler could easily handle that, and so could numpy
> >> when it gets lazy evaluation.
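> >>
> >> A toy sketch of such format prediction -- the format tags and the
> >> promotion table here are completely made up, just to illustrate the
> >> idea:
> >>
> >> PROMOTION = {
> >>     ('dense', 'dense'): 'dense',
> >>     ('dense', 'block_diagonal'): 'dense',
> >>     ('block_diagonal', 'dense'): 'dense',
> >>     ('block_diagonal', 'block_diagonal'): 'block_diagonal',
> >> }
> >>
> >> def predict_format(op, fmt_a, fmt_b):
> >>     # elementwise multiplication preserves the sparser operand's
> >>     # structure, e.g. dense * block_diagonal -> block_diagonal ...
> >>     if op == 'mul' and 'block_diagonal' in (fmt_a, fmt_b):
> >>         return 'block_diagonal'
> >>     # ... while addition densifies, e.g. dense + block_diagonal -> dense
> >>     return PROMOTION[fmt_a, fmt_b]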
> >
> >
> > Ah. But that also depends on the computation to be performed: a)
> > elementwise, b) axis-wise reductions, c) linear algebra...
> >
> > In my oomatrix code (please don't look at it, it's shameful) I do this
> > using multiple dispatch.
> >
> > I'd rather ignore this for as long as we can, only implementing
> > "a[:] = ..." -- I can't see how decisions here would trickle down to
> > the API that's used in the kernel, it's more like a pre-phase, and
> > better treated orthogonally.
> >
> >
> >> I think if the heavy lifting of allocating output arrays and exporting
> >> these arrays works in numpy, then support in Cython could use that (I
> >> can already hear certain people object to more complicated array stuff
> >> in Cython :). Even better here would be an external project that each
> >> of our projects could use (I still think the nditer sorting
> >> functionality of arrays should be numpy-agnostic and externally
> >> available).
> >
> >
> > I agree with the separate project idea. It's trivial for NumPy to
> > incorporate that as one of its methods for exporting arrays, and I
> > don't think it makes sense to either build it into Cython or outright
> > depend on NumPy.
> >
> > Here's what I'd like (working title: NumBridge?).
> >
> >  - Mission: Be the "double* + shape + strides" in a world where that
> > is no longer enough, by providing tight, focused APIs/ABIs that are
> > usable across C/Fortran/Python.
> >
> > I basically want something I can quickly acquire from a NumPy array, then
> > pass it into my C code without dragging along all the cruft that I don't
> > need.
> >
> >  - Written in pure C + specs, usable without Python
> >
> >  - PEP 3118 "done right", basically semi-standardize the internal Cython
> > memoryview ABI and get something that's passable on the stack
> >
> >  - Get block get/put API
> >
> >  - Iterator APIs
> >
> >  - Utility code for exporters and clients (iteration code, axis
> > reordering, etc.)
> >
> > Is the scope of that insane, or is it at least worth a shot to see how
> > bad it is? Beyond figuring out a small subset that can be done first,
> > and whether performance considerations must be taken into account or
> > not, there are two complicating factors: pluggable dtypes and memory
> > management. Perhaps you could come to Oslo for a couple of days to
> > brainstorm...
> >
> > Dag
>
> The blocks are a good idea, but I think they get fairly complicated for
> complicated matrix layouts. It would be nice to have something that is
> reasonably efficient for at least most of the array storage mechanisms.
> I'm going to do a little brain dump below; let's see if anything is
> useful :)
>
> What if we basically take the CSR format and make it a little simpler,
> easier to handle, and better suited for other layouts. Basically, keep
> shape/strides for dense arrays, and for blocked arrays just "extend"
> your number of dimensions, i.e. a 2D blocked array becomes a 4D array,
> something like this:
>
> >>> import numpy as np
> >>> a = np.arange(4).repeat(4).reshape(4, 4)
> >>> a
> array([[0, 0, 0, 0],
>        [1, 1, 1, 1],
>        [2, 2, 2, 2],
>        [3, 3, 3, 3]])
>
> >>> a.shape = (2, 2, 2, 2)
> >>> itemsize = a.dtype.itemsize
> >>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
> >>> a
> array([[[[0, 0],
>          [1, 1]],
>
>         [[0, 0],
>          [1, 1]]],
>
>
>        [[[2, 2],
>          [3, 3]],
>
>         [[2, 2],
>          [3, 3]]]])
>
> >>> print a.flatten()
> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
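>
> Individual blocks then fall out of plain indexing, e.g. the (0, 1)
> block:
>
> >>> a[0, 1]
> array([[0, 0],
>        [1, 1]])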
>
> Now, given some buffer flag (PyBUF_Sparse or something), basically use
> suboffsets with indirect dimensions, where only the last dimension is
> ever a row of contiguous memory (the entire thing may be contiguous,
> but the point is that you don't care). This row may
>
>    - be variable sized
>    - have either a single "column start" (e.g. diagonal,
> band/tri-diagonal, block diagonal, etc.), OR
>    - have a list of column indices, one per element in the row (like
> in the CSR format)
>
> The length of each innermost row is then determined by looking at, in
> order:
>    - the extent as specified in the shape list
>    - if -1, and some flag is set, the extent is determined as in CSR,
> i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i] (see the sketch below)
>        -> maybe indices are better than pointers, for serialization,
> GPUs, etc.
>    - otherwise, use either a function pointer or perhaps a list of
> extents
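>
> A minimal sketch of that extent lookup, using index-based rather than
> pointer-based bookkeeping (the names are made up, not an existing API):
>
> import numpy as np
>
> # CSR-style bookkeeping for 4 rows holding 6 stored elements
> indptr = np.array([0, 3, 3, 5, 6])      # row i spans indptr[i]:indptr[i+1]
> indices = np.array([0, 2, 5, 1, 4, 3])  # column index of each element
> data = np.array([1., 2., 3., 4., 5., 6.])
>
> def row_extent(i):
>     # the index-based analogue of row[i + 1] - row[i]
>     return indptr[i + 1] - indptr[i]
>
> def row_view(i):
>     # contiguous slice of the 1D storage holding row i
>     return data[indptr[i]:indptr[i + 1]], indices[indptr[i]:indptr[i + 1]]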
>
> All these details will obviously be abstracted away, allowing for easy
> iteration, but the layout can also be used by ufuncs operating on
> contiguous rows (often, since the actual storage is contiguous and
> stored in a 1D array, some flag could indicate contiguity as an
> optimization for unary ufuncs and flat iteration). Tiled nditer-ation
> could also work here without too much hassle.
>
> When you slice, you add to the suboffset and manipulate the single
> extent or the entire list of extents in that dimension, and the flag
> that says to determine the length through pointer subtraction should
> be cleared (this should probably all happen through vtable functions).
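>
> Roughly, in pseudo-Python (every name here is made up):
>
> EXTENT_FROM_PTRDIFF = 0x1  # "derive row length by pointer subtraction"
>
> def slice_columns(view, start, stop):
>     # shift where each row begins ...
>     view.suboffset += start * view.itemsize
>     # ... clamp the per-row extents in that dimension ...
>     view.extents = [max(min(e, stop) - start, 0) for e in view.extents]
>     # ... and clear the pointer-subtraction flag, since the extents
>     # are now explicit
>     view.flags &= ~EXTENT_FROM_PTRDIFF
>     return view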
>
> An exporter would also be free to use different malloced pointers,
> allowing variable sized array support with append/pop functionality in
> multiple dimensions (if there are no active buffer views).
>
> Random access in the case where a column start is provided is still
> constant time, and done through a[i][j][k - colstart], unless you have
> discontiguous rows, in which case you are allowed a logarithmic search
> (if the size exceeds some threshold). I see scipy.sparse does a linear
> search, which is pretty slow in pathological cases:
>
> import numpy as np
> from scipy import sparse
>
> a = np.random.random((1, 4000000))
> b = sparse.csr_matrix(a)
> %timeit a[0, 1000]
> %timeit b[0, 1000]
>
> 10000000 loops, best of 3: 190 ns per loop
> 10 loops, best of 3: 29.3 ms per loop
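>
> For comparison, a sketch of the logarithmic lookup I have in mind,
> using the csr_matrix internals (data, indices and indptr are real
> scipy.sparse attributes; the helper itself is hypothetical, and assumes
> canonical, i.e. sorted, column indices):
>
> import numpy as np
>
> def csr_getitem(m, i, j):
>     # column indices within a row are sorted, so bisect instead of
>     # scanning linearly
>     start, stop = m.indptr[i], m.indptr[i + 1]
>     k = start + np.searchsorted(m.indices[start:stop], j)
>     if k < stop and m.indices[k] == j:
>         return m.data[k]
>     return 0.0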
>
> Now, depending on the density and operation, the caller may have some
> idea of how to allocate an output array. I'm not sure how to handle
> "insertions" of new elements, but I presume through vtable put/insert
> functions. I'm also not sure how to fit this in with linear algebra
> functionality, other than providing conversions of the view.
>
> I think a disadvantage of this scheme is that you can't reorder your
> axes anymore, and many operations that are easy in the dense case are
> suddenly harder, e.g. this scheme does not allow you to go from a
> CSR-like format to CSC. But I think what this gives us is reasonable
> generality to allow easy use in C/Fortran/Cython compiled code and
> numpy ufunc invocation, as well as allowing efficient-ish storage for
> various kinds of arrays.
>
> Does this make any sense?
>

I think it does, although it is not clear to me how it would generalize to
more than 2 dimensions (i.e. how would you handle a 3D sparse array). Would
you add more levels of indirection?

I have been thinking a bit about related concepts in the context of
generalized sparse arrays (which could be a first step toward numpy arrays
on top of multiple malloc blocks instead of just one), and one of the
solutions I have seen is generalized B-trees/B*-trees. The main appeal to
me is that the underlying storage is still one-dimensional, and the
"magic" happens "only" in the indexing. This has several advantages:

  - getting the low-level block algorithms right is a solved problem
  - it allows for growable arrays in O(log(N)) instead of O(N)
  - one can hope to get near-optimal performance in the common cases (1
and 2D) with degenerate indexing functions.

Two of the references I have been looking at:
  - UB-tree: http://www.scholarpedia.org/article/B-tree_and_UB-tree
  - "Storing Matrices on Disk: Theory and Practice Revisited" (Yi Zhang,
Kamesh Munagala and Jun Yang)

I have been meaning to implement some of those ideas and do some basic
benchmarks (especially to compare against CSR and CSC in 2 dimensions, in
which case one could imagine building the index function in such a way
that range queries map to contiguous blocks of memory).
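
As a toy illustration of the kind of index function the UB-tree is built
around, here is a 2D Z-order (Morton) mapping; the helper names are mine,
not from the references above:

def interleave_bits(x):
    # spread the low 16 bits of x apart, inserting a zero bit between
    # adjacent bits (0b1011 -> 0b01000101)
    x &= 0x0000FFFF
    x = (x ^ (x << 8)) & 0x00FF00FF
    x = (x ^ (x << 4)) & 0x0F0F0F0F
    x = (x ^ (x << 2)) & 0x33333333
    x = (x ^ (x << 1)) & 0x55555555
    return x

def z_index(i, j):
    # interleave the bits of the row and column indices; a range query
    # over a small (i, j) rectangle then touches a small number of
    # contiguous runs in the 1D storage
    return (interleave_bits(i) << 1) | interleave_bits(j)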

cheers,

David