[Numpy-discussion] Generator arrays

Thu Jan 27 19:37:00 EST 2011

On Thu, Jan 27, 2011 at 5:01 PM, Travis Oliphant <oliphant at enthought.com> wrote:

>
> Just to start the conversation, and to find out who is interested, I would
> like to informally propose generator arrays for NumPy 2.0. One use case for
> this concept is the deferred arrays that Mark Wiebe has proposed. But it
> also allows for "compressed arrays", on-the-fly computed arrays, and
> streamed or generated arrays.
>
> Basically, the modification I would like to make is to add an array flag
> (MEMORY). When it is set, the data attribute of a numpy array is a pointer
> to the address in memory where the data begins, and the strides attribute
> points to a C-array of integers (in other words, all current arrays are
> MEMORY arrays).
>
> But, when the MEMORY flag is not set, the data attribute instead points to
> a length-2 C-array of pointers to functions
>
>        [read(N, output_address, self->index_iter, self->extra),
>         write(N, input_address, self->index_iter, self->extra)]
>
> Either of these could then be NULL (i.e. if write is NULL, then the array
> must be read-only).
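
To make the read/write pair concrete, here is a minimal Python sketch of the
idea. It is only a toy model of the proposed C-level layout: the class name
GeneratorArray, the writeable property, and the read_squares callback are
hypothetical names chosen for illustration, not part of NumPy or of the
proposal.

```python
from typing import Any, Callable, List, Optional

# Hypothetical signatures modelled on the quoted read/write pair:
#   read(n, output, index_iter, extra)   fills output[:n]
#   write(n, values, index_iter, extra)  stores values[:n]
ReadFn = Callable[[int, List[float], Any, Any], None]
WriteFn = Callable[[int, List[float], Any, Any], None]


class GeneratorArray:
    """Toy stand-in for an ndarray whose MEMORY flag is not set."""

    def __init__(self, read: Optional[ReadFn], write: Optional[WriteFn],
                 index_iter: Any = None, extra: Any = None) -> None:
        self.data = (read, write)     # the length-2 "array of function pointers"
        self.index_iter = index_iter  # occupies the slot strides would use
        self.extra = extra            # any additional state read/write need

    @property
    def writeable(self) -> bool:
        # A missing write function (NULL in C) means the array is read-only.
        return self.data[1] is not None

    def read(self, n: int) -> List[float]:
        read_fn = self.data[0]
        if read_fn is None:
            raise ValueError("array is write-only")
        out = [0.0] * n
        read_fn(n, out, self.index_iter, self.extra)
        return out

    def write(self, values: List[float]) -> None:
        write_fn = self.data[1]
        if write_fn is None:
            raise ValueError("array is read-only")
        write_fn(len(values), list(values), self.index_iter, self.extra)


def read_squares(n, output, index_iter, extra):
    # Here index_iter is simply a list of positions; extra is unused.
    for k in range(n):
        output[k] = float(index_iter[k] ** 2)


squares = GeneratorArray(read_squares, None, index_iter=list(range(10)))
print(squares.read(4))    # [0.0, 1.0, 4.0, 9.0]
print(squares.writeable)  # False
```

Because no write function is supplied, any attempt to call squares.write()
raises ValueError, which mirrors the "write is NULL means read-only" rule.
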
>
> When the MEMORY flag is not set, the strides member of the ndarray
> structure is a pointer to the index_iter object (which could be anything
> that the particular read and write methods need it to be).
>
> The array structure should also get a member to hold the "extra" argument
> (which would hold any state that the array needs in order to correctly
> perform the read or write operations; e.g., it could hold an execution
> graph for deferred evaluation).
>
> The index_iter structure is anything that the read and write methods need
> to correctly identify *where* to read or write. Now, clearly, we could
> combine index_iter and extra into just one "structure" that holds all the
> state needed for read and write to work correctly. The reason I propose two
> slots is that, at least mentally, in the use case where these structures
> are calculation graphs, one of them is involved in "computing the location
> to read/write" and the other is involved in "computing what to
> read/write".
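
The split between the two slots can be sketched in a few lines of Python. This
is an illustration only: the read function, the deferred_sum callable, and the
use of a plain iterator for index_iter are assumptions made for the example,
not anything specified in the proposal.

```python
def read(n, output, index_iter, extra):
    """Fill output[:n]; index_iter says where to look, extra says what is there."""
    for k in range(n):
        location = next(index_iter)  # "computing the location to read"
        output[k] = extra(location)  # "computing what to read"


# extra: a deferred expression a + b, evaluated only for requested elements.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]


def deferred_sum(i):
    return a[i] + b[i]


# index_iter: the fancy index [4, 0, 2], kept as a lazy view rather than a copy.
fancy = iter([4, 0, 2])

out = [0.0] * 3
read(3, out, fancy, deferred_sum)
print(out)  # [55.0, 11.0, 33.0]
```

Either slot can be swapped without touching the other: the same deferred_sum
works with a different index iterator, and the same fancy index works with a
different computation.
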
>
> The idea is fairly simple, but with some very interesting potential
> features:
>
>        * lazy evaluation (of indexing, ufuncs, etc.)
>        * fancy indexing as views instead of copies (really just another
>          example of lazy evaluation)
>        * compressed arrays
>        * generated arrays (from computation or streamed data)
>        * infinite arrays
>        * computed arrays
>        * missing-data arrays
>        * ragged arrays (shape would be the bounding box, which makes me
>          think of ragged arrays as examples of masked arrays)
>        * arrays that view PIL data.
>
> One could build an array with a (logically) infinite number of elements (we
> could use -2 in the shape tuple to indicate that).
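
A rough Python sketch of a logically infinite array follows. The -2 sentinel is
taken from the suggestion above; the InfiniteArray class, its read method, and
the is_infinite helper are hypothetical names used only for illustration.

```python
INFINITE = -2  # sentinel from the quoted suggestion; not an existing NumPy constant


class InfiniteArray:
    """Toy 1-d array whose elements are generated on demand."""

    def __init__(self, generate):
        self.shape = (INFINITE,)
        self._generate = generate  # hypothetical: maps an index to a value

    def is_infinite(self):
        return INFINITE in self.shape

    def read(self, n, start=0):
        # Only the n requested elements are ever computed.
        return [self._generate(i) for i in range(start, start + n)]


evens = InfiniteArray(lambda i: 2 * i)
print(evens.shape)                # (-2,)
print(evens.read(5))              # [0, 2, 4, 6, 8]
print(evens.read(3, start=1000))  # [2000, 2002, 2004]
```

Any code that assumes a finite element count (allocation, reductions over the
whole axis) would need to check for the sentinel first, which is one place
where the extra complexity shows up.
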
>
> We don't need examples of all of these features for NumPy 2.0 to be
> released, because to really make this useful, we would need to modify all
> "calculation" code to produce a non-MEMORY array. What to do here still
> needs a lot of thought and experimentation.
>
> But I can imagine a situation where all NumPy calculations that produce
> arrays provide the option that, when they are done inside of a particular
> context, a user-supplied behavior overrides the default return. I want to
> study what Mark is proposing and understand his new iterator at a deeper
> level before providing more thoughts here.
>
> That's the gist of what I am thinking about. I would love feedback and
> comments.
>
> The other things I would like to see in NumPy 2.0 that have not been
> discussed lately (that could affect the ABI) are:
>
>        * a geometry member to the data structure (that allows labels for
>          dimensions and axes to be provided, a la data_array)
>        * small array performance improvements that Mark Wiebe has suggested
>          (including the addition of an optional low-level loop that is used
>          when you have contiguous data)
>        * completed datetime implementation
>        * pointer data-types (i.e. the memory location holds a pointer to
>          another part of an ndarray), very useful for "join"-type arrays
>
> If anybody is interested in helping with any of these (and has time to do
> it), let me know. Some of this I could fund (especially if you are willing
> to come to Austin and be an intern for Enthought).
>
> Best regards,
>
>
I'd kind of like to keep arrays simple; they are already pretty complex
objects. Perhaps a higher-level interface to lower-level objects with a
common API would be an easier way to go; that way functionality could be
added piecewise as the need arises. I think it would be good to stick to
need-driven additions, as otherwise it is easy to get sucked into the
quagmire of trying to design for every need and eventuality, and projects
like that never finish.

What happens to the buffer API/persistence with all those additions?

Chuck