[Cython] gsoc: array expressions

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue May 22 08:36:25 CEST 2012


On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>
>>> Hey,
>>>
>>> For my GSoC we already have some simple initial ideas, i.e.
>>> elementwise vector expressions (a + b with a and b arrays of
>>> arbitrary rank); I don't think these need any discussion. However,
>>> there are a lot of things that haven't been formally discussed on the
>>> mailing list, so here goes.
>>>
>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>> mailing list, and I think your insights as a Theano developer can be
>>> very helpful in this discussion.
>>>
>>> User Interface
>>> ===========
>>> Besides simple array expressions for dense arrays I would like a
>>> mechanism for "custom ufuncs", although to a different extent than what
>>> Numpy or Numba provide. There are several ways in which we could want
>>> them, e.g. as typed (cdef or external C) functions, as
>>> lambdas or Python functions in the same module, or as general objects
>>> (e.g. functions Cython doesn't know about).
>>> To achieve maximum efficiency it will likely be good to allow sharing
>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>> would prefer annotated def functions where the parameters are
>>> specialized on demand, e.g.
>>>
>>> @elemental
>>> def add(a, b):
>>>     # elemental functions can have any number of arguments
>>>     # and operate on any compatible dtype
>>>     return a + b
>>>
>>> When calling cdef functions or elemental functions with memoryview
>>> arguments, the call performs a (broadcast) elementwise
>>> operation. Alternatively, we can have a parallel.elementwise function
>>> which maps the function elementwise, which would also work for object
>>> callables. I prefer the former, since I think it will read much
>>> more easily.
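
For reference, plain NumPy broadcasting already gives roughly the
semantics described above for untyped arguments; this is just a
Python/NumPy sketch of the intended behaviour, not the proposed
Cython API:

    import numpy as np

    def add(a, b):
        # With NumPy arrays this already broadcasts elementwise; the
        # proposal is the same call syntax, but compiled, when a and b
        # are typed memoryviews.
        return a + b

    a = np.arange(6, dtype=np.float64).reshape(2, 3)  # shape (2, 3)
    b = np.arange(3, dtype=np.float64)                # shape (3,), broadcast
    c = add(a, b)                                     # result shape (2, 3)
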
>>>
>>> Secondly, we can have a reduce function (and maybe a scan function)
>>> that reduces (respectively scans) over a specified axis or set of axes.
>>> E.g.
>>>
>>>      parallel.reduce(add, a, b, axis=(0, 2))
>>>
>>> where the default for axis is "all axes". As for the default value,
>>> this could perhaps be provided optionally to the elemental decorator.
>>> Otherwise, the reducer will have to get the default values from each
>>> dimension that is reduced in, and then skip those values when
>>> reducing. (Of course, the reducer function must be associative and
>>> commutative.) Also, a lambda could be passed in instead of an
>>
>>
>> Only associative, right?
>>
>> Sounds good to me.
>>
>>
>>> elementwise or typed cdef function.
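
Again just for reference, NumPy's ufunc.reduce with a tuple of axes
is roughly the model here; a NumPy sketch, not the proposed
parallel.reduce API:

    import numpy as np

    a = np.random.rand(4, 5, 6)

    # Reduce with an associative binary operation over axes 0 and 2;
    # the result keeps only axis 1, i.e. it has shape (5,).
    r = np.add.reduce(a, axis=(0, 2))
    assert np.allclose(r, a.sum(axis=(0, 2)))
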
>>>
>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>> function, which would iterate over N memoryviews and provide a
>>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>>> should provide only the indices, or also the values. E.g. an in-place
>>> elementwise add would read as follows:
>>>
>>>      for i, j, k in parallel.nditerate(A, B):
>>>          A[i, j, k] += B[i, j, k]
>>
>>
>>
>> I think this sounds good; I guess I don't see a particular reason for
>> "ndenumerate", as I think code like the above is clearer.
>
> I'm assuming the index computations would not be re-done in this case
> (i.e. there's more magic going on here than it looks at first
> glance)? Otherwise there is an advantage to ndenumerate.
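
To make that distinction concrete in NumPy terms (a sketch only): an
index-only iterator leaves the element lookup in the loop body, while
ndenumerate-style iteration hands back the values directly:

    import numpy as np

    A = np.arange(6).reshape(2, 3)

    # Index-only iteration: the loop body does the lookup itself.
    it = np.nditer(A, flags=['multi_index'])
    for _ in it:
        i, j = it.multi_index
        x = A[i, j]          # explicit lookup in the body

    # ndenumerate yields (index, value) pairs, so the lookup is part
    # of the iterator rather than something re-done in the body.
    for (i, j), value in np.ndenumerate(A):
        pass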

Ideally, there is a lot more magic going on, though I don't know how far 
Mark wants to go.

Imagine "nditerate(A, A.T)", in that case it would have to make many 
small tiles so that for each tile being processed, A has a tile in cache 
and A.T has another tile in cache (so that one doesn't waste cache line 
transfers).

So those array lookups would potentially go to different memory
buffers, with the strides known at compile time.
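
A rough illustration of that kind of tiling in plain Python/NumPy;
tiled_inplace_add and the block size bs are made-up names for the
sketch, not what the generated code would actually look like:

    import numpy as np

    def tiled_inplace_add(A, B, bs=64):
        # Process bs-by-bs tiles so that the pieces of A and B touched
        # at any one time both fit in cache, even when B has a
        # transposed (column-major) layout relative to A.
        n, m = A.shape
        for i0 in range(0, n, bs):
            for j0 in range(0, m, bs):
                A[i0:i0+bs, j0:j0+bs] += B[i0:i0+bs, j0:j0+bs]

    A = np.random.rand(500, 500)                     # C-ordered
    B = np.asfortranarray(np.random.rand(500, 500))  # layout like A.T
    tiled_inplace_add(A, B)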

Which raises the question: what about this body?

if i < 100:
     continue
else:
     A[i, j, k] += B[i - 100, j, k]

I guess just fall back to a non-tiled version? One could of course do 
some shifting of which tiles of B to grab etc., but there's a limit to 
how smart one should try to be; one could emit a warning and say that 
one should slice and dice the memoryviews into shape before they are 
passed to nditerate.
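
For the example above, the "slice first" version would look something
like this (NumPy notation, just a sketch); the index shift then
disappears from the body and plain tiling applies again:

    import numpy as np

    A = np.random.rand(300, 4, 5)
    B = np.random.rand(300, 4, 5)

    # "A[i, j, k] += B[i - 100, j, k] for i >= 100", written with
    # slices so that no index shift is left in the loop body:
    A[100:] += B[:-100]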

Dag

