[Cython] gsoc: array expressions

Mon May 28 14:52:35 CEST 2012

On 28 May 2012 13:49, mark florisson <markflorisson88 at gmail.com> wrote:
> On 25 May 2012 21:53, Frédéric Bastien <nouiz at nouiz.org> wrote:
>> Hi,
>>
>>
>> Sorry for the delay, I had some schedule change.
>>
>> thanks for adding me. Should I subscribe to cython-dev? How much email
>> daily there is? I didn't found this on the archives. Fell free to add
>> me in CC again when you think it is appropriate.
>
> There is usually not so much traffic on cython-dev, unless something
> comes up that is debated to the death :)
>
>> I'll reply here to all email at the same time. Do you prefer that I
>> reply to each email individually if this happen again? I'll try to
>> reply faster next time.
>
> No worries, either way works fine, don't worry too much about protocol
> (the only thing to note is that we do bottom posting).
>
>> - About pickling theano, we currently can't pick Theano function. It
>> could be made to work in some cases, but not for all cases as there is
>> hardware dependent optimization in the Theano function. Currently it
>> is mostly CPU vs GPU operation. So if we stay on the CPU, we could do
>> some pickling, but we should make sure that the compiled c code into
>> python module are still there when we unpickle or recompile them.
>>
>> - I think it make sense to make a theano graph from cython ast,
>> optimize and redo a cython ast from the optimized graph. This would
>> allow using Theano optimizations.
>
> Ok, the important thing is that the graph can be pickled, it should be
> pretty straightforward to generate code to build the function again
> from the loaded graph.
>
>> - It also make sense to do the code generation in Theano and reuse it
>> in Cython. But this would make the Theano dependency much stronger.
>> I'm not sure you want this.
>>
>>
>> - Another point not raised, theano need to know at compile time is the
>> dtype, number of dimensions and witch dimensions are broadcastable for
>> each variable. I think that the last one could cause problem, but if
>> you use specialization for the dtype, the same can be done for the
>> broadcsatability of a dimensions.
>
> Hm, that would lead to kind of an explosion of combinations. I think
> we could specialize only on no broadcasting at all (except for
> operands with lesser dimensionality).
>
>> - The compyte(gpu nd array) project do collapsing of dimensions. This
>> is an important optimization on the GPU as doing the index computation
>> in parallel is costlier. I think on the CPU we could probably do
>> collapsing just of the inner dimensions to make it faster.
>>
>> - Theano don't generate intrinsect or assembly, but we suppose that
>> g++ will generate vectorized operation for simple loop. Recent version
>> of gcc/g++ do this.
>
> Right, the aim is definitely to specialize for contiguous arrays,
> where you collapse everything. Specializing statically for anything
> more would be unfeasible, and better handled by a runtime compiler I
> think. For the C backend, I'll start by generating simple C loops and
> see if the compilers vectorize that already.
>
>> - Our generated code for element-wise operation take care a little
>> about the memory access pattern. We swap dimensions to iterate on the
>> dimensions with the smallest strides. But we don't go further.
>>
>> - What do you mean by CSE? Constant  optimization?
>
> Yes, common subexpression elimination and also hoisting of unchanging
> expressions outside the loop.
>
>> Fred
>
> I started a new project, https://github.com/markflorisson88/minivect ,
> which currently features a simple C code generator. The specializer
> and astbuilder do most of the work of creating the right AST, so the
> code generator only has to implement code generation functions for
> simple expressions. Depending on how it progresses I will look at
> incorporating Theano's optimizations into it and having Theano use it
> as a C backend for compatible expressions.

I forgot to mention, it's still pretty basic, but it works for simple
arithmetic expressions with non-overlapping (shifted) memory from
Cython: https://github.com/markflorisson88/cython/commit/2c316abdbc1228597bbdf480f737a59213ee9532#L4R1