[Cython] OpenCL support

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Feb 9 00:28:29 CET 2012


On 02/09/2012 12:15 AM, Dag Sverre Seljebotn wrote:
> On 02/08/2012 11:11 PM, mark florisson wrote:
>> On 8 February 2012 14:46, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 02/05/2012 10:57 PM, mark florisson wrote:
>>>>
>>>> Hey,
>>>>
>>>> I created a CEP for opencl support:
>>>> http://wiki.cython.org/enhancements/opencl
>>>> What do you think?
>>>
>>>
>>> To start with my own conclusion on this, my feeling is that it is too
>>> little
>>> gain, at least for a GPU solution. There's already Theano for trivial
>>> SIMD-stuff and PyOpenCL for the getting-hands-dirty stuff. (Of
>>> course, this
>>> CEP would be more convenient to use than Theano if one is already using
>>> Cython.)
>>
>> Yes, vector operations and elemental or reduction functions operating
>> on vectors (which is what we can use Theano for, right?) don't quite
>> merit the use of OpenCL. However, the upside is that OpenCL allows
>> easier vectorization and multi-threading. We can appeal to
>> auto-vectorizing compilers, but e.g. using OpenMP for multithreading
>> will still segfault the program if used outside the main thread with
>> gcc's implementation. I believe Intel's allows you to use it in any
>> thread. (Of course, keeping a thread pool around and managing it
>> manually isn't too hard, but...)
>>
>>> But that's just my feeling, and I'm not the one potentially signing
>>> up to do
>>> the work, so whether it is "worth it" is really not my decision; the
>>> weighing is done with your weights, not mine. Given an implementation,
>>> I definitely support the inclusion in Cython of these kinds of
>>> features (FWIW).
>>>
>>> First, CPU:
>>>
>>> OpenCL is probably a very good way of portably making use of SSE/AVX
>>> etc. But to really get a payoff, I would think the real value lies in
>>> *not* using OpenCL vector types, just many threads, so that the
>>> OpenCL driver does the dirty work of mapping each thread to a slot in
>>> the CPU registers? That is, the gain in using OpenCL is to emit
>>> scalar code and leave the dirty work to OpenCL. If one does the hard
>>> part anyway and maps variables to vectors and memory accesses to
>>> shuffles, one might as well go the whole length and emit SSE/AVX
>>> rather than OpenCL, and avoid the startup overhead.
>>>
>>> I don't really know how good the Intel and AMD CPU drivers are w.r.t.
>>> this
>>> -- I have seen the Intel driver emit "vectorizing" and "could not
>>> vectorize", but didn't explore the circumstances.
>>>
>>
>> I initially thought the same thing; one would think single kernel
>> invocations should be trivially auto-vectorizable. With Apple's OpenCL
>> on the CPU, though, I am getting better performance with vector types
>> (up to 35%). I would personally consider emitting vector data types
>> bonus points.
>>
>> But I don't quite agree that emitting SSE or AVX directly would be
>> almost as easy in that case. You'd still have to detect at runtime
>> which instruction set is supported and generate SSE, SSE2, (SSE4?) and
>> AVX. And that's not even all of them :) The OpenCL drivers just hide
>> that pain. With handwritten code you might be coding for a specific
>> architecture and might be fine with only SSE2, but as a compiler we
>> can't really make that same decision.
>
> You make good points.
>
>>
>>> Then, on to GPU:
>>>
>>> It is not a general-purpose solution; you still need to bring in
>>> pyopencl for lots of cases, and so the question is how many cases it
>>> fits and whether that is enough to grow a userbase around it. And,
>>> importantly, how much performance is sacrificed for the resulting
>>> user-friendliness. A 50% performance hit is usually OK, 95% maybe
>>> not. And a 95% hit is not unimaginable if the memory movement is done
>>> in a bad way for some code?
>>
>> Yes, I don't expect this to change a lot suddenly. In the long term I
>> think the implementation could be sufficiently good to support at
>> least most codes. And the user still has full control over data
>> movement, if wanted (the pinning thing, which isn't mentioned in the
>> CEP).
>>
>>> I think the fundamental problem is one of programming paradigms.
>>> Fortran,
>>> C++, Cython are all sequential in nature; even with OpenMP it is like
>>> you
>>> have a modest bit of parallelism tacked on to speed up a
>>> sequential-looking
>>> program. With "massively parallel" solutions such as CUDA and OpenCL,
>>> and also MPI in fact, the fundamental assumption is that you have
>>> thousands or hundreds of thousands of threads. And that just changes
>>> how you need to
>>> think about writing code, which would tend to show up at a syntax
>>> level. So,
>>> at least if you want good performance, you need to change your way of
>>> thinking enough that a new syntax (loosely cooperating threads rather
>>> than
>>> parallel-for-loop or SIMD instruction) is actually an advantage, as
>>> it keeps
>>> you reminded of how the hardware works.
>>>
>>> So I think the most important thing to do (if you bother) is: gather
>>> a set of real-world(-ish) CUDA or OpenCL programs, port them to
>>> Cython + this CEP
>>> (without a working Cython implementation for it), and see how that goes.
>>> That's really the only way to evaluate it.
>>
>> I've been wanting to do that for a long time now, also to evaluate the
>> capabilities of cython.parallel as it stands now. It's a really good
>> idea, I'll try to port some codes, and not just the trivial ones like
>> Jacobi's method :).
>>
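
For reference, a "trivial" port of that kind, a single Jacobi sweep,
already maps onto today's cython.parallel; the function and variable
names below are just illustrative, and the idea would be that the CEP
compiles the same loop shape to an OpenCL kernel instead of OpenMP:

# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

def jacobi_sweep(double[:, :] u, double[:, :] u_new):
    # One Jacobi sweep over the interior points; every i is independent,
    # so prange can hand the rows out to separate threads / work-items.
    cdef Py_ssize_t i, j
    cdef Py_ssize_t n = u.shape[0], m = u.shape[1]
    for i in prange(1, n - 1, nogil=True):
        for j in range(1, m - 1):
            u_new[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                  u[i, j - 1] + u[i, j + 1])

The non-trivial ports are of course where the interesting questions are.
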
>>> Some experiences from the single instance GPU code I've written:
>>>
>>> - For starters I had to give up OpenCL and use CUDA to use all the 48 KB
>>> available shared memory on Nvidia compute-capability-2.0 (perhaps I just
>>> didn't find the OpenCL option for that). And increasing from 16 to 48 KB
>>> allowed a fundamentally faster and qualitatively different algorithm
>>> to be
>>> used. But OpenCL vs. CUDA is kind of beside the point here....
>>>
>>> - When mucking about with various "obvious" ports of sequential code
>>> to GPU
>>> code, I got performance in the range of 5 to 20 GFLOP/s (out of 490
>>> GFLOP/s
>>> or so theoretical; NVidia Tesla M2050). Once I really understood the
>>> hardware, and made good use of the 48 KB of thread-shared memory, I
>>> achieved 209 GFLOP/s, without really doing any microoptimization. I
>>> don't think the CEP includes any features for inter-thread
>>> communication, so that's off the table.
>>
>> The CEP doesn't mention barriers (discussed earlier), but they should
>> be supported, and __local memory (that's "shared memory" in CUDA terms
>> right?) could be utilized using a more explicit scheme (or implicitly
>> if the compiler is smart). The only issue with barriers is that with
>> OpenCL you have multiple levels of synchronization: barriers only
>> work within the work group / thread block, whereas with OpenMP a
>> barrier simply applies to all your threads. I think a global barrier
>> would have to mean kernel termination + start of a new one, which
>> could be hard to support depending on where it is placed in the code...
>>
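
To make the work-group scoping concrete, here is roughly what it looks
like one level down, as a pyopencl sketch (kernel and variable names are
made up for illustration): barrier() only synchronizes the work-items of
one group, so a single kernel can only produce one partial sum per group,
and combining the partials needs a second kernel or a host-side pass.

import numpy as np
import pyopencl as cl

src = """
__kernel void group_sum(__global const float *x,
                        __global float *partial,
                        __local float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = x[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);               /* work-group scope only */
    for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];  /* one value per group */
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

n, group = 1024, 64
x = np.random.rand(n).astype(np.float32)
x_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                  hostbuf=x)
partial = np.empty(n // group, dtype=np.float32)
partial_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, partial.nbytes)

prg.group_sum(queue, (n,), (group,), x_buf, partial_buf,
              cl.LocalMemory(4 * group))        # 4 bytes per float
cl.enqueue_copy(queue, partial, partial_buf)
print(partial.sum(), x.sum())   # per-group partials still need combining
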
>>> (My code is here:
>>>
>>> https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in
>>>
>>>
>>> Though it's badly documented and rush-for-deadline-quality; I plan to
>>> polish
>>> it up and publish it when I get time in autumn).
>>>
>>> I guess I mention this as the kind of computation your CEP definitely
>>> does
>>> NOT cover. That's probably OK, but one should figure out specifically
>>> how many use cases it does cover (in particular with no control over
>>> thread blocks and intra-block communication). Is the CEP an
>>> 80%-solution, or a 10%-solution?
>>
>> I haven't looked too carefully at the code, but a large portion is
>> dedicated to a reduction right? What I don't see is how your reduction
>> spans across multiple work-groups / thread blocks? Because
>> __syncthreads should only sync stuff within a single block. The CEP

Most of the time there's actually no explicit synchronization, but the 
code relies on all threads of a warp being on the same instruction in 
the scheduler. __syncthreads is then only used at the end of the
reduction when all within-warp additions have been done. Calling 
__syncthreads at each step of the algorithm would have totally killed 
performance.

Dag Sverre

>
> There's no need to reduce across thread blocks because (conveniently
> enough) there are 8000 independent computations to be performed with
> different parameters. I simply used one thread block for each problem.
>
> It's basically a matrix-vector product where the matrix must be
> generated on the fly columnwise (one entry can be generated from the
> preceding two in the same column), but the summation is row-wise.
>
> And it turns out that getting inter-thread sum-reduction to work well
> was harder than I expected; a 32-by-32 matrix (needed since warps are 32
> threads) is too big to fit in memory, but tree-reduction makes a lot of
> the threads in a warp do nothing. So I ended up with a hybrid approach;
> there's a visual demo from page 49 onwards here:
>
> http://folk.uio.no/dagss/talk-gpusht.pdf
>
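
A quick illustration of the lane-utilization issue with a pure tree
reduction, in plain Python (numbers only, nothing GPU-specific):

warp = 32
active = []
stride = warp // 2
while stride >= 1:
    active.append(stride)   # only `stride` lanes add something this step
    stride //= 2
print(active)               # [16, 8, 4, 2, 1]
busy, total = sum(active), len(active) * warp
print(busy, "of", total, "lane-steps do useful work")   # 31 of 160, ~19%

So in a warp-synchronous setting roughly 80% of the lanes sit idle, which
is why a hybrid scheme can pay off.
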
> Getting back to Cython, I'll admit that this form of inter-thread
> reduction is quite generic, and that my specific problem could be solved
> by basically coding a set of inter-thread reduction algorithms suitable
> for different hardware into Cython.
>
>> didn't mention reductions, but they should be supported (I'm thinking
>> multi-stage or sequential within the workgroup (whatever works
>> better), followed by another kernel invocation if the result is
>> needed).
>
> Multiple kernel invocations for global barriers appear to be pretty
> standard, and it's why OpenCL supports queueing tasks with dependencies
> etc.
>
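
A minimal pyopencl sketch of that pattern, for concreteness (the kernels
and sizes here are made up): with the default in-order queue the second
launch is already ordered after the first, and the explicit wait_for
dependency expresses the same "global barrier" in a way that also works
for out-of-order queues.

import numpy as np
import pyopencl as cl

src = """
__kernel void stage_one(__global float *x) { x[get_global_id(0)] += 1.0f; }
__kernel void stage_two(__global float *x) { x[get_global_id(0)] *= 2.0f; }
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

x = np.zeros(4096, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=x)

evt1 = prg.stage_one(queue, x.shape, None, buf)
# The "global barrier": stage_two only starts once stage_one has finished.
evt2 = prg.stage_two(queue, x.shape, None, buf, wait_for=[evt1])
cl.enqueue_copy(queue, x, buf, wait_for=[evt2])
print(x[:4])                # [ 2.  2.  2.  2.]
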
>> As mentioned earlier in a different thread (on parallelism I think),
>> reduction arrays (i.e. memoryviews or C arrays) as well as generally
>> private arrays should be supported. An issue with that is that you
>> can't really dedicate an array to each work item / thread (too much
>> memory would be consumed).
>>
>> Again, declarations within blocks would solve many problems:
>>
>> cdef float[n] shared_by_work_group
>> with parallel():
>>     cdef float[n] local_to_work_group
>>     for i in prange(...):
>>         cdef float[n] local_to_work_item
>>
>> For arrays, the reductions could be somewhat more explicit, where
>> there is an explicit 'my_memoryview += my_local_scratch_data'. That
>> should probably only be allowed for memory local to the work group.
>>
>> Anyway, I'll try porting some numerical codes to this scheme over the
>> coming weeks and see what is missing and how it can be solved. I still
>> believe it can all be made to work quite properly, without adjusting
>> the language to fit the hardware model. The prange (and OpenMP) model
>> looks like sequential code, but it tells the compiler a lot, namely
>> that each iteration is independent and could therefore be scheduled as
>> a separate thread.
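
For what it's worth, the scalar case of that already works today: an
in-place += inside a prange body is compiled to an OpenMP reduction. A
small illustrative sketch (names made up), which is the pattern an array
reduction would generalize:

# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

def dot(double[:] a, double[:] b):
    # The += on s inside the prange body marks s as a reduction variable.
    cdef Py_ssize_t i
    cdef double s = 0.0
    for i in prange(a.shape[0], nogil=True):
        s += a[i] * b[i]
    return s
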
>
> Again, good points.
>
> Dag