[Numpy-discussion] NumPy to CPU+GPU compiler, looking for tests

Mon Oct 29 11:26:07 EDT 2012

On Mon, 2012-10-29 at 11:11 -0400, Frédéric Bastien wrote:
> > Assuming of course all the relevant backends are up to scratch.
> >
> > Is there a fundamental reason why targeting a CPU through OpenCL is
> > worse than doing it exclusively through C or C++?
> 
> First, OpenCL does not allow us to do pointer arithmetic. So when
> taking a slice of an ndarray, we can't just move the pointer. So we
> need to change the object structure.
> 
> I didn't do any speed analysis of this, but I think that using
> OpenCL would have a bigger overhead. So it is only useful for
> "big" ndarrays. I don't have a size in mind either. I don't know, but if
> we could access the OpenCL data directly from C/C++, we could bypass
> this for small arrays if we want. But maybe this is not possible!
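
On the pointer arithmetic point, one possible change to the object
structure is for each view to carry the base buffer plus an element
offset, so that slicing adjusts the offset rather than moving a pointer.
Below is a minimal sketch in plain Python, not anything Theano or NumPy
actually does: the OffsetView class and its fields are hypothetical, and
a standard-library array stands in for an opaque device buffer.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetView:
    """Hypothetical 1-D view: a shared base buffer plus an element offset."""
    base: array   # stands in for an opaque device buffer handle (assumption)
    offset: int   # index of this view's first element within base
    length: int   # number of elements visible through this view

    def slice(self, start, stop):
        # Clamp to the view bounds, then record a new offset instead of
        # moving a pointer; the base buffer is shared, never copied.
        start = max(0, min(start, self.length))
        stop = max(start, min(stop, self.length))
        return OffsetView(self.base, self.offset + start, stop - start)

    def tolist(self):
        # Read back only the elements this view covers.
        return self.base[self.offset:self.offset + self.length].tolist()


base = array("d", range(10))
whole = OffsetView(base, 0, len(base))
inner = whole.slice(2, 8).slice(1, 4)   # covers base[3:6]
```

In a real OpenCL setting the offset would presumably be passed to the
kernel as an extra argument next to the buffer, which is the kind of
structural change being described.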

My understanding is that when running OpenCL on CPU, one can simply map
memory from a host pointer using CL_MEM_USE_HOST_PTR during buffer
creation. On a CPU, this will result in no copies being made.

The overhead is clearly an issue, and was the subject of my question. I
wouldn't be surprised to find that the speedup associated with the free
multithreading that comes with OpenCL on CPU, along with the vector data
types mapping nicely to SSE etc., would make OpenCL on CPU faster on any
reasonably sized array.

It strikes me that if there is a neat way in which numpy objects can be
represented by coherent versions in both main memory and device memory,
then OpenCL could be used when it makes sense (either on CPU or GPU),
and the CPU natively when _it_ makes sense.
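
To make that idea a little more concrete, here is a rough sketch in
plain Python of an object that keeps a host copy and a device copy and
only transfers when the side being read is stale. Everything in it is
hypothetical: MirroredArray, the validity flags and SIZE_THRESHOLD are
invented for illustration, the "device" is just a second list, and a
real cutoff would have to be measured.

```python
class MirroredArray:
    """Hypothetical array with a host copy and a simulated device copy."""

    def __init__(self, values):
        self._host = list(values)
        self._device = None        # stand-in for a device buffer
        self._host_valid = True
        self._device_valid = False
        self.transfers = 0         # counts simulated host<->device copies

    def host(self, write=False):
        if not self._host_valid:   # host copy is stale: copy back
            self._host = list(self._device)
            self._host_valid = True
            self.transfers += 1
        if write:                  # caller will modify the host copy
            self._device_valid = False
        return self._host

    def device(self, write=False):
        if not self._device_valid: # device copy is stale: upload
            self._device = list(self._host)
            self._device_valid = True
            self.transfers += 1
        if write:                  # caller will modify the device copy
            self._host_valid = False
        return self._device


SIZE_THRESHOLD = 4  # arbitrary value, for illustration only


def double_in_place(arr, n):
    # Small arrays stay on the native path; larger ones use the "device".
    use_device = n >= SIZE_THRESHOLD
    buf = arr.device(write=True) if use_device else arr.host(write=True)
    for i in range(n):
        buf[i] *= 2


small = MirroredArray([1, 2])
double_in_place(small, 2)      # native path, no transfer
big = MirroredArray([1, 2, 3, 4, 5])
double_in_place(big, 5)        # device path, one upload
result = big.host()            # reading on the host copies back once
```

The point of the flags is that a small array that never leaves the
native path pays no transfer cost at all.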

Henry