[Numpy-discussion] Fast threading solution thoughts

Nathan Bell wnbell at gmail.com
Thu Feb 12 10:05:36 EST 2009


On Thu, Feb 12, 2009 at 8:19 AM, Michael Abshoff
<michael.abshoff at googlemail.com> wrote:
>
> Not even close. The current generation peaks at around 1.2 TFlops single
> precision, 280 GFlops double precision for ATI's hardware. The main
> problem with those numbers is that the memory on the graphics card
> cannot feed the data fast enough into the GPU to achieve theoretical
> peak. So those hundreds of GFlops are pure marketing :)
>

If your application is memory-bandwidth limited, then yes, you're not
likely to see hundreds of GFlops anytime soon.  However, compute-limited
applications can and do achieve hundreds of GFlops on GPUs.  Basic
operations like FFTs and (level 3) BLAS are compute limited, as are
the following applications:
http://www.ks.uiuc.edu/Research/gpu/
http://www.dam.brown.edu/scicomp/scg-media/report_files/BrownSC-2008-27.pdf
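
For intuition on why (level 3) BLAS escapes the bandwidth bound, here
is a roofline-style back-of-envelope sketch (the numbers are my own
illustration, not from this thread): an n-by-n double-precision matrix
multiply performs 2*n^3 flops while touching only about 3*n^2 matrix
entries, so its flop-to-byte ratio grows linearly with n and the ALUs,
not the memory bus, become the limit.

    # Arithmetic intensity of an n-by-n double-precision matrix multiply
    # (illustrative roofline-style estimate, not a benchmark).
    def gemm_intensity(n):
        flops = 2.0 * n ** 3          # one multiply and one add per term
        bytes_moved = 3 * n ** 2 * 8  # read A and B, write C; 8-byte doubles
        return flops / bytes_moved    # flops per byte, grows like n / 12

    for n in (100, 1000, 4000):
        print(n, gemm_intensity(n))   # roughly 8.3, 83.3, 333.3 flops/byte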

> So in reality you might get anywhere from 20% to 60% (if you are lucky)
> locally before accounting for transfers from main memory to GPU memory
> and so on. Given that recent Intel CPUs give you about 7 to 11 GFlops
> double precision per core, and libraries like ATLAS give you that
> performance today without the need to jump through hoops, these numbers
> start to look a lot less impressive.

You neglect to mention that CPUs, which have roughly 1/10th the memory
bandwidth of high-end GPUs, are memory bound on the very same
problems.  You will not see 7 to 11 GFlops from a memory-bound CPU code
for the same reason you argue that GPUs don't achieve hundreds of
GFlops on memory-bound GPU codes.

In severely memory-bound applications like sparse matrix-vector
multiplication (i.e. A*x for sparse A), the best performance you can
expect is ~10 GFlops on the GPU and ~1 GFlop on the CPU (in double
precision).  We discuss this problem in the following tech report:
http://forums.nvidia.com/index.php?showtopic=83825
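
Those ~10/~1 numbers fall out of a simple bandwidth argument.  A rough
sketch (the byte counts and bandwidths below are my illustrative
assumptions, not measurements): CSR SpMV moves at least 12 bytes per
nonzero while performing 2 flops per nonzero, so throughput is capped
at roughly bandwidth / 6 flops per byte.

    # Bandwidth-bound ceiling for double-precision CSR SpMV.
    # Assumed streaming bandwidths (illustrative): ~100 GB/s for a
    # high-end GPU, ~10 GB/s for a CPU of the same vintage.
    bytes_per_nnz = 8 + 4   # matrix value (double) plus column index (int32)
    flops_per_nnz = 2       # one multiply and one add per nonzero

    for name, gb_per_s in [("GPU", 100.0), ("CPU", 10.0)]:
        bound = gb_per_s / bytes_per_nnz * flops_per_nnz
        print("%s: ~%.1f GFlops upper bound" % (name, bound))

Row-pointer and vector traffic plus irregular access push real codes
below these ceilings, which is consistent with the ~10 and ~1 GFlops
figures above.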

It's true that host<->device transfers can be a bottleneck.  In many
cases, the solution is to simply leave the data resident on the GPU.
For instance, you could imagine a variant of ndarray that held a
pointer to a device array.  Of course this requires that the other
expensive parts of your algorithm also execute on the GPU so you're
not shuttling data over the PCIe bus all the time.
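
In fact, PyCUDA's gpuarray module already works in this spirit; here
is a minimal sketch of the keep-it-on-the-device pattern (assuming
PyCUDA and a CUDA-capable card; the array sizes are arbitrary):

    import numpy as np
    import pycuda.autoinit            # creates a CUDA context on import
    import pycuda.gpuarray as gpuarray

    # Pay the PCIe transfer cost once on the way in...
    x = gpuarray.to_gpu(np.random.rand(1 << 20))
    y = gpuarray.to_gpu(np.random.rand(1 << 20))

    # ...chain elementwise work on the device with no host round trips...
    z = 2 * x + y

    # ...and copy back once at the end.
    result = z.get()

A GPU-backed ndarray variant would make that same pattern transparent
to ordinary NumPy code.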


Full Disclosure: I'm a researcher at NVIDIA

-- 
Nathan Bell wnbell at gmail.com
http://graphics.cs.uiuc.edu/~wnbell/


