[pypy-dev] gpgpu and pypy

Jim Baker jbaker at zyasoft.com
Fri Aug 20 20:25:11 CEST 2010

Jython single-threaded performance has little to do with a lack of the GIL.
Probably the only direct manifestation is the overhead of allocating
__dict__ (or dict) objects: Python attributes have volatile memory
semantics, which Jython ensures by backing them with a ConcurrentHashMap,
and that can be expensive to allocate. There are workarounds.

2010/8/20 Paolo Giarrusso <p.giarrusso at gmail.com>

> 2010/8/20 Jorge Timón <timon.elviejo at gmail.com>:
> > Hi, I'm just curious about the feasibility of running python code in a
> > gpu by extending pypy.
> Disclaimer: I am not a PyPy developer, even if I've been following the
> project with interest. Nor am I a GPU expert; I only provide links to
> the literature I've read.
> Yet, I believe that such an attempt is unlikely to be interesting.
> Quoting Wikipedia's synthesis:
> "Unlike CPUs however, GPUs have a parallel throughput architecture
> that emphasizes executing many concurrent threads slowly, rather than
> executing a single thread very fast."
> And significant optimizations are needed anyway to get performance for
> GPU code (and if you don't need the last bit of performance, why
> bother with a GPU?), so I think that the need to use a C-like language
> is the smallest problem.
> > I don't have the time (and probably not the knowledge either) to develop
> > that pypy extension, but I just want to know if it's possible.
> > I'm interested in languages like OpenCL and NVIDIA's CUDA because I think
> > the future of supercomputing is going to be GPGPU.
> I would like to point out that while for some cases it might be right,
> the importance of GPGPU is probably often exaggerated:
> http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1#
> Researchers in the field are mostly aware of the fact that GPGPU is
> the way to go only for a very restricted category of code. For that
> code, fine.
> Thus, instead of running Python code on a GPU, it is better to design
> from scratch an easy way to program a GPU efficiently for those tasks,
> and projects for that already exist (i.e. the ones you cite).
> Additionally, it would probably take a different kind of JIT to
> exploit GPUs. No branch prediction, very small non-coherent caches, no
> efficient synchronization primitives, as I read from this paper... I'm
> no expert, but I guess you'd need to re-architect the needed
> optimizations from scratch.
> And it took 20-30 years to get from the first, slow Lisp (1958) to,
> say, Self (1991), a landmark in performant high-level languages,
> derived from Smalltalk. Most of that would have to be redone.
> So, I guess that the effort to compile Python code for a GPU is not
> worth it. There might be further reasons due to the kind of code a JIT
> generates, since a GPU has no branch predictor, no caches, and so on,
> but I'm no GPU expert and I would have to check again.
> Finally, for general purpose code, exploiting the big expected number
> of CPUs on our desktop systems is already a challenge.
> > There are people working on bringing GPGPU to Python:
> >
> > http://mathema.tician.de/software/pyopencl
> > http://mathema.tician.de/software/pycuda
> >
> > Would it be possible to run python code in parallel without the need (for
> > the developer) of actively parallelizing the code?
> I would say that Python is not yet the language to use to write
> efficient parallel code, because of the Global Interpreter Lock
> (Google for "Python GIL"). The two implementations having no GIL are
> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL,
> and the current focus is not on removing it.
> Scientific computing uses external libraries (like NumPy) - for the
> supported algorithms, one could introduce parallelism at that level.
> If that's enough for your application, good.
> If you want to write a parallel algorithm in Python, we're not there yet.
> > I'm not talking about code of hard concurrency, but of code with
> > intrinsic parallelism (let's say matrix multiplication).
> Automatic parallelization is hard, see:
> http://en.wikipedia.org/wiki/Automatic_parallelization
> Lots of scientists have tried, lots of money has been invested, but
> it's still hard.
> The only practical approaches still require the programmer to
> introduce parallelism, but in ways much simpler than using
> multithreading directly. Google OpenMP and Cilk.
> > Would a JIT compilation be capable of detecting parallelism?
> Summing up what is above, probably not.
> Moreover, matrix multiplication may not be as easy as one might think.
> I do not know how to write it for a GPU, but at the end I cite
> some suggestions from that paper (where it is one of the benchmarks).
> Here, though, I explain why writing it for a CPU is complicated. You can
> multiply two matrices with a triply nested for loop, but such an algorithm
> performs poorly on big matrices because of bad cache locality.
> GPUs, according to the above-mentioned paper, provide no caches and
> hide latency in other ways.
> See here for the two main alternative ideas which allow solving this
> problem of writing an efficient matrix multiplication algorithm:
> http://en.wikipedia.org/wiki/Cache_blocking
> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
> Then, you need to parallelize the resulting code yourself, which might
> or might not be easy (depending on the interactions between the
> parallel blocks that are found there).
> In that paper, where matrix multiplication is called SGEMM (the
> BLAS routine implementing it), they suggest using a cache-blocked
> version of matrix multiplication for both CPUs and GPUs, and argue
> that parallelization is then easy.
> Cheers,
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
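
As a rough illustration of the quoted point about matrix multiplication, here
is a minimal sketch of the triply nested loop next to a cache-blocked (tiled)
variant. The function names and the `block` parameter are hypothetical, chosen
only for this example. In pure Python the interpreter overhead hides any cache
effect, so this shows the loop structure only, not a speedup.

```python
def matmul_naive(a, b):
    """Triply nested loop: c[i][j] is the sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile (cache blocking).

    `block` is a hypothetical tile size; a real implementation would pick
    it so that the tiles being worked on fit in cache together.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops walk over tiles, the three inner loops over
    # the elements inside the current tiles.
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + block, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Each output tile (a fixed pair of i0 and j0) is written by no other tile's
iterations, so the two outermost loops are the natural place for a programmer
to introduce parallelism by hand.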
