Parallelization of Python on GPU?

Jason Swails jason.swails at gmail.com
Thu Feb 26 10:27:14 EST 2015


On Wed, 2015-02-25 at 18:35 -0800, John Ladasky wrote:
> I've been working with machine learning for a while.  Many of the
> standard packages (e.g., scikit-learn) have fitting algorithms which
> run in single threads.  These algorithms are not themselves
> parallelized.  Perhaps, due to their unique mathematical requirements,
> they cannot be parallelized.  
> 
> When one is investigating several potential models of one's data with
> various settings for free parameters, it is still sometimes possible
> to speed things up.  On a modern machine, one can use Python's
> multiprocessing.Pool to run separate instances of scikit-learn fits.
> I am currently using ten of the twelve 3.3 GHz CPU cores on my machine
> to do just that.  And I can still browse the web with no observable
> lag.  :^)
> 
> Still, I'm waiting hours for jobs to finish.  Support vector
> regression fitting is hard.
> 
> What I would REALLY like to do is to take advantage of my GPU.  My
> NVidia graphics card has 1152 cores and a 1.0 GHz clock.  I wouldn't
> mind borrowing a few hundred of those GPU cores at a time and seeing
> what they can do.  In theory, I calculate that I can speed up the job
> by another five-fold.
> 
> The trick is that each process would need to run some PYTHON code, not
> CUDA or OpenCL.  The child process code isn't particularly fancy.  (I
> should, for example, be able to switch that portion of my code to
> static typing.)
> 
> What is the most effective way to accomplish this task?
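(For reference, the Pool-of-fits setup described above might look roughly
like the following sketch; the data set and parameter grid here are
invented.)

from multiprocessing import Pool

import numpy as np
from sklearn.svm import SVR

# Toy data standing in for the real problem; the sizes are arbitrary.
X = np.random.rand(500, 4)
y = np.random.rand(500)

def fit_one(params):
    """Fit a single SVR with one (C, epsilon) setting and return its score."""
    C, epsilon = params
    model = SVR(C=C, epsilon=epsilon)
    model.fit(X, y)
    return params, model.score(X, y)

if __name__ == '__main__':
    grid = [(C, eps) for C in (0.1, 1.0, 10.0) for eps in (0.01, 0.1)]
    with Pool(processes=10) as pool:   # ten of the twelve cores
        for params, score in pool.map(fit_one, grid):
            print(params, score)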

GPU computing is a lot more than simply saying "run this on a GPU".  To
realize the performance gains promised by a GPU, you need to tailor your
algorithms to take advantage of the hardware... SIMD reigns supreme
where thread divergence and branching are far more expensive than they
are in CPU computing.  So even if you decide to somehow translate your
Python code into a CUDA kernel, there is a good chance that you will be
woefully disappointed in the resulting speedup (or even more so if you
actually get a slowdown :)).  For example, a simple reduction is more
expensive on a GPU than it is on a CPU for small arrays.  A dot product
has a part that's super fast on the GPU (element-by-element
multiplication), and then a part that gets a lot slower (summing up all
elements of the resulting multiplication).  Each core on the GPU is a
lot slower than a CPU (which is why a 1000-CUDA-core GPU doesn't run
anywhere near 1000x faster than a CPU), so you really only get gains
when they can all work efficiently together.
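To make the dot-product point concrete, here's a rough, untested sketch
using numba's CUDA support (more on numba below); it assumes a
CUDA-capable card, and the array size is arbitrary:

import numpy as np
from numba import cuda

@cuda.jit
def elementwise_mult(a, b, out):
    # One thread per element -- exactly the kind of uniform, branch-free
    # work a GPU is good at.
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = a[i] * b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
elementwise_mult[blocks, threads_per_block](a, b, out)   # the fast part

# The summation is the awkward part: done naively it serializes, so here
# the result is simply copied back and reduced on the CPU.  A real GPU
# reduction needs a tree-shaped kernel or a library routine.
dot = out.sum()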

Another example -- matrix multiplies are *fast*.  Diagonalizations are
slow (which is why, in my field, where diagonalizations are a common
requirement, they are often done on the CPU while *building* the matrix
is done on the GPU).
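A rough sketch of that split, again assuming numba's CUDA support (the
matrix entries here are made up):

import numpy as np
from numba import cuda

@cuda.jit
def build_matrix(x, mat):
    # Every (i, j) entry is independent, so one thread per entry works well.
    i, j = cuda.grid(2)
    n = x.shape[0]
    if i < n and j < n:
        mat[i, j] = 1.0 / (1.0 + abs(x[i] - x[j]))   # made-up symmetric entry

n = 1024
x = np.random.rand(n)
mat = np.empty((n, n))
threads = (16, 16)
blocks = ((n + 15) // 16, (n + 15) // 16)

build_matrix[blocks, threads](x, mat)     # matrix assembly on the GPU
vals, vecs = np.linalg.eigh(mat)          # diagonalization stays on the CPU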
> 
> I came across a reference to a package called "Urutu" which may be
> what I need, however it doesn't look like it is widely supported.

Urutu seems to be built on PyCUDA and PyOpenCL (which are both written
by the same person, Andreas Kloeckner at UIUC in the United States).
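For a flavor of that layer, here is a minimal, untested PyCUDA sketch --
note that the kernel itself is still CUDA C, not Python:

import numpy as np
import pycuda.autoinit              # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *a, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= factor;
}
""")
scale = mod.get_function("scale")

a = np.random.rand(1024).astype(np.float32)
scale(drv.InOut(a), np.float32(2.0), block=(256, 1, 1), grid=(4, 1))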

Another package I would suggest looking into is numba, from Continuum
Analytics: https://github.com/numba/numba.  Unlike Urutu, their package
is built on LLVM and Python bindings they've written to implement
numpy-aware JIT capabilities.  I believe they also permit compiling down
to a GPU kernel through LLVM.  One downside I've experienced with that
package is that LLVM does not yet have a stable API (as I understand
it), so numba often lags behind the latest versions of LLVM.
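A rough example of the numpy-aware JIT side (the function and data here
are made up):

import numpy as np
from numba import jit

@jit(nopython=True)
def rbf_row(x, X, gamma):
    # A plain-Python double loop over numpy arrays, compiled to machine code.
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        d = 0.0
        for j in range(X.shape[1]):
            diff = x[j] - X[i, j]
            d += diff * diff
        out[i] = np.exp(-gamma * d)
    return out

X = np.random.rand(1000, 4)
row = rbf_row(X[0], X, 0.5)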
> 
> I would love it if the Python developers themselves added the ability
> to spawn GPU processes to the Multiprocessing module!

I would be stunned if this actually happened.  If you're worried about
performance, you get at least an order of magnitude performance boost by
going to numpy or writing the kernel directly in C or Fortran.  CPython
itself just isn't structured to run on a GPU... maybe pypy will tackle
that at some point in the probably-distant future.
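As a crude illustration of the numpy point (actual timings will vary, but
the gap is typically large):

import numpy as np

x = np.random.rand(10**6)

# Pure-Python loop: every iteration goes back through the interpreter.
total = 0.0
for v in x:
    total += v * v

# Vectorized: a single call into compiled code.
total_np = np.dot(x, x)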

All the best,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher



