Multiprocessing, shared memory vs. pickled copies

John Ladasky ladasky at my-deja.com
Sat Apr 9 16:18:47 EDT 2011


On Apr 9, 10:15 am, sturlamolden <sturlamol... at yahoo.no> wrote:
> On 9 apr, 09:36, John Ladasky <lada... at my-deja.com> wrote:
>
> > Thanks for finding my discussion!  Yes, it's about passing numpy
> > arrays to multiple processors.  I'll accomplish that any way that I
> > can.
>
> My preferred ways of doing this are:
>
> 1. Most cases for parallel processing are covered by libraries, even
> for neural nets. This particularly involves linear algebra solvers and
> FFTs, or calling certain expensive functions (sin, cos, exp) over and
> over again. The solution here is optimised LAPACK and BLAS (Intel MKL,
> AMD ACML, GotoBLAS, ATLAS, Cray libsci), optimised FFTs (FFTW, Intel
> MKL, ACML), and fast vector math libraries (Intel VML, ACML). For
> example, if you want to make multiple calls to the function "exp",
> there is a good chance you want to use a vector math library. Despite
> this, most Python programmers' instinct seems to be to use multiple
> processes with numpy.exp or math.exp, or use multiple threads in C
> with exp from libm (cf. math.h). Why go through this pain when a
> single function call to Intel VML or AMD ACML (acml-vm) will be much
> better? It is common to see scholars argue that "yes but my needs are
> so special that I need to customise everything myself." Usually this
> translates to "I don't know these libraries (not even that they exist)
> and am happy to reinvent the wheel."

Whoa, Sturla.  That was a proper core dump!

You're right, I'm unfamiliar with the VAST array of libraries that you
have just described.  I will have to look at them.  It's true, I
probably only know of the largest and most widely-used Python
libraries.  There are so many; who can keep track?
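
For what it's worth, here's a tiny sketch of the kind of difference I
take you to mean -- one vectorized call versus an explicit Python loop
(plain numpy here; as I understand it, a vendor library such as VML
would just make that single call faster still):

    import math
    import numpy as np

    x = np.random.rand(1000000)

    # One vectorized call: the loop over elements runs in compiled code.
    y_fast = np.exp(x)

    # Element-by-element Python loop: one interpreted call per element.
    y_slow = np.array([math.exp(v) for v in x])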

Now, I do have a special need.  I've implemented a modified version of
the Fahlman Cascade Correlation algorithm that will not be found in
any existing libraries, and which I think should be superior for
certain types of problems.  (I might even be able to publish this
algorithm, if I can get it working and show some examples?)

That doesn't mean that I can't use the vector math libraries that
you've recommended.  As long as those libraries can take advantage of
my extra computing power, I'm interested.  Note, however, that the
cascade evaluation does have a strong sequential requirement.  It's
not a traditional three-layer network.  In fact, describing a cascade
network according to the number of "layers" it has is not very
meaningful, because each hidden node is essentially its own layer.

So, there are limited advantages to trying to parallelize the
evaluation of ONE cascade network's weights against ONE input vector.
However, evaluating copies of one cascade network against several
different test inputs simultaneously should scale up
nicely.  Evaluating many possible test inputs is exactly what you do
when training a network to a data set, and so this is how my program
is being designed.
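
Roughly what I have in mind is the sketch below.  Here
evaluate_cascade, the array shapes, and the dot product are just
stand-ins for my actual cascade code:

    import numpy as np
    from multiprocessing import Pool

    def evaluate_cascade(args):
        """Stand-in for evaluating one cascade network on one input."""
        weights, input_vector = args
        # The real evaluation is sequential within one network, but
        # each (network, input) pair is independent of the others.
        return np.dot(weights, input_vector)   # placeholder computation

    if __name__ == '__main__':
        weights = np.random.rand(10, 8)                    # dummy weights
        inputs = [np.random.rand(8) for _ in range(1000)]  # many test inputs

        pool = Pool()                  # one worker per core by default
        outputs = pool.map(evaluate_cascade,
                           [(weights, vec) for vec in inputs])
        pool.close()
        pool.join()

The catch with this naive version is that the weights are pickled
along with every single task, which is exactly the overhead that
started this thread.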

> Thus, if you think you need to
> use manually managed threads or processes for parallel technical
> computing, and even contemplate that the GIL might get in your way,
> there is a 99% chance you are wrong. You will almost ALWAYS want to
> use a fast library, either directly in Python or linked to your own
> serial C or Fortran code. You have probably heard that "premature
> optimisation is the root of all evil in computer programming." It
> particularly applies here.

Well, I thought that NUMPY was that fast library...

Funny how this works, though -- I built my neural net class in Python,
rather than avoiding numpy and going straight to wrapping code in C,
precisely because I wanted to AVOID premature optimization (for
unknown and questionable gains in performance).  I started on this
project when I had only a single-core CPU, though.  Now that multi-
core CPUs are apparently here to stay, and I've seen just how long my
program takes to run, I want to make full use of multiple cores.  I've
even looked at MPI.  I'm considering networking to another multi-CPU
machine down the hall, once I have my program working.

> But again, I'd urge you to consider a library or threads
> (threading.Thread in Cython or OpenMP) before you consider multiple
> processes.

My single-CPU neural net training program had two threads, one for the
GUI and one for the neural network computations.  Correct me if I'm
wrong here, but -- since the two threads share a single Python
interpreter, this means that only a single CPU is used, right?  I'm
looking at multiprocessing for this reason.
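
Concretely, I'm picturing something like the sketch below: keep the
GUI in the main process, and push the number-crunching into a child
process that gets its own interpreter and its own GIL.  (train_network
is just a stand-in for my real training loop.)

    from multiprocessing import Process, Queue

    def train_network(result_queue):
        """Stand-in for the CPU-bound training loop; it runs in a
        separate process, so it does not contend for the GUI's GIL."""
        total = sum(i * i for i in range(10 ** 6))   # placeholder work
        result_queue.put(total)

    if __name__ == '__main__':
        results = Queue()
        worker = Process(target=train_network, args=(results,))
        worker.start()
        # ... the GUI event loop keeps running here, unblocked ...
        print(results.get())    # blocks until the worker reports back
        worker.join()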

> The reason I have not updated the sharedmem arrays for two
> years is that I have come to the conclusion that there are better ways
> to do this (particularly vendor-tuned libraries). But since they are
> mostly useful with 64-bit (i.e. large arrays), I'll post an update
> soon.
>
> If you decide to use a multithreaded solution (or shared memory as
> IPC), beware of "false sharing". If multiple processors write to the
> same cache line (typically 64 or 128 bytes, depending on hardware), you'll
> create an invisible "GIL" that will kill any scalability. That is
> because dirty cache lines need to be synchronized with RAM. "False
> sharing" is one of the major reasons that "home-brewed" compute-
> intensive code will not scale.

Even though I'm not formally trained in computer science, I am very
conscious of the fact that WRITING to shared memory is a problem,
cache or otherwise.  At the very top of this thread, I pointed out
that my neural network training function would need READ-ONLY access
to two items -- the network weights, and the input data.  Given that,
and my (temporary) struggles with pickling, I considered the shared-
memory approach as an alternative.
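
For the record, what I was attempting looks roughly like the sketch
below: put the read-only arrays into multiprocessing.RawArray buffers
once, hand those buffers to each worker when the pool starts, and
re-wrap them with numpy.frombuffer, so nothing has to be pickled per
task.  (evaluate_row and the shapes are placeholders; I gather your
sharedmem arrays wrap this sort of thing up more cleanly.)

    import ctypes
    import numpy as np
    from multiprocessing import Pool, RawArray

    _shared = {}   # populated inside each worker by _init_worker

    def _init_worker(weights_buf, weights_shape, inputs_buf, inputs_shape):
        # Re-wrap the shared buffers as numpy views; no data is copied.
        _shared['weights'] = np.frombuffer(weights_buf).reshape(weights_shape)
        _shared['inputs'] = np.frombuffer(inputs_buf).reshape(inputs_shape)

    def evaluate_row(i):
        """Read-only work on the shared data (placeholder computation)."""
        return float(np.dot(_shared['weights'][0], _shared['inputs'][i]))

    if __name__ == '__main__':
        n_inputs, n_features = 1000, 8
        weights_buf = RawArray(ctypes.c_double, 10 * n_features)
        inputs_buf = RawArray(ctypes.c_double, n_inputs * n_features)

        # Fill the shared buffers through numpy views in the parent.
        np.frombuffer(weights_buf)[:] = np.random.rand(10 * n_features)
        np.frombuffer(inputs_buf)[:] = np.random.rand(n_inputs * n_features)

        pool = Pool(initializer=_init_worker,
                    initargs=(weights_buf, (10, n_features),
                              inputs_buf, (n_inputs, n_features)))
        outputs = pool.map(evaluate_row, range(n_inputs))
        pool.close()
        pool.join()

Since the workers only read these arrays, no locks are needed -- and
nobody is writing into anybody else's cache lines.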

> It is not uncommon to see Java programmers complain about Python's
> GIL, and then they go on to write I/O-bound or falsely shared code. Rest
> assured that multi-threaded Java will not scale better than Python in
> these cases :-)

I've never been a Java programmer, and I hope it stays that way!


