Multiprocessing: don't push the pedal to the metal?

John Ladasky ladasky at my-deja.com
Sat May 21 23:58:10 EDT 2011


Hello again, everyone.

I'm developing some custom neural network code.  I'm using Python
2.6, NumPy 1.5, and Ubuntu Linux 10.10.  I have an AMD 1090T six-core
CPU.  About six weeks ago, I asked some questions about
multiprocessing in Python, and I got some very helpful responses from
you all.

http://groups.google.com/group/comp.lang.python/browse_frm/thread/374e1890efbcc87b

Now I'm back with a new question.  I have gotten comfortable with
cProfile, and with multiprocessing's various Queues (I've graduated
from Pool).  I just ran some extensive tests of my newest code, and
I've learned some surprising things.  I have a pretty picture here (be
sure to view the full-size image):

http://www.flickr.com/photos/15579975@N00/5744093219

I'll quickly ask my question first, to avoid a TL;DR problem: when you
have a multi-core CPU with N cores, is it common to see performance
peak at N-1 or even N-2 processes?  And if so, should you avoid using
quite as many processes as there are cores?  I was expecting
diminishing returns for each additional core -- but not outright
declines.
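
Put more concretely: when I spawn my workers, should I be doing
something like the first line below rather than the second?  (Just a
guess on my part at this point.)

    import multiprocessing as mp

    n_workers = max(1, mp.cpu_count() - 1)   # leave a core free for the
                                             # main process and the OS?
    # rather than:
    # n_workers = mp.cpu_count()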

A performance peak below N processes is what I think my data shows
for many of my trial runs.  I ran this test twice.  During one run, I
was reading a few PDFs and web pages while the speed test was going;
but even when I wasn't using the computer for those other (light)
tasks, I saw the same performance drops.  Perhaps this is due to OS
overhead?  The load average on my system monitor looks pretty quiet
when I'm not running my program.
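
(I suppose I could also log the load from inside the test script to
double-check, with something like the line below -- os.getloadavg()
is Unix-only, as far as I know.)

    import os

    # 1-, 5- and 15-minute load averages, the same numbers the
    # system monitor reports
    print('load averages: %.2f %.2f %.2f' % os.getloadavg())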

OK, if you care to read further, here's some more detail...

My graphs show the execution times of my neural network evaluation
routine as a function of:

- the size of my neural network (six sizes were tried -- with varying
numbers of inputs, outputs and hidden nodes),
- the subprocess configuration (either not using a subprocess, or
using 1-6 subprocesses), and
- the size of the input data vector (from 7 to 896 sets of inputs --
I'll explain the rationale for the exact numbers I chose if anyone
cares to know).

Each graph is normalized to the time the evaluation routine takes on
a single CPU, without invoking a subprocess.  Obviously, I'm looking
for the conditions which yield performance gains over that baseline.
(I'll be running this particular piece of code millions of times!)
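
In case the normalization isn't clear, it's just a ratio.  The
numbers below are made up; only the arithmetic matters.

    # keys are (number of subprocesses, input sets); values are seconds;
    # 0 subprocesses = the single-CPU, no-subprocess baseline
    raw_times = {(0, 896): 2.40, (1, 896): 2.55,
                 (2, 896): 1.45, (3, 896): 1.10}
    baseline = raw_times[(0, 896)]
    normalized = dict((k, t / baseline) for k, t in raw_times.items())
    # normalized values below 1.0 beat the baseline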

I ran 200 repetitions for each combination of network size, input
data size, and number of CPU cores.  Even so, there was substantial
irregularity in the timing graphs.  So, rather than connecting the
dots directly (which would produce messy, hard-to-read crossing
lines), I fit B-spline curves to the data.
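
The smoothing itself is nothing exotic -- something along the lines
of the sketch below, with scipy.interpolate doing the work.  The x
values just span the 7-to-896 range I mentioned; the y values here
are made up.

    import numpy as np
    from scipy import interpolate

    x = np.array([7., 14., 28., 56., 112., 224., 448., 896.])
    y = np.array([1.8, 1.4, 1.1, 0.9, 0.7, 0.6, 0.55, 0.5])

    # cubic smoothing B-spline against log2(data size)
    tck = interpolate.splrep(np.log2(x), y, s=0.01)
    x_fine = np.linspace(np.log2(x).min(), np.log2(x).max(), 200)
    y_fine = interpolate.splev(x_fine, tck)    # smooth curve to plot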

As I anticipated, there is a performance penalty incurred just for
parceling out the data to the multiple processes and collating the
results at the end.  When the data set is small, it's faster to send
it to a single CPU, without invoking a subprocess.  In fact, dividing
a small task among three processes can underperform a two-process
approach, and so on!  See the leftmost two panels in the top row, and
the rightmost two panels in the bottom row.
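
To be concrete about what I mean by "parceling out" and "collating":
the pattern is roughly the sketch below.  It is not my actual code --
evaluate() stands in for the real network evaluation, and the
chunking is just numpy.array_split.

    import multiprocessing as mp
    import numpy as np

    def evaluate(chunk):
        # stand-in for the real network evaluation
        return chunk * 2.0

    def worker(task_q, result_q):
        # pull (index, chunk) pairs until the None sentinel arrives
        for index, chunk in iter(task_q.get, None):
            result_q.put((index, evaluate(chunk)))

    def run(data, n_procs):
        task_q, result_q = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker, args=(task_q, result_q))
                 for _ in range(n_procs)]
        for p in procs:
            p.start()
        chunks = np.array_split(data, n_procs)     # parcel out...
        for i, chunk in enumerate(chunks):
            task_q.put((i, chunk))
        for _ in procs:
            task_q.put(None)                       # one sentinel per worker
        results = dict(result_q.get() for _ in chunks)
        for p in procs:
            p.join()
        return np.concatenate([results[i]          # ...and collate
                               for i in range(len(chunks))])

    # e.g. run(np.arange(896.0), n_procs=3)

My assumption is that the fixed cost is mostly the pickling of chunks
and results through the Queues, which is why the small data sets lose.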

When the networks increase in complexity, the size of the data set for
which break-even performance is achieved drops accordingly.  I'm more
concerned about optimizing these bigger problems, obviously, because
they take the longest to run.

What I did not anticipate was the outright performance reversal with
added computing power on large data sets.  Comments are appreciated!


