Parallel/Multiprocessing script design question

"Martin v. Löwis" martin at v.loewis.de
Thu Sep 13 02:48:28 EDT 2007


> I tend to ramble, and I am afraid none of you busy experts will bother 
> reading my long post

I think that's a fairly accurate description, and prediction.

> I am hoping people 
> with experience using any of these would chime in with tips. The main thing 
> I would look for in a toolkit is maturity and no extra dependencies. Plus a 
> wide user community is always good. 
> POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(

After reading your problem description, I think you need none of these.
Parallelization-wise, it looks like a fairly straightforward task,
with coarse-grained parallelism (which is the easier kind).

> 2. The processing involves multiple steps that each input file has to go 
> through. I am trying to decide between a batch mode design and a pipelined 
> design for concurrency.

Try to avoid synchronization as much as you can if you want a good
speed-up: have long-running tasks that don't need to interact with
each other, rather than many short tasks that synchronize frequently.
If you haven't read about Amdahl's law, do so now.
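
To get a rough feel for why that matters, a back-of-the-envelope
sketch (the 5%/8-CPU numbers below are made up, not taken from your
post):

    def amdahl_speedup(parallel_fraction, n_cpus):
        """Upper bound on speed-up when only part of the work runs
        in parallel."""
        return 1.0 / ((1.0 - parallel_fraction)
                      + parallel_fraction / n_cpus)

    # If 5% of the total work is serialized (synchronization, merging
    # results, ...), 8 CPUs buy you at most about 5.9x, not 8x:
    print(amdahl_speedup(0.95, 8))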

From your description, having a single process handle an entire input
file, from the beginning to the end, sounds like the right approach;
use the multiple CPUs to process different input files in parallel
(IIUC, they can be processed in any order, and simultaneously).

> In the batched design, all files will be processed 
> on one processing step(in parallel) before the next step is started.

Why do that? If you have N "worker" processes, each one should proceed
at its own rate, i.e. each one does all the steps for a single file in
sequence; there is no need to wait until all processes have completed
one step before starting the next one.

As for N: don't make it the number of input files. Instead, make it
the number of CPUs, or perhaps twice the number of CPUs. The maximum
speed-up you can get out of k CPUs is k, so it is pointless (and
memory-consuming) to have many more than k worker processes.
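
A minimal sketch of such a worker pool, assuming a hypothetical
process_one.py script that does all the steps for the one input file
named on its command line:

    import subprocess
    import sys
    import time

    def run_workers(files, n_workers):
        """Keep at most n_workers children busy, one input file each."""
        pending = list(files)
        running = []                         # (filename, Popen) pairs
        failed = []
        while pending or running:
            # top up the pool
            while pending and len(running) < n_workers:
                name = pending.pop()
                p = subprocess.Popen(
                    [sys.executable, "process_one.py", name])
                running.append((name, p))
            # reap children that have finished
            still_running = []
            for name, p in running:
                if p.poll() is None:         # still working
                    still_running.append((name, p))
                elif p.returncode != 0:      # finished with an error
                    failed.append(name)
            running = still_running
            if running:
                time.sleep(0.5)              # don't busy-wait
        return failed

Here n_workers would be the number of CPUs (or twice that), and
run_workers returns the files whose child exited with a nonzero status.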

> In a pipelined design, each file will be taken through all steps to the end.

Perhaps this is what I suggested above - I would not call it
"pipelined", though, because in pipelining I would expect the separate
steps of a pipeline to run in parallel, each one passing its output to
the next step. That limits the maximum speed-up to the number of
pipeline steps: a four-step pipeline can never run more than four
times faster, no matter how many CPUs you have. IIUC, you have many
more input documents than pipeline steps, so parallelizing by input
data allows for higher speed-ups.

> The subprocess module should allow spawning new processes, 
> but I am not sure how to get status/error codes back from those?

What's wrong with the returncode attribute in subprocess?

> I guess I can forget about communicating with those?

Assuming all you want to know is whether they succeeded or failed: yes.
Just look at the exit status.
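
For a single child that boils down to something like this (again using
the made-up process_one.py and an invented file name):

    import subprocess
    import sys

    p = subprocess.Popen([sys.executable, "process_one.py",
                          "file0001.dat"])
    p.wait()                      # blocks until the child exits
    if p.returncode != 0:
        print("file0001.dat failed with status %d" % p.returncode)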

> I guess I would need some sort of queuing system here, to submit files 
> to the CPUs properly?

No. You don't submit files to CPUs, you submit them to processes (if
you "submit" anything at all). The operating system will choose a CPU
for you.

If you follow the architecture I proposed above (i.e. one Python
process takes a file from the beginning to the end), you pass the
file to be processed on the command line. If processing a single file
is fairly short (which I assume it is not), you could have a single
process handle multiple input files, one after another (e.g.
process 1 does files 1..400, process 2 does files 401..800, and so on).
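
For that chunked variant, a sketch along these lines, with a
hypothetical process_many.py that works through all the file names
given on its command line:

    import subprocess
    import sys

    def chunks(seq, size):
        """Consecutive slices of at most `size` items each."""
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    def run_in_chunks(files, n_workers):
        # one long-running child per chunk; each child gets its share
        # of the file names on its command line
        per_worker = (len(files) + n_workers - 1) // n_workers
        cmds = [[sys.executable, "process_many.py"] + chunk
                for chunk in chunks(files, per_worker)]
        children = [subprocess.Popen(cmd) for cmd in cmds]
        return [p.wait() for p in children]  # exit status of each child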

> The other way could be to have each individual file run through all the 
> steps and have multiple such "pipelines" running simultaneously in parallel. 

Ok. See above - I wouldn't call it pipelined, but this is what you
should do.

> It feels like this method will lose cache performance because all the code 
> for all the steps will be loaded at the same time, but I am not sure if I 
> should be worrying about that.

[I assume you mean "CPU cache" here]

You should not worry about it, and I very much doubt that this actually
happens. The code for all the steps will *not* be loaded into the
cache; the fact that it sits in main memory has no effect whatsoever on
cache performance. You can assume that all the code that runs a single
step will fit in the cache (unless the algorithms are very large).
OTOH, the data of an input file will probably not fit in the CPU cache.
What you should worry about is reducing disk IO, so all processing of
a file should run completely in memory. For that, it is better to run
each input file from the beginning to the end - if you first read all
800 files and decompressed them, the decompressed output would likely
not fit into the disk cache, and the next step would have to read that
data back from disk.
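
To make that concrete, a sketch of handling one file entirely in
memory; gzip and the transform() stub are assumptions, standing in for
whatever compression format and per-file steps you actually have:

    import gzip

    def transform(data):
        # placeholder for the real per-file steps (parse, compute, ...)
        return data

    def process_file(in_name, out_name):
        f = gzip.open(in_name, "rb")
        data = f.read()                 # decompress straight into memory
        f.close()
        result = transform(data)        # intermediate data never hits disk
        out = open(out_name, "wb")
        out.write(result)
        out.close()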

Regards,
Martin


