Parallel/Multiprocessing script design question

Ivan Voras ivoras at fer.hr
Thu Sep 13 09:53:45 EDT 2007


Amit N wrote:

> About 800+ 10-15MB files are generated daily that need to be processed. The 
> processing consists of different steps that the files must go through:
> 
> -Uncompress
> -FilterA
> -FilterB
> -Parse
> -Possibly compress parsed files for archival

You can implement one of two straightforward approaches:

1 - Create one program, start N instances of it, where N is the number 
of CPUs/cores, and let each instance process one file to completion. 
You'll probably need an "overseer" program to start them and dispatch 
jobs to them. The easiest scheme is to start N processes with the first 
N files, monitor them for completion, and whenever one finishes, start 
another with the next file in the queue, and so on (see the first 
sketch below).

2 - Create a program / process for each of these steps and let the 
steps operate independently, feeding the output of one step into the 
input of the next. You'll probably need some buffering and flow 
control, so that if (for example) "FilterA" is slower than 
"Uncompress", the "Uncompress" process is signaled to wait until 
"FilterA" needs more data. The key is that, as long as all the steps 
run at approximately the same speed, they can run in parallel (see the 
second sketch below).

Note that both approaches are in principle independent of whether you 
use threads or processes, apart from the mechanics of communication 
between the steps/stages. But because of CPython's Global Interpreter 
Lock, threads won't run Python code in parallel, so use processes if 
parallel execution is the goal.

