parallel subprocess.getoutput

Adam Skutt askutt at gmail.com
Fri May 11 08:39:00 EDT 2012


On May 11, 8:04 am, Jaroslav Dobrek <jaroslav.dob... at gmail.com> wrote:
> Hello,
>
> I wrote the following code for using egrep on many large files:
>
> MY_DIR = '/my/path/to/dir'
> FILES = os.listdir(MY_DIR)
>
> def grep(regex):
>     i = 0
>     l = len(FILES)
>     output = []
>     while i < l:
>         command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' +
> FILES[i]
>         result = subprocess.getoutput(command)
>         if result:
>             output.append(result)
>         i += 1
>     return output
>
> Yet, I don't think that the files are searched in parallel. Am I
> right? How can I search them in parallel?

subprocess.getoutput() blocks until the command writes out all of its
output, so no, they're not going to be run in parallel.  You really
shouldn't use it anyway, as it's very difficult to use it securely.
Your code, as it stands, could be exploited if the user can supply the
regex or the directory.
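
To make the injection point concrete, here's the same call without a
shell (just a sketch; regex and path stand in for whatever values you
already have).  It still blocks like getoutput() does, so it only
addresses the security problem, not the parallelism:

import subprocess

regex = 'some pattern'    # stands in for the user-supplied pattern
path = '/my/path/to/dir/somefile'

# getoutput() hands the whole string to a shell, so a regex such as
# '"; rm -rf ~; echo "' would be run as shell commands.  Passing the
# program and its arguments as a list avoids the shell entirely.
proc = subprocess.Popen(['egrep', regex, path],
                        stdout=subprocess.PIPE,
                        universal_newlines=True)
result, _ = proc.communicate()   # still blocks until egrep exits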

There are plenty of tools to do parallel execution in a shell, such
as: http://code.google.com/p/ppss/.  I would use one of those tools
first.

Nevertheless, if you must do it in Python, then the most portable way
to accomplish what you want is to do the following (a rough sketch
appears after the list):
0) Create a thread-safe queue object to hold the output.
1) Create each process using a subprocess.Popen object.  Do this
safely and securely, which means NOT passing shell=True in the
constructor, redirecting stdin away from the terminal (e.g. from
os.devnull), and leaving stderr alone unless you intend to capture
error output.
2) Spawn a new thread for each process.  That thread should block
reading the Popen.stdout file object.  Each time it reads some output,
it should then write it to the queue.  If you monitor stderr as well,
you'll need to spawn two threads per subprocess.  When EOF is reached,
close the file object and call Popen.wait() to reap the process once
it exits (this is trickier with two threads and requires additional
synchronization).
3) After spawning each process, monitor the queue in the first thread
and capture all of the output.
4) Call the join() method on all of the threads to wait for them to
finish.  The easiest way to know when all of the output has arrived is
to have each thread write a special object (a sentinel) to the queue
to indicate that it is done.
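
Here is a rough, untested sketch of steps 0-4, reusing the MY_DIR and
FILES globals from your code.  reader() and SENTINEL are just names I
made up for the example, and subprocess.DEVNULL needs Python 3.3 (on
older versions, open os.devnull yourself):

import os
import queue
import subprocess
import threading

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

SENTINEL = object()              # marks "this worker is finished"

def reader(proc, out_q):
    # Step 2: drain one child's stdout into the shared queue, then reap it.
    for line in proc.stdout:
        out_q.put(line)
    proc.stdout.close()
    proc.wait()
    out_q.put(SENTINEL)

def grep_parallel(regex):
    out_q = queue.Queue()        # step 0: thread-safe queue for the output
    threads = []
    for name in FILES:
        path = os.path.join(MY_DIR, name)
        # Step 1: no shell=True, stdin redirected away from the terminal,
        # stderr left alone because we aren't capturing error output here.
        proc = subprocess.Popen(['egrep', regex, path],
                                stdout=subprocess.PIPE,
                                stdin=subprocess.DEVNULL,
                                universal_newlines=True)
        t = threading.Thread(target=reader, args=(proc, out_q))
        t.start()
        threads.append(t)

    # Step 3: collect output in this thread until every worker is done.
    output = []
    finished = 0
    while finished < len(threads):
        item = out_q.get()
        if item is SENTINEL:
            finished += 1
        else:
            output.append(item)

    for t in threads:            # step 4: join the now-finished threads
        t.join()
    return output

A real version would probably cap the number of simultaneous children
(e.g. by feeding FILES to a small pool of worker threads) instead of
spawning one egrep per file all at once, but the structure is the
same.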

If you don't mind platform-specific code (and it doesn't look like you
do), then you can use fcntl.fcntl to make each file object
non-blocking, and then use any of the various asynchronous I/O APIs to
avoid the use of threads.  You still need to clean up all of the file
objects and processes when you are done, though.
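
For example (POSIX-only, and again only a sketch; set_nonblocking() is
a made-up helper), using fcntl to set O_NONBLOCK and select to
multiplex the pipes:

import fcntl
import os
import select
import subprocess

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def set_nonblocking(fd):
    # Turn on O_NONBLOCK for a file descriptor via fcntl.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

def grep_parallel(regex):
    procs = {}                   # stdout fd -> Popen object
    chunks = {}                  # stdout fd -> list of bytes read so far
    for name in FILES:
        path = os.path.join(MY_DIR, name)
        proc = subprocess.Popen(['egrep', regex, path],
                                stdout=subprocess.PIPE)
        fd = proc.stdout.fileno()
        set_nonblocking(fd)
        procs[fd] = proc
        chunks[fd] = []

    while procs:
        readable, _, _ = select.select(list(procs), [], [])
        for fd in readable:
            data = os.read(fd, 65536)
            if data:
                chunks[fd].append(data)
            else:                        # EOF: clean up this child
                child = procs.pop(fd)
                child.stdout.close()
                child.wait()

    return [b''.join(c).decode() for c in chunks.values() if c]

Keep in mind that select() is limited to 1024 descriptors on most
systems, so for a very large directory you'd want poll/epoll or a cap
on the number of simultaneous children.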


