[docs] multiprocessing 'bug'

Alex Leach albl500 at york.ac.uk
Mon Dec 13 21:21:04 CET 2010


Dear Python developers,

I've been using the multiprocessing module for a month or two now, and I love
the speed improvements that come with it. However, I find it quite hard to
optimise memory usage amongst the processes. I've tried to avoid using
'shared state' as much as possible, but every time I start a new process,
everything in the parent process's global memory gets duplicated in the child
process, which is totally unnecessary (for me). I understand that this comes
as a side-effect of using os.fork(), which I was totally unaware of until I
read quite a nice article on the IBM website.
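To illustrate what I mean, here's a contrived sketch (not my actual code; the
list and its size are just made up for the example):

    import os
    from multiprocessing import Process

    # A large structure created in the parent before any child is started.
    big_list = range(10 * 1000 * 1000)

    def child():
        # On Linux, Process.start() uses os.fork(), so the child sees a copy
        # of everything the parent had in memory at fork time -- including
        # big_list, even though this function never asked for it.
        print 'child pid %d sees %d items' % (os.getpid(), len(big_list))

    if __name__ == '__main__':
        p = Process(target=child)
        p.start()
        p.join()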

When I first started using the multiprocessing module, I would start a process
every time I needed one, which was inside the innermost loop. I later realised
this was completely inefficient, because of the time penalty of process
creation and the fact that my program accumulates data as it runs: each new
multiprocessing.Process() started in this manner uses slightly more memory
than the previous one, i.e. exactly as much memory as the parent process holds
at that point. Even so, this gave a remarkable performance improvement over a
single-threaded version.
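What I do now is roughly the following (heavily simplified sketch; the worker
function, queue names and worker count are placeholders, not my real code):

    from multiprocessing import Process, Queue

    def do_work(item):
        return item * 2                        # placeholder for the real work

    def worker(task_q, result_q):
        # Long-lived worker: started once, then fed work through a queue
        # instead of being re-created inside the inner loop.
        for item in iter(task_q.get, None):    # None is the shutdown sentinel
            result_q.put(do_work(item))

    if __name__ == '__main__':
        task_q, result_q = Queue(), Queue()
        workers = [Process(target=worker, args=(task_q, result_q))
                   for _ in range(4)]
        for p in workers:
            p.start()                          # start before data accumulates

        for item in range(100):                # the 'inner loop' only enqueues
            task_q.put(item)
        for _ in workers:
            task_q.put(None)                   # one sentinel per worker

        results = [result_q.get() for _ in range(100)]
        for p in workers:
            p.join()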

So now I create all the Processes as close to the start of the main program as
possible. Still, the processes use a lot more memory than they need, and I'd
just rather they didn't... The program I'm currently working on uses the
parent process to create a list of worker Processes, as well as a separate
multiprocessing.Process instance that I use as a kind of database: it's the
only part of the program allowed to read from a >20GB text file, and it sends
chunks of that file through a multiprocessing.Pipe() to the main process,
which passes them down to one of the workers. I can't figure out which
processes to start first, though... Whichever processes are created first get
copied into the global memory of the processes created after them, which isn't
ideal.
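In case it helps, the layout looks roughly like this (again a simplified
sketch; the file name and the line-by-line chunking are invented for the
example):

    from multiprocessing import Process, Pipe, Queue

    def reader(conn, path):
        # The only process allowed to touch the big text file: it sends
        # chunks up to the main process through its end of the Pipe.
        with open(path) as fh:
            for line in fh:
                conn.send(line)
        conn.send(None)                    # tell the main process we're done
        conn.close()

    def worker(task_q):
        for chunk in iter(task_q.get, None):
            pass                           # placeholder for the real processing

    if __name__ == '__main__':
        parent_conn, child_conn = Pipe()
        task_q = Queue()

        workers = [Process(target=worker, args=(task_q,)) for _ in range(4)]
        db = Process(target=reader, args=(child_conn, 'huge_file.txt'))

        for p in workers:
            p.start()
        db.start()

        # Main process: shuttle chunks from the reader down to the workers.
        for chunk in iter(parent_conn.recv, None):
            task_q.put(chunk)
        for _ in workers:
            task_q.put(None)

        db.join()
        for p in workers:
            p.join()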

As I currently subclass multiprocessing.Process, would it be advisable to add
a section to the __init__ method that deletes all of the global variables?
Before I figured out why the processes were using all this memory, I spent
ages minimising how much of the parent's data the children actually need,
making sure any information they do need is passed over a pipe or queue.
Deleting things from global memory doesn't seem like a recommendable way of
doing it, as it doesn't stop the duplication in the first place, and it could
be quite dangerous: some of my Process subclasses have __del__ methods that
make sure all their pipes and queues get closed (this worked around an earlier
problem of having too many open file handles), and I'd risk deleting those
objects too.
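For reference, the shape of my Process subclasses is something like this (the
class name and the trivial run() body are invented for the example):

    from multiprocessing import Process, Pipe

    class Worker(Process):
        def __init__(self):
            Process.__init__(self)
            # Each worker owns its own pipe; the parent keeps self.conn,
            # and the child end is used inside run().
            self.conn, self._child_conn = Pipe()

        def run(self):
            for item in iter(self._child_conn.recv, None):
                self._child_conn.send(item * 2)    # placeholder work
            self._child_conn.close()

        def __del__(self):
            # Make sure both pipe ends get closed, so lots of discarded
            # Worker objects don't leak file handles.
            for conn in (self.conn, self._child_conn):
                try:
                    conn.close()
                except Exception:
                    pass

    if __name__ == '__main__':
        w = Worker()
        w.start()
        w.conn.send(21)
        print w.conn.recv()      # -> 42
        w.conn.send(None)        # shutdown sentinel
        w.join()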

I guess part of the problem is that I've got nearly every function, class and
Process written in the same .py file; using subprocess would get around the
memory duplication, but it would require a lot of painstaking back-tracking
and a load of extra .py files. I also need to pass dictionaries around, so
that wouldn't even work with the subprocess pipes.

I'm using Python 2.6.6, by the way, on Ubuntu 10.10 server with KDE. Any
advice would be greatly appreciated, and sorry about the lengthy email!

Thanks for any help, and kind regards,
Alex Leach



