[docs] multiprocessing 'bug'
Alex Leach
albl500 at york.ac.uk
Mon Dec 13 21:21:04 CET 2010
Dear Python developers,
I've been using the multiprocessing module for a month or two now and I love
the speed improvements that come with it. However, I find it quite hard to
optimise memory usage amongst the processes. I've tried to avoid using the
'shared state' as much as possible, but every time I start a new process,
everything in the parent process's global memory gets duplicated for the child
process, which is totally unnecessary (for me). I understand that this comes
as a side-effect of using os.fork(), which I was totally unaware of until I
read quite a nice article on the IBM website.
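To make the fork behaviour concrete, here is a minimal sketch (written for
Python 3, though the behaviour is the same on 2.6 where fork is the only start
method; `BIG`, `child` and `measure` are made-up names for illustration).
Strictly speaking, fork memory is copy-on-write, but CPython's reference
counting writes to nearly every object's page, so in practice the child ends
up with its own copies:

```python
import multiprocessing

BIG = list(range(100_000))  # lives in the parent's global memory

def child(q):
    # With the fork start method, the child inherits a copy of BIG even
    # though the parent never explicitly sent it anything.
    q.put(len(BIG))

def measure():
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=child, args=(q,))
    p.start()
    n = q.get()
    p.join()
    return n

if __name__ == "__main__":
    print(measure())
```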
When I first started using the multiprocessing module, I would start a process
every time I needed one, inside the innermost loop. I later realised this is
completely inefficient because of the time penalty associated with process
creation, and because my program accumulates data as it runs: each new
mult...Process() started in this manner always uses slightly more memory than
the previous one, i.e. exactly as much memory as the parent process currently
holds. Even so, this still gave a remarkable performance improvement over just
a single thread.
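The "start the workers once, up front" pattern can be sketched roughly like
this (a minimal illustration, not my actual program; `worker` and `run` are
hypothetical names, and the squaring is just a stand-in for real work). The
workers are created early, while the parent is still small, and are then fed
through a Queue rather than being re-created per item:

```python
import multiprocessing

def worker(tasks, results):
    # Consume tasks until the None sentinel arrives, then exit.
    for item in iter(tasks.get, None):
        results.put(item * item)

def run(n_workers=4, n_items=10):
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:          # start everything before any data accumulates
        p.start()
    for i in range(n_items):
        tasks.put(i)
    for _ in procs:          # one shutdown sentinel per worker
        tasks.put(None)
    out = sorted(results.get() for _ in range(n_items))
    for p in procs:
        p.join()
    return out
```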
So now I start all the Processes as close to the beginning of the main program
as possible. Still, the processes use a lot more memory than they need, and
I'd just rather they didn't... The program I'm currently working on uses the
parent process to create a list of worker Processes, as well as a separate
mult...Process instance that I use as a kind of database; it's the only part
of the program allowed to read from a >20GB text file, and it sends chunks of
that file via a mult..Pipe() to the main process and then down to one of the
workers. I can't figure out which processes to start first, though: whichever
process(es) are created first get copied into the global memory of the other
process(es), which isn't ideal.
As I currently subclass multiprocessing.Process, would it be advisable to add
a section to the __init__ function that deletes all of the global variables?
Before I figured out why all the processes were using so much memory, I spent
ages minimising the child processes' dependence on data held by the parent
process, ensuring that any info the children do need gets passed over in a
pipe or queue. Deleting stuff from global memory doesn't seem like a
recommendable way of doing things, as it doesn't stop the memory duplication
in the first place, and it could be quite dangerous: I'm deleting subclasses
of Process to which I've given __del__ methods that ensure all pipes and
queues get closed (this worked around an earlier problem of having too many
open file handles).
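For the file-handle side of this, an alternative to __del__ is to close each
end of the pipe explicitly and shut workers down with a sentinel. A minimal
sketch (not my actual subclass — `Worker` and `shutdown` are hypothetical
names, and the increment is a placeholder for real work):

```python
import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self):
        super().__init__()
        # Both ends are created in the parent; after forking, each side
        # closes the end it doesn't use so no file handles are leaked.
        self.parent_conn, self.child_conn = multiprocessing.Pipe()

    def run(self):
        self.parent_conn.close()            # child only uses its own end
        for item in iter(self.child_conn.recv, None):
            self.child_conn.send(item + 1)
        self.child_conn.close()

    def shutdown(self):
        self.parent_conn.send(None)         # sentinel instead of __del__
        self.join()
        self.parent_conn.close()
```

In the parent you would call `w.child_conn.close()` right after `w.start()`,
then talk to the worker over `w.parent_conn` and finish with `w.shutdown()`.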
I guess the problem is that I've got nearly every function, class & Process
written into the same .py file. Using subprocess would get around this memory
duplication problem, but it would require a lot of painstaking back-tracking
and creating a load of extra .py files. And I also need to pass dictionaries
around, which wouldn't work over plain subprocess pipes without serialising
them first.
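For what it's worth, a dictionary can be pushed through a subprocess pipe if
it is pickled first — a minimal sketch, assuming a throwaway child
interpreter (`send_dict` and the inline child code are purely illustrative):

```python
import pickle
import subprocess
import sys

def send_dict(d):
    # Spawn a child interpreter that unpickles a dict from stdin and
    # prints one of its values back on stdout.
    child_code = (
        "import pickle, sys\n"
        "d = pickle.load(sys.stdin.buffer)\n"
        "sys.stdout.write(str(d['answer']))\n"
    )
    proc = subprocess.run(
        [sys.executable, "-c", child_code],
        input=pickle.dumps(d),
        stdout=subprocess.PIPE,
    )
    return proc.stdout.decode()
```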
I'm using Python v2.6.6, by the way, on Ubuntu 10.10 Server with KDE. Any
advice on this would be greatly appreciated, and sorry about the lengthy
email!
Thanks for any help, and kind regards,
Alex Leach