Digging into multiprocessing

Demian Brecht demianbrecht at gmail.com
Wed Aug 14 10:23:26 EDT 2013


Awesome, thanks for the detailed response Chris.

On Tue, Aug 13, 2013 at 8:03 AM, Chris Angelico <rosuav at gmail.com> wrote:
> On Tue, Aug 13, 2013 at 12:17 AM, Demian Brecht <demianbrecht at gmail.com> wrote:
>> Hi all,
>>
>> Some work that I'm doing atm is in some serious need of
>> parallelization. As such, I've been digging into the multiprocessing
>> module more than I've had to before and I had a few questions come up
>> as a result:
>>
>> (Running 2.7.5+ on OSX)
>>
>> 1. From what I've read, a new Python interpreter instance is kicked
>> off for every worker. My immediate assumption was that the file the
>> code was in would be reloaded for every instance. After some
>> digging, this is obviously not the case (printing __name__ at the top
>> of the file only yields a single output line). So, I'm assuming that
>> there's some optimization that passes the bytecode along within the
>> interpreter? How, exactly, does this work? (I couldn't really find
>> much in the docs about it, or am I just not looking in the right
>> place?)
>
> I don't know about OSX specifically, but I believe it forks, same as
> on Linux. That means all your initialization code is done once. Be
> aware that this is NOT the case on Windows.
>
> http://en.wikipedia.org/wiki/Fork_(operating_system)
>
> Effectively, code execution proceeds down a single thread until the
> point of forking, and then the fork call returns twice. Can be messy
> to explain but it makes great sense once you grok it!
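A minimal sketch of that fork-returns-twice behaviour, using os.fork directly (POSIX only; this illustrates the mechanism underneath multiprocessing, not multiprocessing itself):

```python
import os

def fork_demo():
    """Fork and show that execution continues in both processes."""
    r, w = os.pipe()
    pid = os.fork()  # returns twice: 0 in the child, the child's PID in the parent
    if pid == 0:
        # Child: inherits a copy of everything initialized before the fork.
        os.close(r)
        os.write(w, b"child")
        os._exit(0)  # leave without running parent-only cleanup
    # Parent: read what the child wrote, then reap it.
    os.close(w)
    data = os.read(r, 16)
    os.close(r)
    os.waitpid(pid, 0)
    return data

print(fork_demo())  # b'child'
```

Because the child is a copy, all module-level setup done before the fork appears exactly once, which matches the single `__name__` print observed above.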
>
>> 2. For cases using methods such as map_async/wait, once the bytecode
>> has been passed into the child process, `target` is called `n` times
>> until the current queue is empty. Is this correct?
>
> That would be about right, yes. The intention is that it's equivalent
> to map(), only it splits the work across multiple processes; so the
> expectation is that it will call target once for each item yielded by
> the iterable.
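A small sketch of that usage (written in the Python 3 style for clarity; on 2.7 the same calls apply, minus range() returning an iterator):

```python
from multiprocessing import Pool

def square(n):
    # target: called once per item pulled off the task queue by some worker
    return n * n

def run():
    pool = Pool(processes=4)
    try:
        async_result = pool.map_async(square, range(5))
        async_result.wait()        # block until every task has finished
        return async_result.get()  # gathered results, in input order
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    print(run())  # [0, 1, 4, 9, 16]
```

map_async returns immediately with an AsyncResult; wait() blocks until the queue is drained, and get() returns the results in the order of the input iterable regardless of which worker ran which item.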
>
>> 3. Because the __main__ block is only run when the root process
>> imports, if using global, READ-ONLY objects such as, say, a database
>> connection, then it might be better from a performance standpoint to
>> initialize that in __main__, relying on the interpreter's references
>> to be passed around correctly. I've read some blogs and such that
>> suggest that you should create a new database connection within your
>> child process targets (or code called into by the targets). This
>> seems less than optimal to me if my assumption is correct.
>
> This depends hugely on the objects you're working with. If your
> database connection uses a TCP socket, for instance, all forked
> processes will share the same socket, which will most likely result in
> interleaved writes and messed-up reads. But with a log file, that
> might be okay (especially if you have some kind of atomicity guarantee
> that ensures that individual log entries don't interleave). The
> problem isn't really the Python objects (which will have been happily
> cloned by the fork() procedure), but the OS-level resources used.
>
> With a good database like PostgreSQL, and reasonable numbers of
> workers (say, 10-50, rather than 1000-5000), you should be able to
> simply establish separate connections for each subprocess without
> worrying about performance. If you really need billions of worker
> processes, it might be best to use one of the multiprocessing module's
> queueing/semaphoring facilities and either have one process that does
> all databasing, or let them all use it but serially. But if you can
> manage with separate connections, that would be the easiest, safest,
> and simplest to debug.
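The separate-connection-per-worker pattern can be sketched with Pool's initializer hook. Here sqlite3 stands in for a real database (with PostgreSQL you'd connect via psycopg2 or similar instead); the function names are illustrative:

```python
import sqlite3
from multiprocessing import Pool

_conn = None  # per-process global, populated by the initializer

def init_worker():
    # Runs once in each worker process, so every worker gets its own
    # connection rather than sharing the parent's socket/descriptor.
    global _conn
    _conn = sqlite3.connect(":memory:")

def query(n):
    # Each call uses this worker's private connection.
    return _conn.execute("SELECT ? * ?", (n, n)).fetchone()[0]

def run():
    pool = Pool(processes=2, initializer=init_worker)
    try:
        return pool.map(query, range(4))
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    print(run())  # [0, 1, 4, 9]
```

Opening the connection in the initializer, after the fork, is what avoids the interleaved-writes problem described above: no two processes ever touch the same OS-level socket.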
>
>> 4. Related to 3, read-only objects that are initialized prior to being
>> passed into a sub-process are safe to reuse as long as they are
>> treated as being immutable. Any other objects should use one of the
>> shared memory features.
>>
>> Is this more or less correct, or am I just off my rocker?
>
> When you fork, each process will get its own clone of the objects in
> the parent. For read-only objects (module-level constants and such),
> this is fine, as you say. The issue is if you want another process to
> "see" the change you made. That's when you need some form of shared
> data.
>
> So, yes, more or less correct; at least, what you've said is mostly
> right for Unix - there may be some additional caveats for OSX
> specifically that I'm not aware of. But I expect they'll be minor;
> it's mainly Windows, which doesn't *have* fork(2), where there are
> major differences.
>
> ChrisA
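The copied-vs-shared distinction Chris describes can be demonstrated directly: a plain object mutated in a child stays mutated only in that child's copy, while a multiprocessing.Value lives in shared memory the parent can see (a minimal sketch; names are illustrative):

```python
from multiprocessing import Process, Value

counter = 0  # plain object: each process works on its own copy

def work(shared):
    global counter
    counter += 1            # changes this worker's copy only
    with shared.get_lock():
        shared.value += 1   # changes memory every process can see

def run():
    shared = Value("i", 0)  # an int in shared memory
    procs = [Process(target=work, args=(shared,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent's plain int is untouched; the shared value saw all three.
    return counter, shared.value

if __name__ == "__main__":
    print(run())  # (0, 3)
```

This is the "see the change" problem in miniature: treat cloned objects as read-only, and reach for Value, Array, or a Queue when a change genuinely has to cross the process boundary.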



-- 
Demian Brecht
http://demianbrecht.github.com


