[Tutor] using multiprocessing efficiently to process large data file

eryksun eryksun at gmail.com
Sun Sep 2 09:16:26 CEST 2012


On Sun, Sep 2, 2012 at 2:41 AM, Alan Gauld <alan.gauld at btinternet.com> wrote:
>
>>      if __name__ == '__main__': # <-- required for Windows
>
> Why?
> What difference does that make in Windows?

It's a workaround for the fact that Win32 has no fork(). On Windows,
multiprocessing starts each worker with CreateProcess(), which loads a
fresh interpreter. The child then re-imports the main module under a
different name (i.e. not '__main__'), so the guarded block is skipped.
Without the guard, each child would run the module's top-level code
again and try to create another Pool, and so on.
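
Here is a minimal sketch of the guard, not the code from the earlier
message. The names square and run_pool are made up for illustration.
Only the function definitions live at module level, so re-importing the
module in a child has no side effects:

```python
import multiprocessing as mp


def square(x):
    # Defined at module level so a re-imported child can find it by name.
    return x * x


def run_pool(values):
    # Creating the Pool inside a function keeps it out of import-time code.
    pool = mp.Pool(processes=2)
    try:
        return pool.map(square, values)
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    # Only the original process runs this; a Windows child imports the
    # module under another name and skips it.
    print(run_pool(range(5)))
```

If the run_pool() call were moved out of the guard to module level, every
Windows child would hit it again while importing the module.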

This is also why worker processes can't simply inherit global data in
Windows. A forked process in Linux uses copy-on-write, so you can load
a large block of data before calling fork() and share it. In Windows
the module is executed separately for each process, so each has its own
copy. To share data in Windows, I think the fastest option is to use a
ctypes shared Array (e.g. multiprocessing.Array). The example I wrote
is just using the default Pool setup, which serializes (pickles) data
over pipes.
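
A rough sketch of the shared Array idea, assuming multiprocessing.Array
is passed to each worker through the Pool initializer. The names
init_worker, double_slot and shared_data are hypothetical, and I haven't
benchmarked this against the pickling approach:

```python
import multiprocessing as mp

shared_data = None  # set in each worker by init_worker()


def init_worker(shared):
    # Runs once per worker process; stash the shared array in a global.
    global shared_data
    shared_data = shared


def double_slot(i):
    # Each task touches only its own index, so no extra locking is needed
    # beyond what the synchronized wrapper already does per access.
    shared_data[i] *= 2
    return i


def main():
    # 'd' is a C double; the array lives in shared memory, not in a pickle.
    shared = mp.Array('d', [1.0, 2.0, 3.0, 4.0])
    pool = mp.Pool(processes=2, initializer=init_worker, initargs=(shared,))
    pool.map(double_slot, range(len(shared)))
    pool.close()
    pool.join()
    return shared[:]  # the parent sees the workers' writes


if __name__ == '__main__':
    print(main())
```

The array is handed over via initargs because it has to be passed at
process creation. Sending it as an ordinary map() argument would try to
pickle it through the task queue, which multiprocessing refuses to do.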

FYI, the Win32 API imposes the requirement to use CreateProcess(). The
native NT kernel has no problem forking (e.g. for the POSIX
subsystem). I haven't looked closely enough to know why they didn't
implement fork() in Win32.

