[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Thu Oct 4 04:14:07 EDT 2018

You don't like using Pool.starmap and itertools.repeat or a comprehension
that repeats an object?
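
For concreteness, here is a minimal sketch of that alternative, assuming a dict-backed cache like the one in the quoted message below. The helper names build_args and run are hypothetical and exist only for illustration; Pool.starmap and itertools.repeat are the standard-library pieces being shown.

```python
from itertools import repeat
from multiprocessing import Pool


def func(x, big_cache):
    # Plain function: the shared object arrives as an ordinary argument.
    return big_cache[x]


def build_args(keys, big_cache):
    # Pair every key with the same object: (key, big_cache), (key, big_cache), ...
    return zip(keys, repeat(big_cache))


def run(keys, big_cache, processes=2):
    # starmap unpacks each (key, big_cache) tuple into func's positional arguments.
    with Pool(processes) as pool:
        return pool.starmap(func, build_args(keys, big_cache))


if __name__ == "__main__":
    big_cache = {str(k): k for k in range(10000)}
    keys = [str(i) for i in range(1000)]
    print(run(keys, big_cache)[:5])
```

The repeated object still travels to the workers together with the tasks, so this keeps "func" free of globals at the cost of extra serialization for a large object.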

On Wed, Oct 3, 2018, 6:30 PM Sean Harrington <seanharr11 at gmail.com> wrote:

> Hi guys -
>
> Lazily initializing an expensive object in the worker process (e.g. via
> @lru_cache) is a great solution, and one I must admit I did not think of.
> For the second use case, "*passing a large object to each worker
> process*", I also agree that your suggestion to "shelter functions in a
> different module to avoid exposure to globals" is a good option if one is
> wary of globals.
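
A minimal sketch of the lazy-initialization idea mentioned above, assuming the expensive object can be rebuilt inside each worker. The name get_big_cache and the dict it builds are hypothetical stand-ins, not code from the earlier messages.

```python
from functools import lru_cache
from multiprocessing import Pool


@lru_cache(maxsize=None)
def get_big_cache():
    # Hypothetical expensive construction; the body runs at most once per
    # process, and later calls in that process return the cached object.
    return {str(k): k for k in range(10000)}


def func(x):
    # The first call in a given worker builds the cache; later calls reuse it.
    return get_big_cache()[x]


def run(keys, processes=2):
    with Pool(processes) as pool:
        return pool.map(func, keys)


if __name__ == "__main__":
    print(run([str(i) for i in range(1000)])[:5])
```

Nothing large is passed from the parent here; each worker pays the construction cost once, which only helps when the object can be recreated in the worker.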
>
> That said, I still think "*passing a large object from parent process to
> worker processes*" should be easier when using Pool. Would either of you
> be open to something like the following?
>
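>     from multiprocessing import Pool
>
>     def func(x, big_cache=None):
>         return big_cache[x]
>
>     big_cache = {str(k): k for k in range(10000)}
>
>     # Keys of big_cache are strings, so the inputs must be strings too.
>     ls = [str(i) for i in range(1000)]
>
>     # func_kwargs is the proposed Pool argument; it does not exist today.
>     with Pool(func_kwargs={"big_cache": big_cache}) as pool:
>         pool.map(func, ls)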
>
>
> It's a much cleaner interface (which presumably requires a more difficult
> implementation) than my initial proposal. This also reads a lot better than
> the "initializer + global" recipe (clear flow of data), and is less
> constraining than the "define globals in parent" recipe. Most importantly,
> when taking sequential code and parallelizing via Pool.map, this does not
> force the user to re-implement "func" such that it consumes a global
> (rather than a kwarg). It allows "func" to be used elsewhere (e.g. in the
> parent process, from a different module, or in tests without globals).
>
> This would essentially be an efficient implementation of Pool.starmap(),
> where kwargs are static, and passed to each application of "func" over our
> iterable.
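
For comparison, static keyword arguments can be approximated today with functools.partial. This is only a sketch of that workaround, not the proposed func_kwargs API; bind_kwargs and run are hypothetical helper names.

```python
from functools import partial
from multiprocessing import Pool


def func(x, big_cache=None):
    return big_cache[x]


def bind_kwargs(function, **static_kwargs):
    # Freeze the keyword arguments so Pool.map only has to supply x.
    return partial(function, **static_kwargs)


def run(keys, big_cache, processes=2):
    bound = bind_kwargs(func, big_cache=big_cache)
    with Pool(processes) as pool:
        # The partial object, including big_cache, is pickled along with the tasks.
        return pool.map(bound, keys)


if __name__ == "__main__":
    big_cache = {str(k): k for k in range(10000)}
    print(run([str(i) for i in range(1000)], big_cache)[:5])
```

As with starmap, "func" stays usable outside the pool, but the large object is serialized with the submitted work rather than handed to each worker once, which is the inefficiency the proposal aims to remove.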
>
> Thoughts?
>
>
> On Sat, Sep 29, 2018 at 3:00 PM Michael Selik <mike at selik.org> wrote:
>
>> On Sat, Sep 29, 2018 at 5:24 AM Sean Harrington <seanharr11 at gmail.com>
>> wrote:
>> >> On Fri, Sep 28, 2018 at 4:39 PM Sean Harrington <seanharr11 at gmail.com> wrote:
>> >> > My simple argument is that the developer should not be constrained
>> >> > to make the objects passed globally available in the process, as this MAY
>> >> > break encapsulation for large projects.
>> >>
>> >> I could imagine someone switching from Pool to ThreadPool and getting
>> >> into trouble, but in my mind using threads is caveat emptor. Are you
>> >> worried about breaking encapsulation in a different scenario?
>> >
>> > Without a specific example on hand, you could imagine a tree of
>> > function calls that occur in the worker process (even newly created
>> > objects), that should not necessarily have access to objects passed from
>> > parent -> worker. In every case given the current implementation, they will.
>>
>> Echoing Antoine: If you want some functions to not have access to a
>> module's globals, you can put those functions in a different module.
>> Note that multiprocessing already encapsulates each subprocess's
>> globals in essentially a separate namespace.
>>
>> Without a specific example, this discussion is going to go around in
>> circles. You have a clear aversion to globals. Antoine and I do not.
>> No one else seems to have found this conversation interesting enough
>> to participate, yet.