[Python-ideas] fork - other approaches

Andrew Barnert abarnert at yahoo.com
Sun Aug 2 01:30:05 CEST 2015


On Aug 1, 2015, at 10:29, Sven R. Kunze <srkunze at mail.de> wrote:
> 
> Thanks everybody for inspiring me with alternative ways of working with pools.
> 
> I am very certain that any of them will work as intended. However, they do not zero in 100% on my main intentions:
> 
> 1) easy to understand
> 2) exchangeable (seq <-> par)
> 
> 
> A) pmap
> 
> It originates from map and allows easy exchangeability back and forth between sequential and concurrent/parallel execution.
> 
> However, I have to admit that I have difficulty changing all the 'for loops' to map (mentally as well as for real).

You probably don't have to--or want to--change all the for loops. It's very rare that you have a huge sequence of separate loops that all contribute equally to performance and are all parallelizable with the same granularity and so on. Usually, there is one loop that you want to parallelize, and that solves the problem for your entire program.

> The 'for loop' IS the most used loop construct in business applications and I do not see it going away because of something else (such as map).

Of course the for statement isn't going away. Neither are comprehensions. And neither are map and other higher-order functions. They do related but slightly different things, and a language that tried to force them all into the same construct would be an unpleasant language. That's why they've coexisted for decades in Python without any of them going away.
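For illustration, here is the same small task written all three ways (the example data is made up); each form reads best in a different situation, which is exactly why none of them is going away:

```python
words = ["spam", "ham", "eggs"]

# for statement: best when you need side effects or complex flow control.
upper_loop = []
for w in words:
    upper_loop.append(w.upper())

# Comprehension: best for building a new list from an expression.
upper_comp = [w.upper() for w in words]

# map: best when you already have a function to apply.
upper_map = list(map(str.upper, words))
```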

But you're the one who's trying to do that. In order to avoid having to learn about any other ways to write flow control, you want to change the language so you can disguise all flow control as the kind you already know how to write.

> B) with Pool()
> 
> It removes the need to close and join the pool which removes the visual clutter from the source code. That as such is great.

It also means you can't forget to clean up the pool, you can't accidentally try to use the results before they're ready, etc. The with statement is one of the key tools in using Python effectively, and I personally wouldn't trust a developer who didn't understand it to start doing multicore optimizations on my code.

Also, if you're learning from the examples at the top of the docs and haven't seen with Pool before, I suspect either you're still using Python 2.x (in which case you need to upgrade to 3.5 before you can start proposing new features for 3.6) or reading the 2.7 docs while using 3.x (in which case, don't do that).

> However, exchangeability is clearly not given, and the same understandability issue as with pmap arises.

It's still calling map, so if you don't understand even the basics of higher-order functions, I suppose you still won't understand it. But again, that's a pretty basic and key thing, and I wouldn't go assigning multicore optimization tasks to a developer who couldn't grasp the concept.

> C) apply
> 
> Nick's approach of providing a 'call_in_background' solution comes closest to what would solve the issues at hand.
> 
> However, it reminds me of apply (deprecated built-in function for calling other functions). So, a better name for it would be 'bg_apply'.

The problem with apply is that it's almost always completely unnecessary, and can be written as a one-liner when it is; its presence encouraged people from other languages where it _is_ necessary to overuse it in Python.
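To make the one-liner point concrete (the add function is just an illustration), everything Python 2's apply did is covered by argument unpacking:

```python
def add(a, b):
    return a + b

args = (2, 3)

# Python 2's apply(add, args) is spelled today as:
result = add(*args)
```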

But unfortunately, there is a bit of a mix between functions that "apply" other functions--including Pool.apply_async--and those that "call" other functions and even those that "submit" them. There's really no difference, so it would be nice if Python were consistent in the naming. And, since Pool uses the "apply" terminology, I think you may be right here.

I disagree about abbreviating background to "bg", however. You're only going to be writing this a few times in your program, but you'll be reading those few places quite often, and the fact that they're backgrounding code will likely be important to understanding and debugging that code. So I'd stick with the PEP 8 recommendation and spell it out.

But of course your mileage may vary. Since this is a function you're writing based on Nick's blog post, you can call it whatever makes sense in your particular app. (And, even if it makes it into the stdlib, there's nothing stopping you from writing
"bg_apply = apply_in_background" or "from asyncio import apply_in_background as bg_apply" if you really want to.)
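A rough sketch of what such a helper could look like, built on concurrent.futures rather than Nick's actual implementation--the name apply_in_background and the module-level pool are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Shared background pool; the size is an arbitrary choice for the sketch.
_pool = ThreadPoolExecutor(max_workers=4)

def apply_in_background(func, *args, **kwargs):
    """Submit func to the background pool and return a Future."""
    return _pool.submit(func, *args, **kwargs)

# The aliasing mentioned above is then just an assignment:
bg_apply = apply_in_background

future = bg_apply(pow, 2, 10)
```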

> All of these approaches basically rip the function call out of the programmer's view.
> 
> It is no longer
> 
>     function(arg)
> 
> but
> 
>     apply(function, arg)              # or
>     bg_apply(function, arg)           # or
>     bg_apply_many(function, args)
> 
> 
> I don't see this going well in production and in code reviews.

Using a higher-order function when there's no need for it certainly should be rejected in code review--which is why Python no longer has the "apply" function.

But using one when it's appropriate--like calling map when you want to map a function over an iterable and get back an iterable of results--is a different story. If you're afraid of doing that because you're afraid it won't pass code review, then either you have insufficient faith in your coworkers, or you need to find a new job.

> So, an expression keyword like 'fork' would still be better at least from my perspective. It would tell me: 'it's not my responsibility anymore; delegate this to someone else and get me a handle of the future result'.

You still haven't answered any of the issues I or anyone else raised with this: fork strongly implies forking new processes rather than submitting to a pool, there's no obvious or visible way to control what kind of pool you're using or how you're using it, there's nowhere to look up what kind of future-like object you get back or what its API is, it's insufficiently useful as a statement but looks clumsy and unpythonic as an expression, etc. Using Pool.map--or Executor.map, which is what I think you really want here (it provides real composable futures, it lets you switch between threads and processes in one central place, etc., and you appear to have no need for the lower-level features of the pool, like controlling batching)--avoids all of those problems.

It's worth noting that there are some languages where a solution like this could be more appropriate. For example, in a pure immutable functional language, you really could just have the user start up tasks and let the implementation decide how to pool things, how to partition them among green threads/OS threads/processes, and so on, because that would be a transparent optimization. For example, an Erlang implementation could use static analysis or runtime tracing to recognize that some processes communicate more heavily than others and partition them into OS processes in a way that minimizes the cost of that communication, and that would be pretty nifty. But a Python implementation couldn't do that, because any of those tasks might write to a shared variable that another task needs, or try to return some unpicklable object, etc. Of course the hope is that in the long run, something like PyPy's STM will be so universally usable that neither you nor the implementation will ever need to make such decisions. But until then, it has to be you, the user, who makes them.