[Python-ideas] Concurrency Modules

Andrew Barnert abarnert at yahoo.com
Sat Jul 11 22:56:31 CEST 2015


On Jul 11, 2015, at 08:00, Nikolaus Rath <Nikolaus at rath.org> wrote:
> 
>> On Jul 10 2015, Nick Coghlan <ncoghlan-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
>>> On 10 July 2015 at 12:09, Chris Angelico <rosuav-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
>>>> On Fri, Jul 10, 2015 at 8:53 AM, Sven R. Kunze <srkunze-7y4VAllY4QU at public.gmane.org> wrote:
>>>> After discussing the whole topic and reading up on it further, it became
>>>> clear to me what's actually missing in Python: a definitive guide to
>>>> why/when a certain concurrency module is supposed to be used
>>> 
>>> I'm not sure how easy the decisions will be in all cases, but
>>> certainly some broad guidelines would be awesome. (The exact analysis
>>> of "when should I use threads and when should I use processes" is a
>>> big enough one that there've been a few million blog posts on the
>>> subject, and I doubt that asyncio will shrink that.) A basic summary
>>> would be hugely helpful. "Here's four similar modules, and why they
>>> all exist in the standard library."
>> 
>> Q: Why are there four different modules?
>> A: Because they solve different problems.
>> Q: What are those problems?
>> A: How long have you got?
>> 
>> Choosing an appropriate concurrency model for a problem is one of the
>> hardest tasks in software architecture design. The only way to make it
>> appear simple is to focus in on a specific class of problems where
>> there *is* a single clearly superior answer for that problem domain :)
> 
> But even just documenting this subset would already provide a lot of
> improvement over the status quo.
> 
> If for each module there were an example of a problem that's clearly
> best solved with that module rather than any of the others, that would
> be a perfectly good answer to the question of why they all exist.

Assuming coroutines/asyncio are not the answer to your problem, it's not really a choice between 3 modules; rather, there are 3 separate binary decisions to make (pool vs. individual workers, futures-style API vs. plain API, processes vs. threads), which lead to 6 different possibilities (not 8, because the 2 combinations of individual workers with a futures-style API are less useful, so Python doesn't have them): futures.ProcessPoolExecutor, futures.ThreadPoolExecutor, multiprocessing.Pool, multiprocessing.dummy.Pool (unfortunately, this is where thread pools lie...), multiprocessing.Process, or threading.Thread.
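
To make that menu concrete, here's a rough, untested sketch of the same trivial job written against all six options (do_one is just a stand-in for whatever your real work is):

    from concurrent import futures
    import multiprocessing
    import multiprocessing.dummy
    import threading

    def do_one(arg):
        return arg * arg                                 # stand-in for the real work

    if __name__ == '__main__':
        args = list(range(10))

        # Pools, futures-style API:
        with futures.ProcessPoolExecutor() as ex:        # processes
            results = list(ex.map(do_one, args))
        with futures.ThreadPoolExecutor(4) as ex:        # threads
            results = list(ex.map(do_one, args))

        # Pools, plain pool API:
        with multiprocessing.Pool() as pool:             # processes
            results = pool.map(do_one, args)
        with multiprocessing.dummy.Pool(4) as pool:      # threads, despite the name
            results = pool.map(do_one, args)

        # One worker per task:
        procs = [multiprocessing.Process(target=do_one, args=(a,)) for a in args]
        thrds = [threading.Thread(target=do_one, args=(a,)) for a in args]
        for w in procs + thrds:
            w.start()
        for w in procs + thrds:
            w.join()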

Explaining pools vs. separate threads is pretty easy. If you're doing a whole bunch of similar things (download 1000 files, do this computation on every row of a giant matrix), you want pools; if you're doing distinctly different things (update the backup for this file, send that file to the printer, and download the updated version from the net), you don't.
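
In made-up code (download, update_backup, send_to_printer, and fetch_update are placeholders, not real APIs), the contrast looks like this:

    import threading
    from concurrent import futures

    def download(url): pass            # placeholder: fetch one file
    def update_backup(path): pass      # placeholder
    def send_to_printer(path): pass    # placeholder
    def fetch_update(url): pass        # placeholder

    urls = ['http://example.com/file%d' % i for i in range(1000)]

    # 1000 similar downloads: a pool with a handful of workers
    with futures.ThreadPoolExecutor(8) as ex:
        list(ex.map(download, urls))

    # three distinctly different jobs: one thread each
    jobs = [threading.Thread(target=update_backup, args=('data.db',)),
            threading.Thread(target=send_to_printer, args=('report.pdf',)),
            threading.Thread(target=fetch_update, args=('http://example.com/new',))]
    for t in jobs:
        t.start()
    for t in jobs:
        t.join()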

Explaining plain pools vs. executors is a little trickier, because for the simplest cases there's no obvious difference. Coming up with a case where you need to compose futures isn't that hard; coming up with a case where you need one of the lower-level pool features (like explicitly managing batching) without getting too artificial to be meaningful or too complicated to serve as an example is a bit harder. But still not that big of a problem.
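
For example (again just a sketch, with download_one, parse, and crunch as invented stand-ins): an executor lets you pipeline one stage into the next as futures complete, while a plain Pool gives you lower-level knobs like chunksize for explicit batching:

    from concurrent import futures
    import multiprocessing

    def download_one(url): return 'contents of %s' % url   # placeholder
    def parse(text): return len(text)                       # placeholder
    def crunch(row): return sum(row)                        # placeholder

    if __name__ == '__main__':
        urls = ['http://example.com/%d' % i for i in range(20)]
        rows = [[i, i + 1, i + 2] for i in range(10000)]

        # Executors: futures compose, so you can feed one stage into the next
        with futures.ThreadPoolExecutor(4) as ex:
            downloads = [ex.submit(download_one, u) for u in urls]
            parsed = [ex.submit(parse, f.result())
                      for f in futures.as_completed(downloads)]
            results = [f.result() for f in parsed]

        # Plain pools: lower-level control, e.g. explicit batching
        with multiprocessing.Pool() as pool:
            totals = pool.map(crunch, rows, chunksize=500)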

Explaining threads vs. processes is really two separate questions.

First, if you're looking at concurrency to speed up your code, and your code is CPU-bound, then your answer to the other question (shared state, below) doesn't matter; you need processes. (Unless you're using a C extension that releases the GIL, or using Jython instead of CPython, or ...)
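
A quick way to see it for yourself (rough, untested timing sketch; the exact numbers depend entirely on your machine and interpreter):

    import time
    from concurrent import futures

    def burn(n):                      # pure-Python CPU work; holds the GIL
        total = 0
        for i in range(n):
            total += i * i
        return total

    if __name__ == '__main__':
        jobs = [2000000] * 8
        for kind in (futures.ThreadPoolExecutor, futures.ProcessPoolExecutor):
            with kind(4) as ex:
                start = time.time()
                list(ex.map(burn, jobs))
                print(kind.__name__, time.time() - start)
        # On CPython, the thread pool is no faster than doing the work serially,
        # because only one thread runs Python bytecode at a time; the process
        # pool actually uses multiple cores.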

So finally we get to the big problem: shared state. Even ignoring the Python- and CPython-specific issues (forking, what the GIL makes atomic, ...), just explaining the basic ideas of what shared state means, when you need it, why you're usually wrong about that, what races are, how to synchronize, why mutability matters... is that really something that can fit into a HOWTO?
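
Even the standard toy demonstration takes real explaining (sketch only; how often the race actually bites depends on the interpreter):

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1          # read-modify-write: two threads can interleave here

    threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                # often less than 400000 on CPython

    # The usual fix: make the read-modify-write atomic with a lock
    lock = threading.Lock()

    def bump_safely(n):
        global counter
        for _ in range(n):
            with lock:
                counter += 1

    counter = 0
    threads = [threading.Thread(target=bump_safely, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                # reliably 400000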

But if you punt on that and just say "until you know what you're doing, everything should be written in the message-passing-tasks style", you might as well skip the whole HOWTO and say "always use concurrent.futures.ProcessPoolExecutor".
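
That is, something like this sketch, where each task's only input is its argument and its only output is its return value, so there's nothing shared to race on:

    from concurrent import futures

    def handle(task):             # everything comes in as an argument...
        return task * 2           # ...and goes back out as a return value

    if __name__ == '__main__':
        with futures.ProcessPoolExecutor() as ex:
            print(list(ex.map(handle, range(20))))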



