[stdlib-sig] futures - a new package for asynchronous execution

Mon Nov 9 08:41:20 CET 2009

On Nov 8, 2009, at 7:01 PM, Jeffrey Yasskin wrote:

> Did you mean to drop the list? Feel free to cc them back in when you  
> reply.

No, that was a brain malfunction. Redirecting the discussion to the  
list.

> On Sat, Nov 7, 2009 at 3:31 PM, Brian Quinlan <brian at sweetapp.com> wrote:
>>
>> On 8 Nov 2009, at 06:37, Jeffrey Yasskin wrote:
>>
>>> On Sat, Nov 7, 2009 at 7:32 AM, Jesse Noller <jnoller at gmail.com> wrote:
>>>>
>>>> On Sat, Nov 7, 2009 at 10:21 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>>>>
>>>>>> Which API? My comment wasn't aimed at the API of the package - in the
>>>>>> time I got to scan it last night nothing jumped out at me as overly
>>>>>> offensive API-wise.
>>>>>
>>>>> Not offensive, but probably too complicated if it's meant to be a simple
>>>>> helper. Anyway, let's wait for the PEP.
>>>>
>>>>
>>>> The PEP is right here:
>>>>
>>>> http://code.google.com/p/pythonfutures/source/browse/trunk/PEP.txt
>>>>
>>>> I'm interested in hearing specific complaints about the API in the
>>>> context of what it's trying to *do*. The only thing which jumped out
>>>> at me was the number of methods on FutureList; but then again, each
>>>> one of those makes conceptual sense, even if they are verbose -
>>>> they're explicit on what's being done.
>>>
>>> Overall, I like the idea of having futures in the standard library,
>>> and I like the idea of pulling common bits of multiprocessing and
>>> threading into a concurrent.* package. Here's my
>>> stream-of-consciousness review of the PEP. I'll try to ** things that
>>> really affect the API.
>>>
>>> The "Interface" section should start with a conceptual description of
>>> what Executor, Future, and FutureList are. Something like "An Executor
>>> is an object you can hand tasks to, which will run them for you,
>>> usually in another thread or process. A Future represents a task that
>>> may or may not have completed yet, and which can be waited for and its
>>> value or exception queried. A FutureList is ... <haven't read that
>>> far>."
>>>
>>> ** The Executor interface is pretty redundant, and it's missing the
>>> most basic call. Fundamentally, all you need is an
>>> Executor.execute(callable) method returning None,
>>
>> How do you extract the results?
>
> To implement submit in terms of execute, you write something like:
>
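> import sys
>
> def submit(executor, callable):
>     future = Future()
>     def worker():
>         try:
>             result = callable()
>         except:
>             # Record the failure on the future instead of losing it in the worker.
>             future.set_exception(sys.exc_info())
>         else:
>             future.set_value(result)
>     # Hand the wrapper to the executor; the caller keeps the future.
>     executor.execute(worker)
>     return future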

I see. I'm not sure if that abstraction is useful but I get it now.
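
For anyone who wants to try the idea, here is a minimal runnable sketch of submit() built on execute(). It assumes the Future class that later shipped in the standard library as concurrent.futures.Future (which names the setter set_result rather than set_value) and a made-up ThreadPerTaskExecutor that only implements execute(); neither name comes from the PEP draft under discussion.

```python
import threading
from concurrent.futures import Future


class ThreadPerTaskExecutor:
    """Hypothetical executor exposing only execute(); one thread per task."""

    def execute(self, fn):
        threading.Thread(target=fn).start()


def submit(executor, fn, *args, **kwargs):
    """Build a Future-returning submit() on top of a bare execute()."""
    future = Future()

    def worker():
        try:
            result = fn(*args, **kwargs)
        except BaseException as exc:
            # Store the failure so that result()/exception() can report it.
            future.set_exception(exc)
        else:
            future.set_result(result)

    executor.execute(worker)
    return future


executor = ThreadPerTaskExecutor()
ok = submit(executor, pow, 2, 10)
bad = submit(executor, int, "not a number")
print(ok.result(timeout=5))                      # 1024
print(type(bad.exception(timeout=5)).__name__)   # ValueError
```

The executor itself never sees a Future; everything future-related lives in the helper, which is the layering Jeffrey is arguing for.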

>>> and all the
>>> future-oriented methods can be built on top of that. I'd support using
>>> Executor.submit(callable) for the simplest method instead, which
>>> returns a Future, but there should be some way for implementers to
>>> only implement execute() and get submit either for free or with a
>>> 1-line definition. (I'm using method names from
>>>
>>> http://java.sun.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html
>>> in case I'm unclear about the semantics here.) run_to_futures,
>>> run_to_results, and map should be implementable on top of the Future
>>> interface, and shouldn't need to be methods on Executor. I'd recommend
>>> they be demoted to helper functions in the concurrent.futures module
>>> unless there's a reason they need to be methods, and that reason
>>> should be documented in the PEP.
>>>
>>> ** run_to_futures() shouldn't take a return_when argument. It should
>>> be possible to wait for those conditions on any list of Futures.
>>> (_not_ just a FutureList)
>>
>> I packaged up Futures into FutureLists to fix an annoyance that I have with
>> the Java implementation - you have all of these Future objects but no
>> convenient way of operating over them.
>
> Yep, I totally agree with that annoyance. Note, though, that Java has
> the CompletionService to support nearly the same use cases as
> run_to_futures.

CompletionService's use case is handling results as they finish (just  
like the callbacks do in Deferreds).

The FutureList use case is querying e.g. which callables raised, which  
returned, which are still running?
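
As a rough illustration of that kind of query over a plain list of futures, here is a sketch written against the concurrent.futures module as it later shipped in the standard library. The partition() helper and its four-way split are hypothetical, not part of the PEP draft. Each future is inspected exactly once, so every future lands in exactly one bucket, although the "unfinished" bucket can be stale by the time the caller looks at it.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def partition(futures):
    """Split futures into (returned, raised, cancelled, unfinished) lists."""
    returned, raised, cancelled, unfinished = [], [], [], []
    for f in futures:
        if not f.done():
            unfinished.append(f)
        elif f.cancelled():
            # Checked before exception(), which raises on a cancelled future.
            cancelled.append(f)
        elif f.exception() is not None:
            raised.append(f)
        else:
            returned.append(f)
    return returned, raised, cancelled, unfinished


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(int, text) for text in ("1", "x", "3")]
    wait(futures)  # block until every call has finished
    returned, raised, cancelled, unfinished = partition(futures)

print(len(returned), len(raised), len(cancelled), len(unfinished))  # 2 1 0 0
```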

>> I made the FutureList the unit of waiting because:
>> 1. I couldn't think of a use case where this wasn't sufficient
>
> Take your webcrawl example. In a couple years, when Futures are widely
> accepted, it's quite possible that urllib.request.urlopen() will
> return a Future instead of a file. Then I'd like to request a bunch of
> URLs and process each as they come back. With the run_to_futures (or
> CompletionService) API, urllib would instead have to take a set of
> requests to open at once, which makes its API much harder to design.
> With a wait-for-any function, urllib could continue to return a single
> Future and let its users combine several results.

If we go down this road then we should just switch to Twisted :-)

Seriously, the idea is that no one would ever change their API to  
accommodate futures - they are a way of making a library with no  
notion of concurrency concurrent.

But I am starting to be convinced that individual futures are a good  
idea because it makes the run/submit method easier to use.

> Alternately, say you have an RPC system returning Futures. You've sent
> off RPCs A, B, and C. Now you need two separate subsystems D and E to
> do something with the results, except that D can continue when either
> A or B finishes, but E can continue when either B or C finishes. Can D
> and E express just what they need to express, or do they have to deal
> with futures they don't really care about?
>
>> 2. It makes the internal locking semantics a bit easier and faster (if you
>> can wait on any future then the wait has to acquire a lock for every future
>> [in a consistent order to prevent deadlocks when other threads are doing the
>> same thing with an intersecting set of futures], add a result listener for
>> each and then create some sort of object to aggregate their state)
>
> Yep. I suspect the extra overhead isn't significant compared to the
> cost of scheduling threads.
>
>> But I am thinking that maybe FutureLists aren't the right abstraction.
>>
>>> The code sample looks like Executor is a context manager. What does
>>> its __exit__ do? shutdown()? shutdown&awaitTermination? I prefer
>>> waiting in Executor.__exit__, since that makes it easier for users to
>>> avoid having tasks run after they've cleaned up data those tasks
>>> depend on. But that could be my C++ bias, where we have to be sure to
>>> free memory in the right places. Frank, does Java run into any
>>> problems with people cleaning things up that an Executor's tasks
>>> depend on without waiting for the Executor first?
>>>
>>> shutdown should explain why it's important. Specifically, since the
>>> Executor controls threads, and those threads hold a reference to the
>>> Executor, nothing will get garbage collected without the explicit
>>> call.
>>
>> Actually, the threads hold a weakref to the Executor so they can exit (when
>> the work queue is empty) if the Executor is collected. Here is the code from
>> futures/thread.py:
>>
>>  while True:
>>      try:
>>          work_item = work_queue.get(block=True, timeout=0.1)
>>      except queue.Empty:
>>          executor = executor_reference()
>>          # Exit if:
>>          #   - The interpreter is shutting down OR
>>          #   - The executor that owns the worker has been collected OR
>>          #   - The executor that owns the worker has been shut down.
>>          if _shutdown or executor is None or executor._shutdown:
>>              return
>
> Oh, yeah, that sounds like it might work. So why does shutdown exist?

It does work - there are tests and everything :-)

.shutdown exists for the same reason that .close exists on files:
- Python does not guarantee any particular GC strategy
- tracebacks and other objects may retain a reference in an unexpected  
way
- sometimes you want to free your resources before the function exits
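
To make the file analogy concrete, here is a small sketch of both styles: an explicit shutdown() in a finally block, and a with block. It assumes the ThreadPoolExecutor that later shipped in the standard library's concurrent.futures, where leaving the with block calls shutdown(wait=True); the draft discussed in this thread may behave differently. The slow_append helper is made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

results = []


def slow_append(value):
    """Hypothetical task: pretend to work, then record the value."""
    time.sleep(0.05)
    results.append(value)


# Explicit style, analogous to f = open(...) / f.close().
pool = ThreadPoolExecutor(max_workers=2)
try:
    for i in range(4):
        pool.submit(slow_append, i)
finally:
    pool.shutdown(wait=True)  # returns only after queued tasks have run
after_explicit = sorted(results)

# Context-manager style, analogous to "with open(...) as f:".
with ThreadPoolExecutor(max_workers=2) as pool2:
    for i in range(4, 8):
        pool2.submit(slow_append, i)
after_with = sorted(results)

# A shut-down executor refuses new work instead of silently dropping it.
try:
    pool.submit(slow_append, 99)
    rejected = False
except RuntimeError:
    rejected = True

print(after_explicit, after_with, rejected)
```

Either way the resources are released at a point the caller chooses rather than whenever the garbage collector gets to them.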

>
>>> ** What happens when FutureList.wait(FIRST_COMPLETED) is called twice?
>>> Does it return immediately the second time? Does it wait for the
>>> second task to finish? I'm inclined to think that FutureList should go
>>> away and be replaced by functions that just take lists of Futures.
>>
>> It waits until a new future is completed.
>
> That seems confusing, since it's no longer the "FIRST" completed.

Maybe "NEXT_COMPLETED" would be better.
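
For comparison, here is a sketch of how "wait for the next one" can be expressed with a free function over a plain set of futures, which is the shape Jeffrey suggests. It uses concurrent.futures.wait as it later shipped in the standard library, not the FutureList API from the draft. The caller passes only the still-pending futures to each call, so FIRST_COMPLETED keeps its literal meaning. The sleep_then_return helper is made up for the example.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def sleep_then_return(seconds):
    """Hypothetical task that finishes after the given delay."""
    time.sleep(seconds)
    return seconds


order = []
with ThreadPoolExecutor(max_workers=3) as pool:
    pending = {pool.submit(sleep_then_return, s) for s in (0.3, 0.1, 0.2)}
    while pending:
        # Blocks until at least one pending future finishes, then hands
        # back the finished ones and the remaining ones as separate sets.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            order.append(future.result())

print(order)  # results in completion order, not submission order
```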

Cheers,
Brian

>
>>> In general, I think the has_done_futures(), exception_futures(), etc.
>>> are fine even though their results may be out of date by the time you
>>> inspect them. That's because any individual Future goes monotonically
>>> from not-started->running->(exception|value), so users can take
>>> advantage of even an out-of-date done_futures() result. However, it's
>>> dangerous to have several query functions, since users may think that
>>> running_futures() `union` done_futures() `union` cancelled_futures()
>>> covers the whole FutureList, but instead a Future can move between two
>>> of the sets between two of those calls. Instead, perhaps an atomic
>>> partition() function would be better, which returns a collection of
>>> sub-lists that cover the whole original set.
>>>
>>> I would rename result() to get() (or maybe Antoine's suggestion of
>>> __call__) to match Java. I'm not sure exception() needs to exist.
>>>
>>> --- More general points ---
>>>
>>> ** Java's Futures made a mistake in not supporting work stealing, and
>>> this has caused deadlocks at Google. Specifically, in a bounded-size
>>> thread or process pool, when a task in the pool can wait for work
>>> running in the same pool, you can fill up the pool with tasks that are
>>> waiting for tasks that haven't started running yet. To avoid this,
>>> Future.get() should be able to steal the task it's waiting on out of
>>> the pool's queue and run it immediately.
>>>
>>> ** I think both the Future-oriented blocking model and the
>>> callback-based model Deferreds support are important for different
>>> situations. Futures tend to be easier to program with, while Deferreds
>>> use fewer threads and can have more predictable latencies. It should
>>> be possible to create a Future from a Deferred or a Deferred from a
>>> Future without taking up a thread.