[issue38559] async generators aclose() behavior in 3.8

Nathaniel Smith report at bugs.python.org
Tue Oct 22 18:35:10 EDT 2019


Nathaniel Smith <njs at pobox.com> added the comment:

I think conceptually, cancelling all tasks and waiting for them to exit is the right thing to do. That way you run as much shutdown code as possible in regular context, and also avoid this problem – if there are no tasks, then there can't be any tasks blocked in __anext__/asend/athrow/aclose, so calling aclose is safe.

This is also the right way to handle it from Trio's perspective – we always wait for all tasks to exit before exiting our run loop. For asyncio it might be a bit more complicated though so we should think through the trade-offs.

> The only way (at least known to me) that that could be the case is if some task ignores cancellation.

Looking at runners.py, it seems like asyncio.run doesn't just call task.cancel(), but actually waits for all tasks to exit. (See the call to gather() inside _cancel_all_tasks.) So I think that does guarantee that by the time we hit shutdown_asyncgens(), we can be certain that all tasks have exited. (If a task ignores cancellation, then that will cause _cancel_all_tasks to hang, so we never reach shutdown_asyncgens at all. Not ideal, but not really our problem either – if someone complains we just tell them they need to stop ignoring cancellation if they want their program to shut down cleanly.)
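For reference, the shutdown path can be sketched roughly like this – a paraphrase of _cancel_all_tasks in runners.py, not the exact stdlib code (the helper name here is made up):

```python
import asyncio

def cancel_all_tasks_sketch(loop):
    """Rough paraphrase of what asyncio.run's shutdown does."""
    to_cancel = asyncio.all_tasks(loop)
    if not to_cancel:
        return
    for task in to_cancel:
        task.cancel()
    # The crucial part: gather() means we *wait* for every task to
    # actually finish unwinding.  A task that swallows cancellation
    # hangs here, so shutdown_asyncgens() is never reached while a
    # task is still blocked inside __anext__/asend/athrow/aclose.
    loop.run_until_complete(
        asyncio.gather(*to_cancel, return_exceptions=True))
```

The point is that the gather() makes the "all tasks have exited" guarantee structural, not best-effort.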

So I think asyncio.run is actually totally fine with the 3.8.0 behavior. It's only explicit calls to shutdown_asyncgens that might run into this, and I think that's probably OK? If you're calling shutdown_asyncgens by hand then you're kind of taking responsibility for dealing with any subtle issues around shutdown.

And if we did allow aclose() to run at any time, then I worry that it could cause serious problems. Users can call aclose() at will, and it's generally good practice to always call aclose() on your async generators explicitly. So it's entirely possible that users will accidentally call aclose() themselves while another task is blocked in __anext__. And in that case... what do we do?

The aclose call has to consume the generator's frame, by definition. But in the meantime, we still have this __anext__ coroutine object hanging around pointing to the consumed frame, which the task scheduler thinks needs to be resumed at some point in response to some event, and that can cause all kinds of weird situations. The __anext__ might try to resume while the aclose() is running. The __anext__ might get stuck and never resume, because the code run by aclose() accidentally unregisters the event that was going to wake it up, so you just get a deadlock instead of a useful error. It might cause some other code entirely to crash – for example, if the __anext__ was blocked inside a call to 'await queue.get()', then aclose() interrupting that might corrupt the queue's internal state, so the next time some other task calls queue.put() they get an exception or other incorrect behavior. It all seems really tricky and messy to me.
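For what it's worth, on 3.8 (the behavior under discussion here) trying this by hand doesn't silently consume the frame – aclose() refuses to run while the generator is mid-step and raises RuntimeError. A minimal sketch (agen/main/step are made-up names for illustration):

```python
import asyncio

async def agen():
    while True:
        # Suspend inside the generator body, i.e. mid-way through
        # an __anext__ step (the generator counts as "running" here).
        await asyncio.sleep(10)
        yield 1

async def main():
    gen = agen()
    # One task blocks inside __anext__ ...
    step = asyncio.ensure_future(gen.__anext__())
    await asyncio.sleep(0)  # let it start and suspend
    # ... while we call aclose() "by hand" at the same time.
    try:
        await gen.aclose()
        msg = "aclose succeeded"
    except RuntimeError as exc:
        msg = f"RuntimeError: {exc}"
    # Clean up the pending __anext__ task.
    step.cancel()
    try:
        await step
    except asyncio.CancelledError:
        pass
    return msg

print(asyncio.run(main()))
```

So the guard turns the "consumed frame with a pending __anext__" scenario into an immediate, visible error instead of the subtle corruption described above.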

So... this is super subtle and confusing, but I *think* the conclusion is that yeah, 3.8.0 is fine and there's no urgent action needed.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38559>
_______________________________________

