From pfreixes at gmail.com  Mon Jan  1 12:02:34 2018
From: pfreixes at gmail.com (Pau Freixes)
Date: Mon, 1 Jan 2018 18:02:34 +0100
Subject: [Async-sig] Asyncio loop instrumentation
In-Reply-To: <20171231200247.755b5c42@fsol>
References: <20171231200247.755b5c42@fsol>
Message-ID:

Hi Antoine,

Regarding your questions:

> What does it mean exactly? Is it the ratio of CPU time over wall clock
> time?

It can be considered a metric that tells you how much CPU your loop is consuming. In the best-case scenario, where yours is the only process running, this metric will match the CPU usage - note that it matches the usage of the specific CPU where your process is executed. With many processes fighting for the same CPU, this number will differ significantly from the system's CPU metric, since the resources are being divided among many consumers. So I would stress that this load is relative to your loop, rather than an objective value taken from the CPU metric.

To do the same with `psutil` you would have to gather the CPU usage of the specific CPU where your loop is currently running. Not an impossible problem, but it turns something trivial into something more complicated. In the case of `time.thread_time` I can't see how I could do it at all: you would gather information about the thread where your loop is running, but there is nothing straightforward that would let you account for other threads that are fighting for that same CPU.

The solution presented is not perfect, and there are still some corner cases where the load factor might not be accurate enough. The way the `load` method guesses whether the loop is fighting for CPU resources with other processes is basically by attributing, at most, the timeout as sleeping time. For example:

from time import time
from select import select

t0 = time()
select(fds, [], [], 1)  # fds: the descriptors registered with the loop
t1 = time()
sleeping_time = min(t1 - t0, 1)  # at most the timeout counts as sleeping

Therefore, if the call to select took more than 1 second - because the scheduler decided to give the CPU to another process - the time that goes beyond 1 second is counted as resource-usage time. As you can imagine, the problem with that is what happens when select became ready before 1 second had passed but the scheduler did not give the CPU back because a higher-priority process was running: in that case, that waiting time will still be attributed as sleeping time.
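To make that concrete, here is a rough sketch of how this per-tick accounting could produce a load factor (hypothetical code, not the actual POC implementation; `rlist` stands for whatever descriptors the loop has registered):

import select
import time

TIMEOUT = 1.0  # poll timeout: at most this much of a tick counts as sleeping

busy = slept = 0.0

def measure_tick(rlist):
    global busy, slept
    t0 = time.monotonic()
    select.select(rlist, [], [], TIMEOUT)
    elapsed = time.monotonic() - t0
    slept += min(elapsed, TIMEOUT)       # capped: at most TIMEOUT is sleeping
    busy += max(elapsed - TIMEOUT, 0.0)  # overrun: we were starved of CPU
    # the time spent running the loop's callbacks would also go into `busy`

def load():
    # 0.0 means a fully idle loop; values close to 1.0 mean saturation
    return busy / (busy + slept) if busy + slept else 0.0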
>> For this proposal [4], POC, I've preferred make a reduced list of events:
>>
>> * `loop_start` : Executed when the loop starts for the first time.
>> * `tick_start` : Executed when a new loop tick is started.
>> * `io_start` : Executed when a new IO process starts.
>> * `io_end` : Executed when the IO process ends.
>> * `tick_end` : Executed when the loop tick ends.
>> * `loop_stop` : Executed when the loop stops.
>
> What do you call a "IO process" in this context?

Basically the call to the `select/poll/whatever` syscall that will ask for read or write to a set of file descriptors.

Thanks,

--
--pau

From songofacandy at gmail.com  Mon Jan  1 19:34:13 2018
From: songofacandy at gmail.com (INADA Naoki)
Date: Tue, 2 Jan 2018 09:34:13 +0900
Subject: [Async-sig] Asyncio loop instrumentation
In-Reply-To:
References: <20171231200247.755b5c42@fsol>
Message-ID:

>>> For this proposal [4], POC, I've preferred make a reduced list of events:
>>>
>>> * `loop_start` : Executed when the loop starts for the first time.
>>> * `tick_start` : Executed when a new loop tick is started.
>>> * `io_start` : Executed when a new IO process starts.
>>> * `io_end` : Executed when the IO process ends.
>>> * `tick_end` : Executed when the loop tick ends.
>>> * `loop_stop` : Executed when the loop stops.
>>
>> What do you call a "IO process" in this context?
>
> Basically the call to the `select/poll/whatever` syscall that will ask
> for read or write to a set of file descriptors.

The `select/poll/whatever` syscalls don't ask for read or write. They wait for read or write (more accurately, they wait for a readable or writable state).

So poll_start / poll_end look like better names to me.

INADA Naoki

>
> Thanks,
>
> --
> --pau
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

From pfreixes at gmail.com  Tue Jan  2 11:32:12 2018
From: pfreixes at gmail.com (Pau Freixes)
Date: Tue, 2 Jan 2018 17:32:12 +0100
Subject: [Async-sig] Asyncio loop instrumentation
In-Reply-To: <2B052CB7-FFCB-494C-97BA-DA8859B49598@gmail.com>
References: <20171231200247.755b5c42@fsol>
 <2B052CB7-FFCB-494C-97BA-DA8859B49598@gmail.com>
Message-ID:

Hi Yury,

It's good to know that we are on the same page about the lack of a feature that should be a must-have.

Since asyncio has become stable and widely used by many organizations - such as ours [1] - the need for tools that let us instrument asynchronous code running on top of asyncio has grown. A good example is how some changes in aiohttp were implemented [2] - disclaimer, I'm the author of that code - to let developers gather more information about how HTTP calls perform at both layers: application and protocol.
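For illustration, the client tracing added in [2] can be hooked up roughly like this (a sketch against aiohttp's TraceConfig interface; the timing logic is only an example):

import asyncio
import aiohttp

async def on_request_start(session, ctx, params):
    ctx.start = asyncio.get_event_loop().time()

async def on_request_end(session, ctx, params):
    elapsed = asyncio.get_event_loop().time() - ctx.start
    print("request to %s took %.3fs" % (params.url, elapsed))

trace_config = aiohttp.TraceConfig()
trace_config.on_request_start.append(on_request_start)
trace_config.on_request_end.append(on_request_end)

async def fetch(url):
    async with aiohttp.ClientSession(trace_configs=[trace_config]) as session:
        async with session.get(url) as resp:
            await resp.read()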
This proposal, just a POC, goes in the same direction and tries to close that gap for the event loop. The related work on the `load` method is circumstantial, but it helps to understand why this feature is so important.

I still believe that we can start to fill the gap for Python 3.7; if the window to implement it closes before all of the work is done, at least part of the work will have been done.

I still have some questions to be answered that might help focus this work in the right way. A few of them concern the rationale - for example, how coupled this feature has to be to the AbstractLoop, making it a specification for other loop implementations - and others are purely technical. But it's true that we should only press on with these questions if we believe that we can take advantage of all of this effort.

Regards,

[1] https://medium.com/@SkyscannerEng/running-aiohttp-at-scale-2656b7a83a09
[2] https://github.com/aio-libs/aiohttp/pull/2429

On Sun, Dec 31, 2017 at 8:12 PM, Yury Selivanov wrote:
> When PEP 567 is accepted, I plan to implement advanced instrumentation in uvloop, to monitor basically all io/callback/loop events. I'm still -1 to do this in asyncio at least in 3.7, because I'd like us to have some time to experiment with such instrumentation in real production code (preferably at scale)
>
> Yury
>
> Sent from my iPhone
>
>> On Dec 31, 2017, at 10:02 PM, Antoine Pitrou wrote:
>>
>> On Sun, 31 Dec 2017 18:32:21 +0100
>> Pau Freixes wrote:
>>>
>>> These new implementation of the load method - remember that it returns
>>> a load factor between 0.0 and 1.0 that inform you about how bussy is
>>> your loop -
>>
>> What does it mean exactly? Is it the ratio of CPU time over wall clock
>> time?
>>
>> Depending on your needs, the `psutil` library (*) and/or the new
>> `time.thread_time` function (**) may also help.
>>
>> (*) https://psutil.readthedocs.io/en/latest/
>> (**) https://docs.python.org/3.7/library/time.html#time.thread_time
>>
>>> For this proposal [4], POC, I've preferred make a reduced list of events:
>>>
>>> * `loop_start` : Executed when the loop starts for the first time.
>>> * `tick_start` : Executed when a new loop tick is started.
>>> * `io_start` : Executed when a new IO process starts.
>>> * `io_end` : Executed when the IO process ends.
>>> * `tick_end` : Executed when the loop tick ends.
>>> * `loop_stop` : Executed when the loop stops.
>>
>> What do you call a "IO process" in this context?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> _______________________________________________
>> Async-sig mailing list
>> Async-sig at python.org
>> https://mail.python.org/mailman/listinfo/async-sig
>> Code of Conduct: https://www.python.org/psf/codeofconduct/
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

--
--pau

From pfreixes at gmail.com  Tue Jan  2 12:00:18 2018
From: pfreixes at gmail.com (Pau Freixes)
Date: Tue, 2 Jan 2018 18:00:18 +0100
Subject: [Async-sig] Asyncio loop instrumentation
In-Reply-To:
References: <20171231200247.755b5c42@fsol>
Message-ID:

Agreed, poll_start and poll_end suit much better.

Thanks for the feedback.

On Tue, Jan 2, 2018 at 1:34 AM, INADA Naoki wrote:
>>>> For this proposal [4], POC, I've preferred make a reduced list of events:
>>>>
>>>> * `loop_start` : Executed when the loop starts for the first time.
>>>> * `tick_start` : Executed when a new loop tick is started.
>>>> * `io_start` : Executed when a new IO process starts.
>>>> * `io_end` : Executed when the IO process ends.
>>>> * `tick_end` : Executed when the loop tick ends.
>>>> * `loop_stop` : Executed when the loop stops.
>>>
>>> What do you call a "IO process" in this context?
>>
>> Basically the call to the `select/poll/whatever` syscall that will ask
>> for read or write to a set of file descriptors.
>
> The `select/poll/whatever` syscalls don't ask for read or write. They
> wait for read or write (more accurately, they wait for a readable or
> writable state).
>
> So poll_start / poll_end look like better names to me.
>
> INADA Naoki
>
>
>>
>> Thanks,
>>
>> --
>> --pau
>> _______________________________________________
>> Async-sig mailing list
>> Async-sig at python.org
>> https://mail.python.org/mailman/listinfo/async-sig
>> Code of Conduct: https://www.python.org/psf/codeofconduct/

--
--pau

From yselivanov at gmail.com  Tue Jan  2 12:46:04 2018
From: yselivanov at gmail.com (Yury Selivanov)
Date: Tue, 2 Jan 2018 20:46:04 +0300
Subject: [Async-sig] Asyncio loop instrumentation
In-Reply-To:
References: <20171231200247.755b5c42@fsol>
Message-ID:

I understand why it could be useful to have this in asyncio. But I'm a big -1 on rushing this functionality into 3.7. asyncio is no longer provisional, so we have to be careful when we design new APIs for it.

Example: I wanted to add support for task groups to asyncio. A similar concept exists in curio and trio and I like it; it can be a big improvement over asyncio.gather. But there are too many caveats about handling multiple exceptions properly (MultiError?) and some issues with cancellation. That's why I decided that it's safer to prototype TaskGroups in a separate package than to push a poorly thought out new API in 3.7.

Same applies to your proposal.
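For instance, such an experiment could start with something as small as this (a hypothetical sketch; `_run_once` is a private CPython detail and may change between releases):

import asyncio
import time

class InstrumentedLoop(asyncio.SelectorEventLoop):
    def __init__(self):
        super().__init__()
        self.tick_durations = []

    def _run_once(self):
        # wrap one loop iteration: the poll plus the ready callbacks
        t0 = time.monotonic()
        super()._run_once()
        self.tick_durations.append(time.monotonic() - t0)

asyncio.set_event_loop(InstrumentedLoop())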
You can easily publish a package on PyPI that provides an improved version of the asyncio event loop. You won't even need to write a lot of code, just overload a few methods.

Yury

Sent from my iPhone

> On Jan 2, 2018, at 8:00 PM, Pau Freixes wrote:
>
> Agreed, poll_start and poll_end suit much better.
>
> Thanks for the feedback.
>
> On Tue, Jan 2, 2018 at 1:34 AM, INADA Naoki wrote:
>>>>> For this proposal [4], POC, I've preferred make a reduced list of events:
>>>>>
>>>>> * `loop_start` : Executed when the loop starts for the first time.
>>>>> * `tick_start` : Executed when a new loop tick is started.
>>>>> * `io_start` : Executed when a new IO process starts.
>>>>> * `io_end` : Executed when the IO process ends.
>>>>> * `tick_end` : Executed when the loop tick ends.
>>>>> * `loop_stop` : Executed when the loop stops.
>>>>
>>>> What do you call a "IO process" in this context?
>>>
>>> Basically the call to the `select/poll/whatever` syscall that will ask
>>> for read or write to a set of file descriptors.
>>
>> The `select/poll/whatever` syscalls don't ask for read or write. They
>> wait for read or write (more accurately, they wait for a readable or
>> writable state).
>>
>> So poll_start / poll_end look like better names to me.
>>
>> INADA Naoki
>>
>>>
>>> Thanks,
>>>
>>> --
>>> --pau
>>> _______________________________________________
>>> Async-sig mailing list
>>> Async-sig at python.org
>>> https://mail.python.org/mailman/listinfo/async-sig
>>> Code of Conduct: https://www.python.org/psf/codeofconduct/
>
> --
> --pau
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

From njs at pobox.com  Thu Jan 11 05:09:29 2018
From: njs at pobox.com (Nathaniel Smith)
Date: Thu, 11 Jan 2018 02:09:29 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
Message-ID:

Hi all,

Folks here might be interested in this new blog post:

https://vorpus.org/blog/timeouts-and-cancellation-for-humans/

It's a detailed discussion of pitfalls and design-tradeoffs in APIs for timeout and cancellation, and has a proposal for handling them in a more Pythonic way. Any feedback welcome!

-n

--
Nathaniel J. Smith -- https://vorpus.org

From dimaqq at gmail.com  Thu Jan 11 22:49:56 2018
From: dimaqq at gmail.com (Dima Tisnek)
Date: Fri, 12 Jan 2018 11:49:56 +0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

Very nice read, Nathaniel.

The post left me wondering how cancel tokens interact or should logically interact with async composition, for example:

with move_on_after(10):
    await someio.gather(a(), b(), c())

or

with move_on_after(10):
    await someio.first/race(a(), b(), c())

or

dataset = someio.Future(large_download(), move_on_after=9999)

task a:
    with move_on_after(10):
        use((await dataset)["a"])

task b:
    with move_on_after(10):
        use((await dataset)["b"])

On 11 January 2018 at 18:09, Nathaniel Smith wrote:
> Hi all,
>
> Folks here might be interested in this new blog post:
>
> https://vorpus.org/blog/timeouts-and-cancellation-for-humans/
>
> It's a detailed discussion of pitfalls and design-tradeoffs in APIs
> for timeout and cancellation, and has a proposal for handling them in
> a more Pythonic way. Any feedback welcome!
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

From chris.jerdonek at gmail.com  Fri Jan 12 07:17:51 2018
From: chris.jerdonek at gmail.com (Chris Jerdonek)
Date: Fri, 12 Jan 2018 04:17:51 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

Thanks, Nathaniel. Very instructive, thought-provoking write-up!

One thing occurred to me around the time of reading this passage:

> "Once the cancel token is triggered, then all future operations on that token are cancelled, so the call to ws.close doesn't get stuck. It's a less error-prone paradigm. ... If you follow the path we did in this blog post, and start by thinking about applying a timeout to a complex operation composed out of multiple blocking calls, then it's obvious that if the first call uses up the whole timeout budget, then any future calls should fail immediately."

One case where it's not clear how it should be addressed is the following. It's something I've wrestled with in the context of asyncio, and it doesn't seem to be raised as a possibility in your write-up.

Say you have a complex operation that you want to be able to time out or cancel, but the process of cleanup / cancelling might also require a certain amount of time that you'd want to allow for (likely a smaller time in normal circumstances). Then it seems like you'd want to be able to allocate a separate timeout for the clean-up portion (independent of the timeout allotted for the original operation).

It's not clear to me how this case would best be handled with the primitives you described. In your text above ("then any future calls should fail immediately"), without any changes, it seems there wouldn't be "time" for any clean-up to complete.

With asyncio, one way to handle this is to await on a task with a smaller timeout after calling task.cancel(). That lets you assign a different timeout to waiting for cancellation to complete.

--Chris

On Thu, Jan 11, 2018 at 2:09 AM, Nathaniel Smith wrote:
> Hi all,
>
> Folks here might be interested in this new blog post:
>
> https://vorpus.org/blog/timeouts-and-cancellation-for-humans/
>
> It's a detailed discussion of pitfalls and design-tradeoffs in APIs
> for timeout and cancellation, and has a proposal for handling them in
> a more Pythonic way. Any feedback welcome!
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

From njs at pobox.com  Sat Jan 13 05:32:49 2018
From: njs at pobox.com (Nathaniel Smith)
Date: Sat, 13 Jan 2018 02:32:49 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

On Thu, Jan 11, 2018 at 7:49 PM, Dima Tisnek wrote:
> Very nice read, Nathaniel.
>
> The post left me wondering how cancel tokens interact or should logically interact with async composition, for example:
>
> with move_on_after(10):
>     await someio.gather(a(), b(), c())
>
> or
>
> with move_on_after(10):
>     await someio.first/race(a(), b(), c())
>
> or
>
> dataset = someio.Future(large_download(), move_on_after=9999)
>
> task a:
>     with move_on_after(10):
>         use((await dataset)["a"])
>
> task b:
>     with move_on_after(10):
>         use((await dataset)["b"])

It's funny you say "async composition"... Trio's concurrency primitive (nurseries) is closely related to the core concurrency primitive in Communicating Sequential Processes, which they call "parallel composition". (Basically, if P and Q are processes, then "P || Q" is the process that runs both P and Q in parallel and then finishes when they've both finished.) If you were using that as your primitive, then tasks would form an orderly tree and this wouldn't be a problem :-).
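In trio terms, Dima's first example becomes a nursery nested inside a cancel scope, and the timeout applies to the whole subtree as one unit (a runnable sketch; the workers are placeholders):

import trio

async def worker(name):
    await trio.sleep(1)  # stand-in for real work

async def main():
    with trio.move_on_after(10):
        # parallel composition: the nursery block finishes only when all
        # three children have finished, and cancelling the surrounding
        # scope cancels the whole subtree at once
        async with trio.open_nursery() as nursery:
            for name in ("a", "b", "c"):
                nursery.start_soon(worker, name)

trio.run(main)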
In your text above ("then any future calls > should fail immediately"), without any changes, it seems there > wouldn't be "time" for any clean-up to complete. > > With asyncio, one way to handle this is to await on a task with a > smaller timeout after calling task.cancel(). That lets you assign a > different timeout to waiting for cancellation to complete. You can get these semantics using the "shielding" feature, which the post discusses a bit later: try: await do_some_stuff() finally: # Always give this 30 seconds to clean up, even if we've # been cancelled with trio.move_on_after(30) as cscope: cscope.shield = True await do_cleanup() Here the inner scope "hides" the code inside it from any external cancel scopes, so it can continue executing even of the overall context has been cancelled. However, I think this is probably a code smell. Like all code smells, there are probably cases where it's the right thing to do, but when you see it you should stop and think carefully. If you're writing code like this, then it means that there are multiple different layers in your code that are implementing timeout policies, that might end up fighting with each other. What if the caller really needs this to finish in 15 seconds? So if you have some way to move the timeout handling into the same layer, then I suspect that will make your program easier to understand and maintain. OTOH, if you decide you want it, the code above works :-). I'm not 100% sure here; I'd definitely be interested to hear about more use cases. One thing I've thought about that might help is adding a kind of "soft cancelled" state to the cancel scopes, inspired by the "graceful shutdown" mode that you'll often see in servers where you stop accepting new connections, then try to finish up old ones (with some time limit). So in this case you might mark 'do_some_stuff()' as being cancelled immediately when we entered the 'soft cancel' phase, but let the 'do_cleanup' code keep running until the grace period expired and the region was hard-cancelled. This idea isn't fully baked yet though. (There's some more mumbling about this at https://github.com/python-trio/trio/issues/147.) -n -- Nathaniel J. Smith -- https://vorpus.org From chris.jerdonek at gmail.com Sun Jan 14 08:11:51 2018 From: chris.jerdonek at gmail.com (Chris Jerdonek) Date: Sun, 14 Jan 2018 05:11:51 -0800 Subject: [Async-sig] Blog post: Timeouts and cancellation for humans In-Reply-To: References: Message-ID: On Sun, Jan 14, 2018 at 3:33 AM, Nathaniel Smith wrote: > On Fri, Jan 12, 2018 at 4:17 AM, Chris Jerdonek > wrote: >> Say you have a complex operation that you want to be able to timeout >> or cancel, but the process of cleanup / cancelling might also require >> a certain amount of time that you'd want to allow time for (likely a >> smaller time in normal circumstances). Then it seems like you'd want >> to be able to allocate a separate timeout for the clean-up portion >> (independent of the timeout allotted for the original operation). >> ... > > You can get these semantics using the "shielding" feature, which the > post discusses a bit later: > ... > However, I think this is probably a code smell. I agree with this assessment. My sense was that shielding could probably do it, but it seems like it could be brittle or more of a kludge. It would be nice if the same primitive could be used to accommodate this and other variations in addition to the normal case. 
For example, a related variation might be if you wanted to let yourself extend the timeout in response to certain actions or results.

The main idea that occurs to me is letting the cancel scope be dynamic: the timeout could be allowed to change in response to certain things. Something like that seems like it has the potential to be both simple and general enough to accommodate lots of different scenarios, including adjusting the timeout in response to entering a clean-up phase. One good test would be whether shielding could be implemented using such a primitive.

--Chris

> Like all code smells, there are probably cases where it's the right thing to do, but when you see it you should stop and think carefully. If you're writing code like this, then it means that there are multiple different layers in your code that are implementing timeout policies, that might end up fighting with each other. What if the caller really needs this to finish in 15 seconds? So if you have some way to move the timeout handling into the same layer, then I suspect that will make your program easier to understand and maintain. OTOH, if you decide you want it, the code above works :-). I'm not 100% sure here; I'd definitely be interested to hear about more use cases.
>
> One thing I've thought about that might help is adding a kind of "soft cancelled" state to the cancel scopes, inspired by the "graceful shutdown" mode that you'll often see in servers where you stop accepting new connections, then try to finish up old ones (with some time limit). So in this case you might mark 'do_some_stuff()' as being cancelled immediately when we entered the 'soft cancel' phase, but let the 'do_cleanup' code keep running until the grace period expired and the region was hard-cancelled. This idea isn't fully baked yet though.
> (There's some more mumbling about this at
> https://github.com/python-trio/trio/issues/147.)
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org

From nbadger1 at gmail.com  Sun Jan 14 17:45:28 2018
From: nbadger1 at gmail.com (Nick Badger)
Date: Sun, 14 Jan 2018 14:45:28 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

> However, I think this is probably a code smell. Like all code smells,
> there are probably cases where it's the right thing to do, but when
> you see it you should stop and think carefully.

Huh. That's a really good point. But I'm not sure the source of the smell is the code that needs the shield logic -- I think this might instead be indicative of upstream code smell. Put a bit more concretely: if you're writing a protocol for an unreliable network (and of course, every network is unreliable), requiring a closure operation to transmit something over that network is inherently problematic, because it inevitably leads to multiple-stage timeouts or ungraceful shutdowns.

Clearly, changing anything upstream is out of scope here. So if the smell is, in fact, "upwind", there's not really much you could do about that in asyncio, Curio, Trio, etc, other than minimize the additional smell you need to accommodate smelly protocols. Unfortunately, I'm not sure there's any one approach to that problem that isn't application-specific.

Nick Badger
https://www.nickbadger.com

2018-01-14 3:33 GMT-08:00 Nathaniel Smith :

> On Fri, Jan 12, 2018 at 4:17 AM, Chris Jerdonek wrote:
> > Thanks, Nathaniel. Very instructive, thought-provoking write-up!
> >
> > One thing occurred to me around the time of reading this passage:
> >
> >> "Once the cancel token is triggered, then all future operations on that token are cancelled, so the call to ws.close doesn't get stuck. It's a less error-prone paradigm. ... If you follow the path we did in this blog post, and start by thinking about applying a timeout to a complex operation composed out of multiple blocking calls, then it's obvious that if the first call uses up the whole timeout budget, then any future calls should fail immediately."
> >
> > One case where it's not clear how it should be addressed is the following. It's something I've wrestled with in the context of asyncio, and it doesn't seem to be raised as a possibility in your write-up.
> >
> > Say you have a complex operation that you want to be able to time out or cancel, but the process of cleanup / cancelling might also require a certain amount of time that you'd want to allow for (likely a smaller time in normal circumstances). Then it seems like you'd want to be able to allocate a separate timeout for the clean-up portion (independent of the timeout allotted for the original operation).
> >
> > It's not clear to me how this case would best be handled with the primitives you described. In your text above ("then any future calls should fail immediately"), without any changes, it seems there wouldn't be "time" for any clean-up to complete.
> >
> > With asyncio, one way to handle this is to await on a task with a smaller timeout after calling task.cancel(). That lets you assign a different timeout to waiting for cancellation to complete.
>
> You can get these semantics using the "shielding" feature, which the post discusses a bit later:
>
> try:
>     await do_some_stuff()
> finally:
>     # Always give this 30 seconds to clean up, even if we've
>     # been cancelled
>     with trio.move_on_after(30) as cscope:
>         cscope.shield = True
>         await do_cleanup()
>
> Here the inner scope "hides" the code inside it from any external cancel scopes, so it can continue executing even if the overall context has been cancelled.
>
> However, I think this is probably a code smell. Like all code smells, there are probably cases where it's the right thing to do, but when you see it you should stop and think carefully. If you're writing code like this, then it means that there are multiple different layers in your code that are implementing timeout policies, that might end up fighting with each other. What if the caller really needs this to finish in 15 seconds? So if you have some way to move the timeout handling into the same layer, then I suspect that will make your program easier to understand and maintain. OTOH, if you decide you want it, the code above works :-). I'm not 100% sure here; I'd definitely be interested to hear about more use cases.
>
> One thing I've thought about that might help is adding a kind of "soft cancelled" state to the cancel scopes, inspired by the "graceful shutdown" mode that you'll often see in servers where you stop accepting new connections, then try to finish up old ones (with some time limit). So in this case you might mark 'do_some_stuff()' as being cancelled immediately when we entered the 'soft cancel' phase, but let the 'do_cleanup' code keep running until the grace period expired and the region was hard-cancelled. This idea isn't fully baked yet though.
> (There's some more mumbling about this at
> https://github.com/python-trio/trio/issues/147.)
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

From dimaqq at gmail.com  Sun Jan 14 21:33:20 2018
From: dimaqq at gmail.com (Dima Tisnek)
Date: Mon, 15 Jan 2018 10:33:20 +0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

I suppose the websocket case ought to follow conventions similar to the kernel TCP API, where `close` returns immediately but continues to send packets behind the scenes. It could look something like this:

with move_on_after(10):
    await get_ws_message(url)

async def get_ws_message(url):
    async def close():
        if sock and sock.is_connected and ...:
            await sock.send(build_close_packet())
            await sock.recv()  # or something
        if sock:
            sock.close()

    sock = socket.socket()
    try:
        await sock.connect(url)
        data = sock.recv(...)
        return decode(data)
    finally:
        with move_on_after(30):
            someio.spawn_task(close())

I believe the concern is more general than supporting "broken" protocols like websocket. When someone writes `with move_on_after(N): a = await foo()` it can be understood in two ways:

* perform foo for N seconds or else, or
* I want the result in N seconds or else

The latter doesn't imply that foo should be interrupted, only that the caller wishes to proceed without the result. It makes sense if the action involves an unrelated, long-running process, where `foo()` is something like `anext(some_async_generator)`.

Both solve the original concern, that the caller should not block for more than N. I suppose one can be implemented in terms of the other.
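For what it's worth, the second reading is already expressible in today's asyncio, precisely because `asyncio.wait` gives up waiting without cancelling anything (a sketch; `foo` and `use` are placeholders):

import asyncio

async def proceed_without_result():
    task = asyncio.ensure_future(foo())   # foo() keeps running either way
    done, pending = await asyncio.wait({task}, timeout=10)
    if task in done:
        use(task.result())
    # otherwise: carry on without the result; the task is not interrupted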
Perhaps the latter is what `shield` should do? That is, detach the computation as opposed to blocking the caller past the caller's deadline?

What do you all think?

On Mon, 15 Jan 2018 at 6:45 AM, Nick Badger wrote:

>> However, I think this is probably a code smell. Like all code smells, there are probably cases where it's the right thing to do, but when you see it you should stop and think carefully.
>
> Huh. That's a really good point. But I'm not sure the source of the smell is the code that needs the shield logic -- I think this might instead be indicative of upstream code smell. Put a bit more concretely: if you're writing a protocol for an unreliable network (and of course, every network is unreliable), requiring a closure operation to transmit something over that network is inherently problematic, because it inevitably leads to multiple-stage timeouts or ungraceful shutdowns.
>
> Clearly, changing anything upstream is out of scope here. So if the smell is, in fact, "upwind", there's not really much you could do about that in asyncio, Curio, Trio, etc, other than minimize the additional smell you need to accommodate smelly protocols. Unfortunately, I'm not sure there's any one approach to that problem that isn't application-specific.
>
> Nick Badger
> https://www.nickbadger.com
>
> 2018-01-14 3:33 GMT-08:00 Nathaniel Smith :
>
>> On Fri, Jan 12, 2018 at 4:17 AM, Chris Jerdonek wrote:
>> > Thanks, Nathaniel. Very instructive, thought-provoking write-up!
>> >
>> > One thing occurred to me around the time of reading this passage:
>> >
>> >> "Once the cancel token is triggered, then all future operations on that token are cancelled, so the call to ws.close doesn't get stuck. It's a less error-prone paradigm. ... If you follow the path we did in this blog post, and start by thinking about applying a timeout to a complex operation composed out of multiple blocking calls, then it's obvious that if the first call uses up the whole timeout budget, then any future calls should fail immediately."
>> >
>> > One case where it's not clear how it should be addressed is the following. It's something I've wrestled with in the context of asyncio, and it doesn't seem to be raised as a possibility in your write-up.
>> >
>> > Say you have a complex operation that you want to be able to time out or cancel, but the process of cleanup / cancelling might also require a certain amount of time that you'd want to allow for (likely a smaller time in normal circumstances). Then it seems like you'd want to be able to allocate a separate timeout for the clean-up portion (independent of the timeout allotted for the original operation).
>> >
>> > It's not clear to me how this case would best be handled with the primitives you described. In your text above ("then any future calls should fail immediately"), without any changes, it seems there wouldn't be "time" for any clean-up to complete.
>> >
>> > With asyncio, one way to handle this is to await on a task with a smaller timeout after calling task.cancel(). That lets you assign a different timeout to waiting for cancellation to complete.
>>
>> You can get these semantics using the "shielding" feature, which the post discusses a bit later:
>>
>> try:
>>     await do_some_stuff()
>> finally:
>>     # Always give this 30 seconds to clean up, even if we've
>>     # been cancelled
>>     with trio.move_on_after(30) as cscope:
>>         cscope.shield = True
>>         await do_cleanup()
>>
>> Here the inner scope "hides" the code inside it from any external cancel scopes, so it can continue executing even if the overall context has been cancelled.
>>
>> However, I think this is probably a code smell. Like all code smells, there are probably cases where it's the right thing to do, but when you see it you should stop and think carefully. If you're writing code like this, then it means that there are multiple different layers in your code that are implementing timeout policies, that might end up fighting with each other. What if the caller really needs this to finish in 15 seconds? So if you have some way to move the timeout handling into the same layer, then I suspect that will make your program easier to understand and maintain. OTOH, if you decide you want it, the code above works :-). I'm not 100% sure here; I'd definitely be interested to hear about more use cases.
>>
>> One thing I've thought about that might help is adding a kind of "soft cancelled" state to the cancel scopes, inspired by the "graceful shutdown" mode that you'll often see in servers where you stop accepting new connections, then try to finish up old ones (with some time limit). So in this case you might mark 'do_some_stuff()' as being cancelled immediately when we entered the 'soft cancel' phase, but let the 'do_cleanup' code keep running until the grace period expired and the region was hard-cancelled. This idea isn't fully baked yet though.
>> (There's some more mumbling about this at
>> https://github.com/python-trio/trio/issues/147.)
>>
>> -n
>>
>> --
>> Nathaniel J. Smith -- https://vorpus.org
>> _______________________________________________
>> Async-sig mailing list
>> Async-sig at python.org
>> https://mail.python.org/mailman/listinfo/async-sig
>> Code of Conduct: https://www.python.org/psf/codeofconduct/
>
> _______________________________________________
> Async-sig mailing list
> Async-sig at python.org
> https://mail.python.org/mailman/listinfo/async-sig
> Code of Conduct: https://www.python.org/psf/codeofconduct/

From njs at pobox.com  Sun Jan 14 22:10:01 2018
From: njs at pobox.com (Nathaniel Smith)
Date: Sun, 14 Jan 2018 19:10:01 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

On Sun, Jan 14, 2018 at 5:11 AM, Chris Jerdonek wrote:
> On Sun, Jan 14, 2018 at 3:33 AM, Nathaniel Smith wrote:
>> On Fri, Jan 12, 2018 at 4:17 AM, Chris Jerdonek wrote:
>>> Say you have a complex operation that you want to be able to time out
>>> or cancel, but the process of cleanup / cancelling might also require
>>> a certain amount of time that you'd want to allow for (likely a
>>> smaller time in normal circumstances). Then it seems like you'd want
>>> to be able to allocate a separate timeout for the clean-up portion
>>> (independent of the timeout allotted for the original operation).
>>> ...
>>
>> You can get these semantics using the "shielding" feature, which the
>> post discusses a bit later:
>> ...
>> However, I think this is probably a code smell.
>
> I agree with this assessment. My sense was that shielding could probably do it, but it seems like it could be brittle or more of a kludge. It would be nice if the same primitive could be used to accommodate this and other variations in addition to the normal case. For example, a related variation might be if you wanted to let yourself extend the timeout in response to certain actions or results.
>
> The main idea that occurs to me is letting the cancel scope be dynamic: the timeout could be allowed to change in response to certain things. Something like that seems like it has the potential to be both simple and general enough to accommodate lots of different scenarios, including adjusting the timeout in response to entering a clean-up phase. One good test would be whether shielding could be implemented using such a primitive.

Ah, if you want to change the timeout on a specific cancel scope, that's easy:

async def do_something():
    with move_on_after(10) as cscope:
        ...
        # Actually, let's give ourselves a bit more time
        cscope.deadline += 10
        ...

If you have a reference to a Trio cancel scope, you can change its timeout at any time. However, this is different from shielding. The code above only changes the deadline for that particular cancel scope. If the caller sets their own timeout:

with move_on_after(15):
    await do_something()

then the code will still get cancelled after 15 seconds when the outer cancel scope's deadline expires, even though the inner scope ended up with a 20 second timeout. Shielding is about disabling outer cancel scopes -- the ones you don't know about! -- in a particular bit of code. (If you compare to C#'s cancellation sources or Golang's context-based cancellation, it's like writing a function that intentionally chooses not to pass through the cancel token it was given into some function it calls.)
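Concretely, the interaction looks like this (runnable against trio as it exists today):

import trio

async def do_something():
    with trio.move_on_after(10) as cscope:
        cscope.deadline += 10   # the inner scope now allows 20 seconds...
        await trio.sleep(60)

async def main():
    with trio.move_on_after(15):
        await do_something()    # ...but this still unblocks at 15 seconds:
                                # the outer scope's deadline is untouched

trio.run(main)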
-n

--
Nathaniel J. Smith -- https://vorpus.org

From njs at pobox.com  Sun Jan 14 23:52:19 2018
From: njs at pobox.com (Nathaniel Smith)
Date: Sun, 14 Jan 2018 20:52:19 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

On Sun, Jan 14, 2018 at 2:45 PM, Nick Badger wrote:
>> However, I think this is probably a code smell. Like all code smells,
>> there are probably cases where it's the right thing to do, but when
>> you see it you should stop and think carefully.
>
> Huh. That's a really good point. But I'm not sure the source of the smell is the code that needs the shield logic -- I think this might instead be indicative of upstream code smell. Put a bit more concretely: if you're writing a protocol for an unreliable network (and of course, every network is unreliable), requiring a closure operation to transmit something over that network is inherently problematic, because it inevitably leads to multiple-stage timeouts or ungraceful shutdowns.

I wouldn't go that far -- there are actually good reasons to design protocols like this.

SSL/TLS is a protocol that has a "goodbye" message (they call it "close-notify"). According to the spec [1], sending this is mandatory if you want to cleanly shut down an SSL/TLS connection. Why? Well, say I send you a message, "Should I buy more bitcoin?" and your reply is "Yes, but only if the price drops below $XX". Unbeknownst to us, we're being MITMed. Fortunately, we used SSL/TLS, so the MITM can't alter what we're saying. But they can manipulate the network; for example, they could cause our connection to drop after the first 3 bytes of your message, so your answer gets truncated and I think you just said "Yes" -- which is very different! But close-notify saves us -- or at least contains the damage. Since I know that you're supposed to send a close-notify at the end of your connection, and I didn't get one, I can tell that this is a truncated message. I can't tell what the rest was going to be, but at least I know the message I got isn't the message you intended to send. And an attacker can't forge a close-notify message, because they're cryptographically authenticated like all the data we send.

In websockets, the goodbye handshake is used to work around a nasty case that can happen with common TCP stacks (like, all of them):

1. A sends a message to B.
2. A is done after that, so it closes the connection.
3. Just then, B sends a message to A, like maybe a regular ping on some timer.
4. A's TCP stack receives data on a closed connection, goes "huh wut?", and sends an RST packet.
5. B goes to read the last message A sent before they closed the connection... but whoops, it's gone! The RST packet caused both TCP stacks to wipe out all their buffered data associated with this connection.

So if you have a protocol that's used for streaming indefinite amounts of data in both directions and supports stuff like pings, you kind of have to have a goodbye handshake to avoid TCP stacks accidentally corrupting your data. (The goodbye handshake can also help make sure that clients end up carrying CLOSE-WAIT states instead of servers, but that's a finicky and less important issue.)
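In code, the pattern a goodbye handshake enables looks roughly like this (a sketch with a hypothetical `ws` object, not any real library's API):

async def close_gracefully(ws):
    await ws.send_close()         # tell the peer we're done sending
    while True:
        msg = await ws.receive()  # keep draining until the peer acknowledges,
        if msg.type == "close":   # so nothing we sent can be wiped by an RST
            break
    ws.transport.close()          # only now is a bare TCP close safe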
Of course, it is absolutely true that networks are unreliable, so when your protocol specifies a goodbye handshake like this then implementations still need to have some way to cope if their peer closes the connection unexpectedly, and they may need to unilaterally close the connection in some circumstances no matter what the spec says. Correctly handling every possible case here quickly becomes, like, infinitely complicated. But nonetheless, as a library author one has to try to provide some reasonable behavior by default (while knowing that some users will end up needing to tweak things to handle special circumstances).

My tentative approach so far in Trio is (a) make cancellation stateful, as discussed in the blog post, because accidentally hanging forever just can't be a good default, and (b) in the "trio.abc.AsyncResource" interface that complex objects like trio.SSLStream implement (and we recommend libraries implement too), the semantics for the aclose and __aexit__ methods are that they're allowed to block forever trying to do a graceful shutdown, but if cancelled then they have to return promptly *while still freeing any underlying resources*, possibly in a non-graceful way. So if you write straightforward code like:

with trio.move_on_after(10):
    async with open_websocket_connection(...):
        ...

then it tries to do a proper websocket goodbye handshake by default, but if the timeout expires then it gives up and immediately closes the socket. It's not perfect, but it seems like a better default than anything else I can think of.

-n

[1] There's also this whole mess where many SSL/TLS implementations ignore the spec and don't bother sending close-notify. This is *kinda* justifiable because the original and most popular use for SSL/TLS is for wrapping HTTP connections, and HTTP has its own ways of signaling the end of the connection that are already transmitted through the encrypted tunnel, so the SSL/TLS end-of-connection handshake is redundant. Therefore lots of implementations went ahead and ignored the spec (including Python's ssl module!), so now if you're implementing HTTPS you have to do the same for interoperability. But the SSL/TLS spec can't assume you're using HTTP on top: its contract is basically "socket semantics, but cryptographically authenticated". And close() is part of socket semantics, so it kind of has to make close() cryptographically authenticated too. (trio.SSLStream handles this by implementing the standard-compliant behavior by default, but you can pass https_compatible=True to the constructor to get the HTTPS-style behavior.)

--
Nathaniel J. Smith -- https://vorpus.org

From nbadger1 at gmail.com  Mon Jan 15 01:08:26 2018
From: nbadger1 at gmail.com (Nick Badger)
Date: Sun, 14 Jan 2018 22:08:26 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

Quick preface: there are definitely times when code "smell" really isn't -- nothing's perfect! -- and sometimes some system component is unavoidably inelegant. I think this is oftentimes (but not always) the result of scoping: clearly I couldn't decide, as a library author, that "it's all just broken" and rip out everything from OS to TCP to language syntax and semantics just to make my API prettier. So I pragmatically downscope the problem space, and it forces me to make design decisions to accommodate the rest of the universe. And that's okay!

With that being said, I'm still not convinced that the double-timeout-shutdown isn't an indication of upstream code smell.
From a practical standpoint, for the purposes of this discussion it really doesn't matter; Trio et al can't go mucking about in the TCP stack internals, so we do the best we can. But I'm willing to entertain the possibility (actually I think it's highly likely) that there are better solutions to the aforementioned problems than the ones used by (for example) TCP and TLS. But that rabbit hole goes very, very deep, so to circle back, what I'm trying to say is this:

- I share the inclination that shielding against cancellation (or any equivalent workaround) is likely code smell
- However, I personally suspect the source of that smell is upstream, in the network protocols themselves
- Given that, I think some amount of smell in downstream libraries like Trio is unavoidable

To that end, I really like Trio's existing approach. Shielding should definitely be used sparingly, but I think it's a justifiable, pragmatic compromise when it comes to dealing with not-quite-perfect protocols on even-less-perfect networks. And I think the connection close semantics Trio provides for these situations -- attempt to close gracefully, but if cancelled, still close unilaterally to free local resources -- is an excellent approach. But it also "lucks out" a bit, because freeing local resources is many orders of magnitude faster than the enclosing timeout is likely to be, so it's effectively a "free" operation. The relative timescales are a critical observation; if freeing local resources took one second out of a ten-second timeout, I think you'd be stuck asking the same question there, too.

Nick Badger
https://www.nickbadger.com

2018-01-14 20:52 GMT-08:00 Nathaniel Smith :

> On Sun, Jan 14, 2018 at 2:45 PM, Nick Badger wrote:
> >> However, I think this is probably a code smell. Like all code smells,
> >> there are probably cases where it's the right thing to do, but when
> >> you see it you should stop and think carefully.
> >
> > Huh. That's a really good point. But I'm not sure the source of the smell is the code that needs the shield logic -- I think this might instead be indicative of upstream code smell. Put a bit more concretely: if you're writing a protocol for an unreliable network (and of course, every network is unreliable), requiring a closure operation to transmit something over that network is inherently problematic, because it inevitably leads to multiple-stage timeouts or ungraceful shutdowns.
>
> I wouldn't go that far -- there are actually good reasons to design protocols like this.
>
> SSL/TLS is a protocol that has a "goodbye" message (they call it "close-notify"). According to the spec [1], sending this is mandatory if you want to cleanly shut down an SSL/TLS connection. Why? Well, say I send you a message, "Should I buy more bitcoin?" and your reply is "Yes, but only if the price drops below $XX". Unbeknownst to us, we're being MITMed. Fortunately, we used SSL/TLS, so the MITM can't alter what we're saying. But they can manipulate the network; for example, they could cause our connection to drop after the first 3 bytes of your message, so your answer gets truncated and I think you just said "Yes" -- which is very different! But close-notify saves us -- or at least contains the damage. Since I know that you're supposed to send a close-notify at the end of your connection, and I didn't get one, I can tell that this is a truncated message.
> I can't tell what the rest was going to be, but at least I know the message I got isn't the message you intended to send. And an attacker can't forge a close-notify message, because they're cryptographically authenticated like all the data we send.
>
> In websockets, the goodbye handshake is used to work around a nasty case that can happen with common TCP stacks (like, all of them):
>
> 1. A sends a message to B.
> 2. A is done after that, so it closes the connection.
> 3. Just then, B sends a message to A, like maybe a regular ping on some timer.
> 4. A's TCP stack receives data on a closed connection, goes "huh wut?", and sends an RST packet.
> 5. B goes to read the last message A sent before they closed the connection... but whoops, it's gone! The RST packet caused both TCP stacks to wipe out all their buffered data associated with this connection.
>
> So if you have a protocol that's used for streaming indefinite amounts of data in both directions and supports stuff like pings, you kind of have to have a goodbye handshake to avoid TCP stacks accidentally corrupting your data. (The goodbye handshake can also help make sure that clients end up carrying CLOSE-WAIT states instead of servers, but that's a finicky and less important issue.)
>
> Of course, it is absolutely true that networks are unreliable, so when your protocol specifies a goodbye handshake like this then implementations still need to have some way to cope if their peer closes the connection unexpectedly, and they may need to unilaterally close the connection in some circumstances no matter what the spec says. Correctly handling every possible case here quickly becomes, like, infinitely complicated. But nonetheless, as a library author one has to try to provide some reasonable behavior by default (while knowing that some users will end up needing to tweak things to handle special circumstances).
>
> My tentative approach so far in Trio is (a) make cancellation stateful, as discussed in the blog post, because accidentally hanging forever just can't be a good default, and (b) in the "trio.abc.AsyncResource" interface that complex objects like trio.SSLStream implement (and we recommend libraries implement too), the semantics for the aclose and __aexit__ methods are that they're allowed to block forever trying to do a graceful shutdown, but if cancelled then they have to return promptly *while still freeing any underlying resources*, possibly in a non-graceful way. So if you write straightforward code like:
>
> with trio.move_on_after(10):
>     async with open_websocket_connection(...):
>         ...
>
> then it tries to do a proper websocket goodbye handshake by default, but if the timeout expires then it gives up and immediately closes the socket. It's not perfect, but it seems like a better default than anything else I can think of.
>
> -n
>
> [1] There's also this whole mess where many SSL/TLS implementations ignore the spec and don't bother sending close-notify. This is *kinda* justifiable because the original and most popular use for SSL/TLS is for wrapping HTTP connections, and HTTP has its own ways of signaling the end of the connection that are already transmitted through the encrypted tunnel, so the SSL/TLS end-of-connection handshake is redundant. Therefore lots of implementations went ahead and ignored the spec (including Python's ssl module!), so now if you're implementing HTTPS you have to do the same for interoperability.
> But the SSL/TLS spec can't assume you're using HTTP on top: its contract is basically "socket semantics, but cryptographically authenticated". And close() is part of socket semantics, so it kind of has to make close() cryptographically authenticated too. (trio.SSLStream handles this by implementing the standard-compliant behavior by default, but you can pass https_compatible=True to the constructor to get the HTTPS-style behavior.)
>
> --
> Nathaniel J. Smith -- https://vorpus.org

From solipsis at pitrou.net  Mon Jan 15 19:41:36 2018
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 16 Jan 2018 01:41:36 +0100
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
References:
Message-ID: <20180116014136.1985193e@fsol>

Hi,

On Thu, 11 Jan 2018 02:09:29 -0800
Nathaniel Smith wrote:
> Hi all,
>
> Folks here might be interested in this new blog post:
>
> https://vorpus.org/blog/timeouts-and-cancellation-for-humans/
>
> It's a detailed discussion of pitfalls and design-tradeoffs in APIs
> for timeout and cancellation, and has a proposal for handling them in
> a more Pythonic way. Any feedback welcome!

I have little constructive feedback to share, other than that it is a very insightful write-up and the API proposal there is quite interesting.

cheers,

Antoine.

From njs at pobox.com  Tue Jan 16 04:56:33 2018
From: njs at pobox.com (Nathaniel Smith)
Date: Tue, 16 Jan 2018 01:56:33 -0800
Subject: [Async-sig] Blog post: Timeouts and cancellation for humans
In-Reply-To:
References:
Message-ID:

On Sun, Jan 14, 2018 at 6:33 PM, Dima Tisnek wrote:
> Perhaps the latter is what `shield` should do? That is, detach the
> computation as opposed to blocking the caller past the caller's deadline?

Well, it can't do that in trio :-). One of trio's core design principles is: no detached processes. And even if you don't think detached processes are inherently a bad idea, I don't think they're what you'd want in this case anyway. If your socket shutdown code has frozen, you want to kill it and close the socket, not move it into the background where it can hang around indefinitely wasting resources.
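To illustrate, "no detached processes" is a property the nursery construct enforces by itself (a minimal sketch using trio's actual API, with a sleep standing in for real work):

import trio

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.sleep, 5)  # every task needs a parent nursery
    # execution only reaches this point once every child has finished or
    # been cancelled -- no task can be left running "detached"

trio.run(main)

-n

--
Nathaniel J. Smith -- https://vorpus.org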