[Python-ideas] Deterministic iterator cleanup

Neil Girdhar mistersheik at gmail.com
Fri Oct 28 03:11:10 EDT 2016



On Tuesday, October 25, 2016 at 6:26:17 PM UTC-4, Nathaniel Smith wrote:
>
> On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncog... at gmail.com> wrote: 
> > On 20 October 2016 at 07:02, Nathaniel Smith <n... at pobox.com> wrote: 
> >> The first change is to replace the outer for loop with a while/pop 
> >> loop, so that if an exception occurs we'll know which iterables remain 
> >> to be processed: 
> >> 
> >> def chain(*iterables): 
> >>     iterables = list(iterables)  # *args arrive as a tuple; pop() needs a list 
> >>     try: 
> >>         while iterables: 
> >>             for element in iterables.pop(0): 
> >>                 yield element 
> >>     ... 
> >> 
> >> Now, what do we do if an exception does occur? We need to call 
> >> iterclose on all of the remaining iterables, but the tricky bit is 
> >> that this might itself raise new exceptions. If this happens, we don't 
> >> want to abort early; instead, we want to continue until we've closed 
> >> all the iterables, and then raise a chained exception. Basically what 
> >> we want is: 
> >> 
> >> def chain(*iterables): 
> >>     iterables = list(iterables) 
> >>     try: 
> >>         while iterables: 
> >>             for element in iterables.pop(0): 
> >>                 yield element 
> >>     finally: 
> >>         try: 
> >>             operators.iterclose(iter(iterables[0])) 
> >>         finally: 
> >>             try: 
> >>                 operators.iterclose(iter(iterables[1])) 
> >>             finally: 
> >>                 try: 
> >>                     operators.iterclose(iter(iterables[2])) 
> >>                 finally: 
> >>                     ... 
> >> 
> >> but of course that's not valid syntax. Fortunately, it's not too hard 
> >> to rewrite that into real Python -- but it's a little dense: 
> >> 
> >> def chain(*iterables): 
> >>     iterables = list(iterables) 
> >>     try: 
> >>         while iterables: 
> >>             for element in iterables.pop(0): 
> >>                 yield element 
> >>     # This is equivalent to the nested-finally chain above: 
> >>     except BaseException as last_exc: 
> >>         for iterable in iterables: 
> >>             try: 
> >>                 operators.iterclose(iter(iterable)) 
> >>             except BaseException as new_exc: 
> >>                 if new_exc.__context__ is None: 
> >>                     new_exc.__context__ = last_exc 
> >>                 last_exc = new_exc 
> >>         raise last_exc 
> >> 
> >> It's probably worth wrapping that bottom part into an iterclose_all() 
> >> helper, since the pattern probably occurs in other cases as well. 
> >> (Actually, now that I think about it, the map() example in the text 
> >> should be doing this instead of what it's currently doing... I'll fix 
> >> that.) 
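
For concreteness, here is a minimal sketch of what such an
iterclose_all() helper could look like, written against the proposal's
hypothetical operators.iterclose() (neither name exists in today's
stdlib):

    def iterclose_all(iterables):
        # Close every iterable even if some of the closes raise,
        # chaining the exceptions so that none are silently lost.
        last_exc = None
        for iterable in iterables:
            try:
                operators.iterclose(iter(iterable))
            except BaseException as new_exc:
                if new_exc.__context__ is None:
                    new_exc.__context__ = last_exc
                last_exc = new_exc
        if last_exc is not None:
            raise last_exc
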
> > 
> > At this point your code is starting to look a whole lot like the code 
> > in contextlib.ExitStack.__exit__ :) 
>
> One of the versions I tried but didn't include in my email used 
> ExitStack :-). It turns out not to work here: the problem is that we 
> effectively need to enter *all* the contexts before unwinding, even if 
> trying to enter one of them fails. ExitStack is nested like (try (try 
> (try ... finally) finally) finally), and we need (try finally (try 
> finally (try finally ...))). But this is just a small side point 
> anyway, since most code is not implementing complicated 
> meta-iterators; I'll address your real proposal below. 
>
> > Accordingly, I'm going to suggest that while I agree the problem you 
> > describe is one that genuinely emerges in large production 
> > applications and other complex systems, this particular solution is 
> > simply far too intrusive to be accepted as a language change for 
> > Python - you're talking about a fundamental change to the meaning of 
> > iteration for the sake of the relatively small portion of the 
> > community that either work on such complex services, or insist on 
> > writing their code as if it might become part of such a service, even 
> > when it currently isn't. Given that simple applications vastly 
> > outnumber complex ones, and always will, I think making such a change 
> > would be a bad trade-off that didn't come close to justifying the 
> > costs imposed on the rest of the ecosystem to adjust to it. 
> > 
> > A potentially more fruitful direction of research to pursue for 3.7 
> > would be the notion of "frame local resources", where each Python 
> > level execution frame implicitly provided a lazily instantiated 
> > ExitStack instance (or an equivalent) for resource management. 
> > Assuming that it offered an "enter_frame_context" function that mapped 
> > to "contextlib.ExitStack.enter_context", such a system would let us do 
> > things like: 
>
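Nick's example code was trimmed from the quote above. Purely as an
illustration of the kind of usage he describes -- enter_frame_context
is his proposed name, not a real API, and process() and handle() are
made up:

    def process(path):
        # Hypothetically registered with the frame's implicit
        # ExitStack; the file would be closed when process() exits,
        # rather than needing a 'with' block.
        f = enter_frame_context(open(path))
        for line in f:
            handle(line)
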
> So basically a 'with expression', that gives up the block syntax -- 
> taking its scope from the current function instead -- in return for 
> being usable in expression context? That's a really interesting idea, and I 
> see the intuition that it might be less disruptive if our implicit 
> iterclose calls are scoped to the function rather than the 'for' loop. 
>
> But having thought about it and investigated some... I don't think 
> function-scoping addresses my problem, and I don't see evidence that 
> it's meaningfully less disruptive to existing code. 
>
> First, "my problem": 
>
> Obviously, Python's a language that should be usable for folks doing 
> one-off scripts, and for paranoid folks trying to write robust complex 
> systems, and for everyone in between -- these are all really important 
> constituencies. And unfortunately, there is a trade-off here, where 
> the changes we're discussing affect these constituencies differently. 
> But it's not just a matter of shifting around a fixed amount of pain; 
> the *quality* of the pain really changes under the different 
> proposals. 
>
> In the status quo: 
> - for one-off scripts: you can just let the GC worry about generator 
> and file handle cleanup, re-use iterators, whatever, it's cool 
> - for robust systems: because it's the *caller's* responsibility to 
> ensure that iterators are cleaned up, you... kinda can't really use 
> generators without -- pick one -- (a) draconian style guides (like 
> forbidding 'with' inside generators or forbidding bare 'for' loops 
> entirely), (b) lots of auditing (every time you write a 'for' loop, go 
> read the source to the generator you're iterating over -- no 
> modularity for you and let's hope the answer doesn't change!), or (c) 
> introducing really subtle bugs. Or all of the above. It's true that a 
> lot of the time you can ignore this problem and get away with it one 
> way or another, but if you're trying to write robust code then this 
> doesn't really help -- it's like saying the footgun only has 1 bullet 
> in the chamber. Not as reassuring as you'd think. It's like if every 
> time you called a function, you had to explicitly say whether you 
> wanted exception handling to be enabled inside that function, and if 
> you forgot then the interpreter might just skip the 'finally' blocks 
> while unwinding. There just *isn't* a good solution available. 
>
> In my proposal (for-scoped-iterclose): 
> - for robust systems: life is great -- you're still stopping to think 
> a little about cleanup every time you use an iterator (because that's 
> what it means to write robust code!), but since the iterators now know 
> when they need cleanup and regular 'for' loops know how to invoke it, 
> then 99% of the time (i.e., whenever you don't intend to re-use an 
> iterator) you can be confident that just writing 'for' will do exactly 
> the right thing, and the other 1% of the time (when you do want to 
> re-use an iterator), you already *know* you're doing something clever. 
> So the cognitive overhead on each for-loop is really low. 
> - for one-off scripts: ~99% of the time (actual measurement, see 
> below) everything just works, except maybe a little bit better. 1% of 
> the time, you deploy the clever trick of re-using an iterator with 
> multiple for loops, and it breaks, so this is some pain. Here's what 
> you see: 
>
>     gen_obj = ... 
>     for first_line in gen_obj: 
>         break 
>     for lines in gen_obj: 
>         ... 
>
>     Traceback (most recent call last): 
>       File "/tmp/foo.py", line 5, in <module> 
>         for lines in gen_obj: 
>     AlreadyClosedIteratorError: this iterator was already closed, 
> possibly by a previous 'for' loop. (Maybe you want 
> itertools.preserve?) 
>
> (We could even have a PYTHONDEBUG flag that when enabled makes that 
> error message include the file:line of the previous 'for' loop that 
> called __iterclose__.) 
>
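For reference, the fix under the proposal is explicit and local --
itertools.preserve() is part of the proposal, not of any shipped
Python:

    gen_obj = ...
    # preserve() suppresses the implicit __iterclose__ at the end of
    # the first loop, so the generator survives for the second loop:
    for first_line in itertools.preserve(gen_obj):
        break
    for lines in gen_obj:
        ...
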
> So this is pain! But the pain is (a) rare, not pervasive, (b) 
> immediately obvious (an exception, the code doesn't work at all), not 
> subtle and delayed, (c) easily googleable, (d) easy to fix and the fix 
> is reliable. It's a totally different type of pain than the pain that 
> we currently impose on folks who want to write robust code. 
>
> Now compare to the new proposal (function-scoped-iterclose): 
>
> - For those who want robust cleanup: Usually, I only need an iterator 
> for as long as I'm iterating over it; that may or may not correspond 
> to the end of the function (often won't). When these don't coincide, 
> it can cause problems. E.g., consider the original example from my 
> proposal: 
>
>   def read_newline_separated_json(path): 
>       with open(path) as f: 
>           for line in f: 
>               yield json.loads(line) 
>
> but now suppose that I'm a Data Scientist (tm) so instead of having 1 
> file full of newline-separated JSON, I have 100 gigabytes' worth of 
> the stuff stored in lots of files in a directory tree. Well, that's no 
> problem, I'll just wrap that generator: 
>
>   def read_newline_separated_json_tree(tree): 
>       for root, _, paths in os.walk(tree): 
>           for path in paths: 
>               for document in read_newline_separated_json(join(root, path)): 
>                   yield document 
>
> And then I'll run it on PyPy, because that's what you do when you have 
> 100 GB of string processing, and... it'll crash, because the call to 
> read_newline_separated_json_tree ends up doing thousands of calls to 
> read_newline_separated_json, but never cleans any of them up until 
> the function exits, so eventually we run out of file descriptors. 
>

I still don't understand why you can't write it like this:

from contextlib import closing

def read_newline_separated_json_tree(tree):
    for root, _, paths in os.walk(tree):
        for path in paths:
            with closing(read_newline_separated_json(join(root, path))) as iterable:
                yield from iterable

(Generators don't implement the context manager protocol themselves, so
contextlib.closing is needed to supply the __enter__/__exit__ that call
.close().)  One import and zero extra lines of logic.  Works today.
Does everything you want.
 

>
> A similar situation arises in the main loop of something like an HTTP 
> server: 
>
>   while True: 
>       request = read_request(sock) 
>       for response_chunk in application_handler(request): 
>           send_response_chunk(sock, response_chunk) 
>

Same thing:

while True:
    request = read_request(sock)
    with closing(application_handler(request)) as iterable:
        for response_chunk in iterable:
            send_response_chunk(sock, response_chunk)


I'll stop posting about this, but I don't see the motivation behind this 
proposal except replacing one explicit context management line with a 
hidden "line" of cognitive overhead.  I think the solution is to stop 
returning an iterable when you have state that needs cleanup.  Instead, 
return a context manager and force the caller to open it to get at the 
iterable.
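
A sketch of that shape using only today's contextlib (the function name
comes from Nathaniel's example; splitting out an inner generator is my
own choice):

    import json
    from contextlib import contextmanager

    @contextmanager
    def read_newline_separated_json(path):
        def gen(f):
            for line in f:
                yield json.loads(line)
        with open(path) as f:
            # The iterable is only reachable through the context
            # manager, so the file's lifetime is always explicit:
            yield gen(f)

    # Usage ("docs.jsonl" is a placeholder path):
    with read_newline_separated_json("docs.jsonl") as documents:
        for document in documents:
            ...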

Best,

Neil
 

>
> Here we'll accumulate arbitrary numbers of un-closed 
> application_handler generators attached to the stack frame, which is 
> no good at all. And this has the interesting failure mode that you'll 
> probably miss it in testing, because most clients will only re-use a 
> connection a small number of times.  


> So what this means is that every time I write a for loop, I can't just 
> do a quick "am I going to break out of the for-loop and then re-use 
> this iterator?" check -- I have to stop and think about whether this 
> for-loop is nested inside some other loop, etc. And, again, if I get 
> it wrong, then it's a subtle bug that will bite me later. It's true 
> that with the status quo, we need to wrap X% of for-loops in 'with' 
> blocks, and with this proposal that number would drop to, I don't 
> know, (X/5)% or something. But that's not the most important cost: the 
> most important cost is the cognitive overhead of figuring out which 
> for-loops need the special treatment, and in this proposal that 
> checking is actually *more* complicated than the status quo. 
>
> - For those who just want to write a quick script and not think about 
> it: here's a script that does repeated partial for-loops over a 
> generator object: 
>
>     
> https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79 
>
> (and note that the generator object even has an ineffective 'with 
> open(...)' block inside it!) 
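
The pattern in question, paraphrased rather than quoted (the real
script is more involved, and the file and marker names here are
invented):

    def lines_of(path):
        with open(path) as f:    # this 'with' only helps if the
            for line in f:       # generator is exhausted or closed
                yield line

    it = lines_of("gprof.out")
    for line in it:              # first partial loop: skip the preamble
        if line.startswith(" %"):
            break
    for line in it:              # later loops resume the same iterator
        ...
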
>
> With the function-scoped-iterclose, this script would continue to work 
> as it does now. Excellent. 
>
> But, suppose that I decide that that main() function is really 
> complicated and that it would be better to refactor some of those 
> loops out into helper functions. (Probably actually true in this 
> example.) So I do that and... suddenly the code breaks. And in a 
> rather confusing way, because it has to do with this complicated 
> long-distance interaction between two different 'for' loops *and* 
> where they're placed with respect to the original function versus the 
> helper function. 
>
> If I were an intermediate-level Python student (and I'm pretty sure 
> anyone who is starting to get clever with re-using iterators counts as 
> "intermediate level"), then I'm pretty sure I'd actually prefer the 
> immediate obvious feedback from the for-scoped-iterclose. This would 
> actually be a good time to teach folks about this aspect of resource 
> handling -- it's certainly an important thing to learn 
> eventually on your way to Python mastery, even if it isn't needed for 
> every script. 
>
> In the pypy-dev thread about this proposal, there are some very 
> distressed emails from someone who's been writing Python for a long 
> time but only just realized that generator cleanup relies on the 
> garbage collector: 
>
>     https://mail.python.org/pipermail/pypy-dev/2016-October/014709.html 
>     https://mail.python.org/pipermail/pypy-dev/2016-October/014720.html 
>
> It's unpleasant to have the rug pulled out from under you like this 
> and suddenly realize that you might have to go re-evaluate all the 
> code you've ever written, and making for loops safe-by-default and 
> fail-fast-when-unsafe avoids that. 
>
> Anyway, in summary: function-scoped-iterclose doesn't seem to 
> accomplish my goal of getting rid of the *type* of pain involved when 
> you have to run a background thread in your brain that's doing 
> constant paranoid checking every time you write a for loop. Instead it 
> arguably takes that type of pain and spreads it around both the 
> experts and the novices :-/. 
>
> ------------- 
>
> Now, let's look at some evidence about how disruptive the two 
> proposals are for real code: 
>
> As mentioned else-thread, I wrote a stupid little CPython hack [1] to 
> report when the same iterator object gets passed to multiple 'for' 
> loops, and ran the CPython and Django testsuites with it [2]. Looking 
> just at generator objects [3], across these two large codebases there 
> are exactly 4 places where this happens. (Rough idea of prevalence: 
> these 4 places together account for a total of 8 'for' loops; this is 
> out of a total of 11,503 'for' loops, of which 665 involve 
> generator objects.) The 4 places are: 
>
> 1) CPython's Lib/test/test_collections.py:1135, 
> Lib/_collections_abc.py:378 
>
> This appears to be a bug in the CPython test suite -- the little MySet 
> class does 'def __init__(self, itr): self.contents = itr', which 
> assumes that itr is a container that can be repeatedly iterated. But a 
> bunch of the methods on collections.abc.Set like to pass in a 
> generator object here instead, which breaks everything. If repeated 
> 'for' loops on generators raised an error then this bug would have 
> been caught much sooner. 
>
> 2) CPython's Tools/scripts/gprof2html.py lines 45, 54, 59, 75 
>
> Discussed above -- as written, for-scoped-iterclose would break this 
> script, but function-scoped-iterclose would not, so here 
> function-scoped-iterclose wins. 
>
> 3) Django django/utils/regex_helper.py:236 
>
> This code is very similar to the previous example in its general 
> outline, except that the 'for' loops *have* been factored out into 
> utility functions. So in this case for-scoped-iterclose and 
> function-scoped-iterclose are equally disruptive. 
>
> 4) CPython's Lib/test/test_generators.py:723 
>
> I have to admit I cannot figure out what this code is doing, besides 
> showing off :-). But the different 'for' loops are in different stack 
> frames, so I'm pretty sure that for-scoped-iterclose and 
> function-scoped-iterclose would be equally disruptive. 
>
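To make example (1) concrete, here is a hedged reconstruction of the
problematic pattern (not the verbatim test code):

    from collections.abc import Set

    class MySet(Set):
        def __init__(self, itr):
            self.contents = itr          # fine for a list or a set...
        def __iter__(self):
            return iter(self.contents)   # ...but a generator is spent
        def __contains__(self, x):       # after one pass, so every
            return x in self.contents    # later iteration or membership
        def __len__(self):               # test sees an empty "set"
            return sum(1 for _ in self.contents)
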
> Obviously there's a bias here in that these are still relatively 
> "serious" libraries; I don't have a big corpus of one-off scripts that 
> are just a big __main__, though gprof2html.py isn't far from that. (If 
> anyone knows where to find such a thing let me know...) But still, the 
> tally here is that out of 4 examples, we have 1 subtle bug that 
> iterclose might have caught, 2 cases where for-scoped-iterclose and 
> function-scoped-iterclose are equally disruptive, and only 1 where 
> function-scoped-iterclose is less disruptive -- and in that case it's 
> arguably just avoiding an obvious error now in favor of a more 
> confusing error later. 
>
> If this reduced the backwards-incompatible cases by a factor of, like, 
> 10x or 100x, then that would be a pretty strong argument in its favor. 
> But it seems to be more like... 1.5x. 
>
> -n 
>
> [1] 
> https://github.com/njsmith/cpython/commit/2b9d60e1c1b89f0f1ac30cbf0a5dceee835142c2 
> [2] CPython: revision b0a272709b from the github mirror; Django: 
> revision 90c3b11e87 
> [3] I also looked at "all iterators" and "all iterators with .close 
> methods", but this email is long enough... basically the pattern is 
> the same: there are another 13 'for' loops that involve repeated 
> iteration over non-generator objects, and they're roughly equally 
> split between spurious effects due to bugs in the CPython test-suite 
> or my instrumentation, cases where for-scoped-iterclose and 
> function-scoped-iterclose both cause the same problems, and cases 
> where function-scoped-iterclose is less disruptive. 
>
> -- 
> Nathaniel J. Smith -- https://vorpus.org 