[Python-ideas] Deterministic iterator cleanup

Nathaniel Smith njs at pobox.com
Tue Oct 25 18:48:53 EDT 2016


...Doh. I spent all that time evaluating the function-scoped-cleanup
proposal from the high-level design perspective, and then immediately
after hitting send, I suddenly realized that I'd missed a much more
straightforward technical problem.

One thing that 'with' blocks / for-scoped-iterclose do is that they
put an upper bound on the lifetime of generator objects. That's
important if you're using a non-refcounting-GC, or if there might be
reference cycles. But it's not all they do: they also arrange to make
sure that any cleanup code is executed in the context of the code
that's using the generator. This is *also* really important: if you
have an exception in your cleanup code, and the GC runs your cleanup
code, then that exception will just disappear into nothingness (well,
it'll get printed to the console, but that's hardly better). So you
don't want to let the GC run your cleanup code. If you have an async
generator, you want to run the cleanup code under supervision of the
calling function's coroutine runner, and ideally block the running
coroutine while you do it; doing this from the GC is
difficult-to-impossible (depending on how picky you are -- PEP 525
does part of it, but not all). Again, letting the GC get involved is
bad.
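
To make this concrete, here's a minimal sketch of the difference, using
nothing but today's Python (the generator name is made up):

    import gc

    def frobnicate():
        try:
            yield 1
        finally:
            # pretend this is real cleanup that can fail, e.g. flushing
            # buffered data to a broken socket
            raise OSError("cleanup failed")

    # Cleanup runs in the caller's context: the exception propagates
    # normally and can be caught and handled.
    it = frobnicate()
    next(it)
    try:
        it.close()
    except OSError as exc:
        print("caller saw:", exc)

    # Cleanup runs from the GC: the same exception is merely reported to
    # stderr as an "unraisable" error; no except block anywhere sees it.
    it = frobnicate()
    next(it)
    del it
    gc.collect()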

So for the function-scoped-iterclose proposal: does this implicit
ExitStack-like object take a strong reference to iterators, or just a
weak one?

If it takes a strong reference, then suddenly we're pinning all
iterators in memory until the end of the enclosing function, which
will often look like a memory leak. I think this would break a *lot*
more existing code than the for-scoped-iterclose proposal does, and in
more obscure ways that are harder to detect and warn about ahead of
time. So that's out.
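
To see the pinning concretely, here's a rough simulation of the
strong-reference variant, using an explicit ExitStack to stand in for
the hypothetical implicit per-frame stack:

    import contextlib

    def count_lines(paths):
        # "frame_stack" stands in for the implicit frame-scoped stack
        with contextlib.ExitStack() as frame_stack:
            total = 0
            for path in paths:
                f = frame_stack.enter_context(open(path))
                total += sum(1 for _ in f)
                # today, f could be closed (or collected) right here;
                # with a strong frame-scoped reference, every file stays
                # open until the function returns -- len(paths)
                # descriptors at once
            return total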

If it takes a weak reference, ... then there's a good chance that
iterators will get garbage collected before the ExitStack has a chance
to clean them up properly. So we still have no guarantee that the
cleanup will happen in the right context, that exceptions will not be
lost, and so forth. In fact, it becomes literally non-deterministic:
you might see an exception propagate properly on one run, and not on
the next, depending on exactly when the garbage collector happened to
run.
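
A quick way to see the hazard, with an explicit weakref standing in for
the hypothetical frame-scoped registration:

    import weakref

    def frobnicate():
        try:
            yield 1
        finally:
            raise OSError("cleanup failed")

    it = frobnicate()
    next(it)
    ref = weakref.ref(it)   # all the frame's stack would hold is this
    del it                  # last strong reference is gone...
    # On CPython the generator is collected immediately: ref() is now
    # None, and the OSError was already swallowed as an "unraisable"
    # error.  On PyPy the generator may survive until the next GC cycle,
    # in which case an end-of-frame cleanup *could* still find it via
    # ref() and propagate the error properly.  Which behaviour you get
    # depends entirely on GC timing.
    print(ref())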

IMHO that's *way* too spooky to be allowed, but I can't see any way to
fix it within the function-scoping framework :-(

-n

On Tue, Oct 25, 2016 at 3:25 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> On 20 October 2016 at 07:02, Nathaniel Smith <njs at pobox.com> wrote:
>>> The first change is to replace the outer for loop with a while/pop
>>> loop, so that if an exception occurs we'll know which iterables remain
>>> to be processed:
>>>
>>> def chain(*iterables):
>>>     try:
>>>         while iterables:
>>>             for element in iterables.pop(0):
>>>                 yield element
>>>     ...
>>>
>>> Now, what do we do if an exception does occur? We need to call
>>> iterclose on all of the remaining iterables, but the tricky bit is
>>> that this might itself raise new exceptions. If this happens, we don't
>>> want to abort early; instead, we want to continue until we've closed
>>> all the iterables, and then raise a chained exception. Basically what
>>> we want is:
>>>
>>> def chain(*iterables):
>>>     try:
>>>         while iterables:
>>>             for element in iterables.pop(0):
>>>                 yield element
>>>     finally:
>>>         try:
>>>             operators.iterclose(iter(iterables[0]))
>>>         finally:
>>>             try:
>>>                 operators.iterclose(iter(iterables[1]))
>>>             finally:
>>>                 try:
>>>                     operators.iterclose(iter(iterables[2]))
>>>                 finally:
>>>                     ...
>>>
>>> but of course that's not valid syntax. Fortunately, it's not too hard
>>> to rewrite that into real Python -- but it's a little dense:
>>>
>>> def chain(*iterables):
>>>     try:
>>>         while iterables:
>>>             for element in iterables.pop(0):
>>>                 yield element
>>>     # This is equivalent to the nested-finally chain above:
>>>     except BaseException as last_exc:
>>>         for iterable in iterables:
>>>             try:
>>>                 operators.iterclose(iter(iterable))
>>>             except BaseException as new_exc:
>>>                 if new_exc.__context__ is None:
>>>                     new_exc.__context__ = last_exc
>>>                 last_exc = new_exc
>>>         raise last_exc
>>>
>>> It's probably worth wrapping that bottom part into an iterclose_all()
>>> helper, since the pattern probably occurs in other cases as well.
>>> (Actually, now that I think about it, the map() example in the text
>>> should be doing this instead of what it's currently doing... I'll fix
>>> that.)
>>
>> At this point your code is starting to look a whole lot like the code
>> in contextlib.ExitStack.__exit__ :)
>
> One of the versions I tried but didn't include in my email used
> ExitStack :-). It turns out not to work here: the problem is that we
> effectively need to enter *all* the contexts before unwinding, even if
> trying to enter one of them fails. ExitStack is nested like (try (try
> (try ... finally) finally) finally), and we need (try finally (try
> finally (try finally ...))). But this is just a small side-point
> anyway, since most code is not implementing complicated
> meta-iterators; I'll address your real proposal below.
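>
> For concreteness, here's roughly what that iterclose_all() helper
> could look like -- same logic as the except-block version above, just
> factored out, with "operators.iterclose" still being the placeholder
> spelling from the draft:
>
>     def iterclose_all(iterables):
>         # close everything, even if some of the closes fail, and chain
>         # any exceptions together so none of them get lost
>         pending_exc = None
>         for iterable in iterables:
>             try:
>                 operators.iterclose(iter(iterable))
>             except BaseException as new_exc:
>                 if new_exc.__context__ is None:
>                     new_exc.__context__ = pending_exc
>                 pending_exc = new_exc
>         if pending_exc is not None:
>             raise pending_exc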
>
>> Accordingly, I'm going to suggest that while I agree the problem you
>> describe is one that genuinely emerges in large production
>> applications and other complex systems, this particular solution is
>> simply far too intrusive to be accepted as a language change for
>> Python - you're talking about a fundamental change to the meaning of
>> iteration for the sake of the relatively small portion of the
>> community that either work on such complex services, or insist on
>> writing their code as if it might become part of such a service, even
>> when it currently isn't. Given that simple applications vastly
>> outnumber complex ones, and always will, I think making such a change
>> would be a bad trade-off that didn't come close to justifying the
>> costs imposed on the rest of the ecosystem to adjust to it.
>>
>> A potentially more fruitful direction of research to pursue for 3.7
>> would be the notion of "frame local resources", where each Python
>> level execution frame implicitly provided a lazily instantiated
>> ExitStack instance (or an equivalent) for resource management.
>> Assuming that it offered an "enter_frame_context" function that mapped
>> to "contextlib.ExitStack.enter_context", such a system would let us do
>> things like:
>
> So basically a 'with expression' that gives up the block syntax --
> taking its scope from the current function instead -- in return for
> being usable in expression context? That's a really interesting
> idea, and I
> see the intuition that it might be less disruptive if our implicit
> iterclose calls are scoped to the function rather than the 'for' loop.
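>
> Something like this hypothetical spelling, I guess, just to pin down
> what we're comparing (neither the name nor where it lives is settled):
>
>     def first_line(path):
>         # enter_frame_context is the suggested hook; the file would
>         # stay open until first_line's frame exits
>         f = enter_frame_context(open(path))
>         return next(iter(f))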
>
> But having thought about it and investigated some... I don't think
> function-scoping addresses my problem, and I don't see evidence that
> it's meaningfully less disruptive to existing code.
>
> First, "my problem":
>
> Obviously, Python's a language that should be usable for folks doing
> one-off scripts, and for paranoid folks trying to write robust complex
> systems, and for everyone in between -- these are all really important
> constituencies. And unfortunately, there is a trade-off here, where
> the changes we're discussing affect these constituencies differently.
> But it's not just a matter of shifting around a fixed amount of pain;
> the *quality* of the pain really changes under the different
> proposals.
>
> In the status quo:
> - for one-off scripts: you can just let the GC worry about generator
> and file handle cleanup, re-use iterators, whatever, it's cool
> - for robust systems: because it's the *caller's* responsibility to
> ensure that iterators are cleaned up, you... kinda can't really use
> generators without -- pick one -- (a) draconian style guides (like
> forbidding 'with' inside generators or forbidding bare 'for' loops
> entirely), (b) lots of auditing (every time you write a 'for' loop, go
> read the source to the generator you're iterating over -- no
> modularity for you and let's hope the answer doesn't change!), or (c)
> introducing really subtle bugs. Or all of the above. It's true that a
> lot of the time you can ignore this problem and get away with it one
> way or another, but if you're trying to write robust code then this
> doesn't really help -- it's like saying the footgun only has 1 bullet
> in the chamber. Not as reassuring as you'd think. It's like if every
> time you called a function, you had to explicitly say whether you
> wanted exception handling to be enabled inside that function, and if
> you forgot then the interpreter might just skip the 'finally' blocks
> while unwinding. There just *isn't* a good solution available.
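>
> Concretely, the kind of subtle bug in (c) is the classic one where an
> innocent-looking early exit leaves a generator suspended -- and this
> is plain Python exactly as it works today:
>
>     def read_lines(path):
>         with open(path) as f:       # looks safe...
>             for line in f:
>                 yield line
>
>     def find_marker(path):
>         for line in read_lines(path):
>             if "MARKER" in line:
>                 # returning leaves the generator suspended; the file
>                 # stays open until the GC eventually runs the 'with'
>                 # block's cleanup, whenever that happens to be
>                 return line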
>
> In my proposal (for-scoped-iterclose):
> - for robust systems: life is great -- you're still stopping to think
> a little about cleanup every time you use an iterator (because that's
> what it means to write robust code!), but since the iterators now know
> when they need cleanup and regular 'for' loops know how to invoke it,
> then 99% of the time (i.e., whenever you don't intend to re-use an
> iterator) you can be confident that just writing 'for' will do exactly
> the right thing, and the other 1% of the time (when you do want to
> re-use an iterator), you already *know* you're doing something clever.
> So the cognitive overhead on each for-loop is really low.
> - for one-off scripts: ~99% of the time (actual measurement, see
> below) everything just works, if anything a little bit better. 1% of
> the time, you deploy the clever trick of re-using an iterator with
> multiple for loops, and it breaks, so this is some pain. Here's what
> you see:
>
>     gen_obj = ...
>     for first_line in gen_obj:
>         break
>     for lines in gen_obj:
>         ...
>
>     Traceback (most recent call last):
>       File "/tmp/foo.py", line 5, in <module>
>         for lines in gen_obj:
>     AlreadyClosedIteratorError: this iterator was already closed,
> possibly by a previous 'for' loop. (Maybe you want
> itertools.preserve?)
>
> (We could even have a PYTHONDEBUG flag that when enabled makes that
> error message include the file:line of the previous 'for' loop that
> called __iterclose__.)
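>
> And the fix the error message points at would be a one-liner, assuming
> the itertools.preserve() helper from the proposal:
>
>     for first_line in itertools.preserve(gen_obj):
>         break
>     for lines in gen_obj:
>         ...
>
> i.e., "I know what I'm doing, don't close this one for me."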
>
> So this is pain! But the pain is (a) rare, not pervasive, (b)
> immediately obvious (an exception, the code doesn't work at all), not
> subtle and delayed, (c) easily googleable, (d) easy to fix and the fix
> is reliable. It's a totally different type of pain than the pain that
> we currently impose on folks who want to write robust code.
>
> Now compare to the new proposal (function-scoped-iterclose):
>
> - For those who want robust cleanup: Usually, I only need an iterator
> for as long as I'm iterating over it; that may or may not correspond
> to the end of the function (often won't). When these don't coincide,
> it can cause problems. E.g., consider the original example from my
> proposal:
>
>   def read_newline_separated_json(path):
>       with open(path) as f:
>           for line in f:
>               yield json.loads(line)
>
> but now suppose that I'm a Data Scientist (tm) so instead of having 1
> file full of newline-separated JSON, I have 100 gigabytes' worth of
> the stuff stored in lots of files in a directory tree. Well, that's no
> problem, I'll just wrap that generator:
>
>   def read_newline_separated_json_tree(tree):
>       for root, _, paths in os.walk(tree):
>           for path in paths:
>               for document in read_newline_separated_json(join(root, path)):
>                   yield document
>
> And then I'll run it on PyPy, because that's what you do when you have
> 100 GB of string processing, and... it'll crash, because the call to
> read_newline_separated_json_tree ends up doing thousands of calls to
> read_newline_separated_json, but never cleans any of them up until
> the function exits, so eventually we run out of file descriptors.
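>
> The fix under function-scoped cleanup would be to go back to closing
> each inner generator by hand, something like this contextlib.closing
> sketch, which is exactly the kind of manual bookkeeping the whole
> exercise was supposed to remove:
>
>   import contextlib, os
>   from os.path import join
>
>   def read_newline_separated_json_tree(tree):
>       for root, _, paths in os.walk(tree):
>           for path in paths:
>               with contextlib.closing(
>                       read_newline_separated_json(join(root, path))) as docs:
>                   for document in docs:
>                       yield document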
>
> A similar situation arises in the main loop of something like an HTTP server:
>
>   while True:
>       request = read_request(sock)
>       for response_chunk in application_handler(request):
>           send_response_chunk(sock, response_chunk)
>
> Here we'll accumulate arbitrary numbers of un-closed
> application_handler generators attached to the stack frame, which is
> no good at all. And this has the interesting failure mode that you'll
> probably miss it in testing, because most clients will only re-use a
> connection a small number of times.
>
> So what this means is that every time I write a for loop, I can't just
> do a quick "am I going to break out of the for-loop and then re-use
> this iterator?" check -- I have to stop and think about whether this
> for-loop is nested inside some other loop, etc. And, again, if I get
> it wrong, then it's a subtle bug that will bite me later. It's true
> that with the status quo, we need to wrap X% of for-loops in 'with'
> blocks, and with this proposal that number would drop to, I don't
> know, (X/5)% or something. But that's not the most important cost: the
> most important cost is the cognitive overhead of figuring out which
> for-loops need the special treatment, and in this proposal that
> checking is actually *more* complicated than the status quo.
>
> - For those who just want to write a quick script and not think about
> it: here's a script that does repeated partial for-loops over a
> generator object:
>
>     https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79
>
> (and note that the generator object even has an ineffective 'with
> open(...)' block inside it!)
>
> With the function-scoped-iterclose, this script would continue to work
> as it does now. Excellent.
>
> But, suppose that I decide that that main() function is really
> complicated and that it would be better to refactor some of those
> loops out into helper functions. (Probably actually true in this
> example.) So I do that and... suddenly the code breaks. And in a
> rather confusing way, because it has to do with this complicated
> long-distance interaction between two different 'for' loops *and*
> where they're placed with respect to the original function versus the
> helper function.
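>
> Roughly, the failure mode would look like this (hypothetical behaviour
> under function-scoped-iterclose, not today's Python):
>
>     def skip_header(lines):
>         for line in lines:
>             if not line.strip():
>                 # the helper returns, its frame's cleanup runs, and --
>                 # under the proposal -- 'lines' gets closed here
>                 return
>
>     def main(path):
>         lines = (line for line in open(path))
>         skip_header(lines)
>         for line in lines:    # ...so this loop now finds a closed iterator
>             print(line, end="")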
>
> If I were an intermediate-level Python student (and I'm pretty sure
> anyone who is starting to get clever with re-using iterators counts as
> "intermediate level"), then I'm pretty sure I'd actually prefer the
> immediate obvious feedback from the for-scoped-iterclose. This would
> actually be a good time to teach folks about this aspect of resource
> handling -- it's certainly an important thing to learn
> eventually on your way to Python mastery, even if it isn't needed for
> every script.
>
> In the pypy-dev thread about this proposal, there's some very
> distressed emails from someone who's been writing Python for a long
> time but only just realized that generator cleanup relies on the
> garbage collector:
>
>     https://mail.python.org/pipermail/pypy-dev/2016-October/014709.html
>     https://mail.python.org/pipermail/pypy-dev/2016-October/014720.html
>
> It's unpleasant to have the rug pulled out from under you like this
> and suddenly realize that you might have to go re-evaluate all the
> code you've ever written, and making for loops safe-by-default and
> fail-fast-when-unsafe avoids that.
>
> Anyway, in summary: function-scoped-iterclose doesn't seem to
> accomplish my goal of getting rid of the *type* of pain involved when
> you have to run a background thread in your brain that's doing
> constant paranoid checking every time you write a for loop. Instead it
> arguably takes that type of pain and spreads it around both the
> experts and the novices :-/.
>
> -------------
>
> Now, let's look at some evidence about how disruptive the two
> proposals are for real code:
>
> As mentioned else-thread, I wrote a stupid little CPython hack [1] to
> report when the same iterator object gets passed to multiple 'for'
> loops, and ran the CPython and Django testsuites with it [2]. Looking
> just at generator objects [3], across these two large codebases there
> are exactly 4 places where this happens. (Rough idea of prevalence:
> these 4 places together account for a total of 8 'for' loops; this is
> out of a total of 11,503 'for' loops, of which 665 involve
> generator objects.) The 4 places are:
>
> 1) CPython's Lib/test/test_collections.py:1135, Lib/_collections_abc.py:378
>
> This appears to be a bug in the CPython test suite -- the little MySet
> class does 'def __init__(self, itr): self.contents = itr', which
> assumes that itr is a container that can be repeatedly iterated. But a
> bunch of the methods on collections.abc.Set like to pass in a
> generator object here instead, which breaks everything. If repeated
> 'for' loops on generators raised an error then this bug would have
> been caught much sooner.
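>
> Simplified, the pattern is something like this (not the exact
> test-suite code):
>
>     class MySet:
>         def __init__(self, itr):
>             self.contents = itr   # quietly assumes itr can be re-iterated
>         def __iter__(self):
>             return iter(self.contents)
>
>     s = MySet(x for x in range(3))
>     list(s)   # [0, 1, 2]
>     list(s)   # [] -- the generator is already exhausted
>
> Storing list(itr) instead would make the copy explicit and fix it.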
>
> 2) CPython's Tools/scripts/gprof2html.py lines 45, 54, 59, 75
>
> Discussed above -- as written, for-scoped-iterclose would break this
> script, but function-scoped-iterclose would not, so here
> function-scoped-iterclose wins.
>
> 3) Django django/utils/regex_helper.py:236
>
> This code is very similar to the previous example in its general
> outline, except that the 'for' loops *have* been factored out into
> utility functions. So in this case for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive.
>
> 4) CPython's Lib/test/test_generators.py:723
>
> I have to admit I cannot figure out what this code is doing, besides
> showing off :-). But the different 'for' loops are in different stack
> frames, so I'm pretty sure that for-scoped-iterclose and
> function-scoped-iterclose would be equally disruptive.
>
> Obviously there's a bias here in that these are still relatively
> "serious" libraries; I don't have a big corpus of one-off scripts that
> are just a big __main__, though gprof2html.py isn't far from that. (If
> anyone knows where to find such a thing let me know...) But still, the
> tally here is that out of 4 examples, we have 1 subtle bug that
> iterclose might have caught, 2 cases where for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive, and only 1 where
> function-scoped-iterclose is less disruptive -- and in that case it's
> arguably just avoiding an obvious error now in favor of a more
> confusing error later.
>
> If this reduced the backwards-incompatible cases by a factor of, like,
> 10x or 100x, then that would be a pretty strong argument in its favor.
> But it seems to be more like... 1.5x.
>
> -n
>
> [1] https://github.com/njsmith/cpython/commit/2b9d60e1c1b89f0f1ac30cbf0a5dceee835142c2
> [2] CPython: revision b0a272709b from the github mirror; Django:
> revision 90c3b11e87
> [3] I also looked at "all iterators" and "all iterators with .close
> methods", but this email is long enough... basically the pattern is
> the same: there are another 13 'for' loops that involve repeated
> iteration over non-generator objects, and they're roughly equally
> split between spurious effects due to bugs in the CPython test-suite
> or my instrumentation, cases where for-scoped-iterclose and
> function-scoped-iterclose both cause the same problems, and cases
> where function-scoped-iterclose is less disruptive.
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org



-- 
Nathaniel J. Smith -- https://vorpus.org

