[Python-ideas] Deterministic iterator cleanup

Nathaniel Smith njs at pobox.com
Wed Oct 19 17:02:18 EDT 2016


Hi Yury,

Thanks for the detailed comments! Replies inline below.

On Wed, Oct 19, 2016 at 8:51 AM, Yury Selivanov <yselivanov.ml at gmail.com> wrote:
> I'm -1 on the idea.  Here's why:
>
>
> 1. Python is a very dynamic language with GC and that is one of its
> fundamental properties.  This proposal might make GC of iterators more
> deterministic, but that is only one case.
>
> For instance, in some places in asyncio source code we have statements like
> this: "self = None".  Why?  When an exception occurs and we want to save it
> (for instance to log it), it holds a reference to the Traceback object.
> Which in turn references frame objects.  Which means that a lot of objects
> in those frames will be alive while the exception object is alive.  So in
> asyncio we go to great lengths to avoid unnecessary runs of GC, but this is
> an exception!  Most Python code out there today doesn't do these
> sorts of tricks.
>
> And this is just one example of how you can have cycles that require a run
> of GC.  It is not possible to have deterministic GC in real life Python
> applications.  This proposal addresses only *one* use case, leaving 100s of
> others unresolved.

Maybe I'm misunderstanding, but I think those 100s of other cases
where you need deterministic cleanup are why 'with' blocks were
invented, and in my experience they work great for that. Once you get
in the habit, it's very easy and idiomatic to attach a 'with' to each
file handle, socket, etc., at the point where you create it. So from
where I stand, it seems like those 100s of unresolved cases actually
are resolved?

The problem is that 'with' blocks are great, and generators are great,
but when you put them together into the same language there's this
weird interaction that emerges, where 'with' blocks inside generators
don't really work for their intended purpose unless you're very
careful and willing to write boilerplate.
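
To make that interaction concrete, here's a minimal example (this is
today's behavior, no new machinery assumed):

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line

for line in read_lines("some-file.txt"):
    break
# The generator is now suspended inside its 'with' block.  CPython's
# refcounting usually happens to finalize it promptly, but that's an
# implementation detail -- on PyPy (or if any other reference is still
# alive) the file simply stays open until some later GC run.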

Adding deterministic cleanup to generators plugs this gap. Beyond
that, I do think it's a nice bonus that other iterables can take
advantage of the feature, but this isn't just a random "hey let's
smush two constructs together to save a line of code" thing --
iteration is special because it's where generator call stacks and
regular call stacks meet.

> IMO, while GC-related issues can be annoying to debug sometimes, it's not
> worth it to change the behaviour of iteration in Python only to slightly
> improve on this.
>
> 2. This proposal will make writing iterators significantly harder. Consider
> 'itertools.chain'.  We will have to rewrite it to add the proposed
> __iterclose__ method.  The Chain iterator object will have to track all of
> its iterators, call __iterclose__ on them when it's necessary (there are a
> few corner cases).  Given that this object is implemented in C, it's quite a
> bit of work.  And we'll have a lot of objects to fix.

When you say "make writing iterators significantly harder", is it fair
to say that you're thinking mostly of what I'm calling "iterator
wrappers"? For most day-to-day iterators, it's pretty trivial to
either add a close method or not; the tricky cases are when you're
trying to manage a collection of sub-iterators.
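
For example, a self-contained iterator needs nothing more than a
trivial method -- a sketch, assuming the proposed __iterclose__ name:

class CountUp:
    """Yields 1, 2, ..., n -- no resources to release."""
    def __init__(self, n):
        self._i, self._n = 0, n
    def __iter__(self):
        return self
    def __next__(self):
        if self._i >= self._n:
            raise StopIteration
        self._i += 1
        return self._i
    def __iterclose__(self):
        pass  # nothing to clean up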

itertools.chain is a great challenge / test case here, because I think
it's about as hard as this gets :-). It took me a bit to wrap my head
around, but I think I've got it, and that it's not so bad actually.

Right now, chain's semantics are:

# copied directly from the docs
def chain(*iterables):
    for it in iterables:
        for element in it:
            yield element

In a post-__iterclose__ world, the inner for loop there will already
handle closing each iterator as it finishes being consumed, and if the
generator is closed early then the inner for loop will also close the
current iterator. What we need to add is that if the generator is
closed early, we should also close all the unprocessed iterables.

The first change is to replace the outer for loop with a while/pop
loop, so that if an exception occurs we'll know which iterables remain
to be processed:

def chain(*iterables):
    iterables = list(iterables)  # *iterables is a tuple; pop() needs a list
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    ...

Now, what do we do if an exception does occur? We need to call
iterclose on all of the remaining iterables, but the tricky bit is
that this might itself raise new exceptions. If this happens, we don't
want to abort early; instead, we want to continue until we've closed
all the iterables, and then raise a chained exception. Basically what
we want is:

def chain(*iterables):
    iterables = list(iterables)
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    finally:
        try:
            operator.iterclose(iter(iterables[0]))
        finally:
            try:
                operator.iterclose(iter(iterables[1]))
            finally:
                try:
                    operator.iterclose(iter(iterables[2]))
                finally:
                    ...

but of course that's not valid syntax. Fortunately, it's not too hard
to rewrite that into real Python -- but it's a little dense:

def chain(*iterables):
    iterables = list(iterables)
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    # This is equivalent to the nested-finally chain above:
    except BaseException as last_exc:
        for iterable in iterables:
            try:
                operator.iterclose(iter(iterable))
            except BaseException as new_exc:
                # The interpreter's implicit chaining points new_exc at
                # the *original* exception; link it to the most recent
                # failure instead so nothing drops out of the chain:
                new_exc.__context__ = last_exc
                last_exc = new_exc
        raise last_exc

It's probably worth wrapping that bottom part into an iterclose_all()
helper, since the pattern probably occurs in other cases as well.
(Actually, now that I think about it, the map() example in the text
should be doing this instead of what it's currently doing... I'll fix
that.)
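
For concreteness, such a helper might look something like this (a
sketch only -- the name iterclose_all and its signature are
illustrative, not part of the proposal):

def iterclose_all(iterables, last_exc=None):
    # Close every iterable, chaining cleanup failures together.
    for iterable in iterables:
        try:
            operator.iterclose(iter(iterable))
        except BaseException as new_exc:
            new_exc.__context__ = last_exc
            last_exc = new_exc
    if last_exc is not None:
        raise last_exc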

This doesn't strike me as fundamentally complicated, really -- the
exception chaining logic makes it look scary, but basically it's just
the current chain() plus a cleanup loop. I believe that this handles
all the corner cases correctly. Am I missing something? And again,
this strikes me as one of the worst cases -- the vast majority of
iterators out there are not doing anything nearly this complicated
with subiterators.

> We can probably update all iterators in standard library (in 3.7), but what
> about third-party code?  It will take many years until you can say with
> certainty that most of Python code supports __iterclose__ / __aiterclose__.

Adding support to itertools, toolz.itertoolz, and generators (which
are the most common way to implement iterator wrappers) will probably
take care of 95% of uses, but yeah, there's definitely a long tail
that will take time to shake out. The (extremely tentative) transition
plan has __iterclose__ as opt-in until 3.9, so that's about 3.5 years
from now.

__aiterclose__ is a different matter of course, since there are very
very few async iterator wrappers in the wild, and in general I think
most people writing async iterators are watching async/await-related
language developments very closely.

> 3. This proposal changes the behaviour of 'for' and 'async for' statements
> significantly.  To do partial iteration you will have to use a special
> builtin function to guard the iterator from being closed.  This is
> completely non-obvious to any existing Python user and will be hard to
> explain to newcomers.

It's true that it's non-obvious to existing users, but that's true of
literally every change that we could ever make :-). That's why we have
release notes, deprecation warnings, enthusiastic blog posts, etc.

For newcomers... well, it's always difficult for those of us with more
experience to put ourselves back in the mindset, but I don't see why
this would be particularly difficult to explain? for loops consume
their iterator; if you don't want that then here's how you avoid it.
That's no more difficult to explain than what an iterator is in the
first place, I don't think, and for me at least it's a lot easier to
wrap my head around than the semantics of else blocks on for loops
:-). (I always forget how those work.)
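
For reference, partial iteration under the proposal would read
something like this (a sketch, using the preserve() guard from the
draft proposal; handle() is just a placeholder):

def read_header_and_body(lines):
    # Consume lines up to the first blank line, but leave the
    # iterator open so we can keep reading from it afterwards:
    for line in preserve(lines):
        if line == "\n":
            break
    # This loop picks up where the first left off, and *does* close
    # 'lines' when it finishes or is interrupted:
    for line in lines:
        handle(line)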

> 4. This proposal only addresses iteration with 'for' and 'async for'
> statements.  If you iterate using a 'while' loop and 'next()' function, this
> proposal wouldn't help you.  Also see the point #2 about third-party code.

True. If you're doing manual iteration, then you are still responsible
for manual cleanup (if that's what you want), just like today. This
seems fine to me -- I'm not sure why it's an objection to this
proposal :-).

> 5. Asynchronous generators (AG) introduced by PEP 525 are finalized in a
> very similar fashion to synchronous generators.  There is an API to help
> Python to call event loop to finalize AGs.  asyncio in 3.6 (and other event
> loops in the near future) already uses this API to ensure that *all AGs in a
> long-running program are properly finalized* while it is being run.
>
> There is an extra loop method (`loop.shutdown_asyncgens`) that should be
> called right before stopping the loop (exiting the program) to make sure
> that all AGs are finalized, but if you forget to call it the world won't
> end.  The process will end and the interpreter will shutdown, maybe issuing
> a couple of ResourceWarnings.

There is no law that says that the interpreter always shuts down after
the event loop exits. We're talking about a fundamental language
feature here, it shouldn't be dependent on the details of libraries
and application shutdown tendencies :-(.

> No exception will pass silently in the current PEP 525 implementation.

Exceptions that occur inside a garbage-collected iterator will be
printed to the console, or possibly logged according to whatever the
event loop does with unhandled exceptions. And sure, that's better
than nothing, if someone remembers to look at the console/logs. But
they *won't* be propagated out to the containing frame, they can't be
caught, etc. That's a really big difference.

> And if some AG isn't properly finalized a warning will be issued.

This actually isn't true of the code currently in asyncio master -- if
the loop is already closed (either manually by the user or by its
__del__ being called) when the AG finalizer executes, then the AG is
silently discarded:
  https://github.com/python/asyncio/blob/e3fed68754002000be665ad1a379a747ad9247b6/asyncio/base_events.py#L352

This isn't really an argument against the mechanism though, just a bug
you should probably fix :-).

I guess it does point to my main dissatisfaction with the whole GC
hook machinery, though. At this point I have spent many, many hours
tracing through the details of this machinery and catching edge
cases: first during the initial PEP process, where there were a few
rounds of revision; then again over the last few days, when I first
thought I'd found a bunch of bugs that turned out to be spurious
because I'd missed one line in the PEP, plus one real bug that you
already know about (the finalizer-called-from-wrong-thread issue);
and then another hour of carefully reading through the code with PEP
442 open alongside, once I realized how subtle the resurrection and
cyclic reference issues are here. And now here's another minor bug
for you.

At this point I'm about 85% confident that it does actually function
as described, or that we'll at least be able to shake out any
remaining weird edge cases over the next 6-12 months as people use it.
But -- and I realize this is an aesthetic reaction as much as anything
else -- this all feels *really* unpythonic to me. Looking at the Zen,
the phrases that come to mind are "complicated", and "If the
implementation is hard to explain, ...".

The __(a)iterclose__ proposal definitely has its complexity as well,
but it's a very different kind. The core is incredibly
straightforward: "there is this method, for loops always call it".
That's it. When you look at a for loop, you can be extremely confident
about what's going to happen and when. Of course then there's the
question of defining this method on all the diverse iterators that we
have floating around -- I'm not saying it's trivial. But you can take
them one at a time, and each individual case is pretty
straightforward.
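
To spell that core out (a sketch of the intended semantics, assuming
the proposed operator.iterclose() helper; do_something() is a
placeholder, and this is not a literal implementation):

# Under the proposal, "for x in iterable: do_something(x)"
# behaves roughly like:

_it = iter(iterable)
try:
    while True:
        try:
            x = next(_it)
        except StopIteration:
            break
        do_something(x)
finally:
    operator.iterclose(_it)  # calls _it.__iterclose__() if it exists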

> The current AG finalization mechanism must stay even if this proposal gets
> accepted, as it ensures that even manually iterated AGs are properly
> finalized.

Like I said in the text, I don't find this very persuasive, since if
you're manually iterating then you can just as well take manual
responsibility for cleaning things up. But I could live with both
mechanisms co-existing.

> 6. If this proposal gets accepted, I think we shouldn't introduce it in any
> form in 3.6.  It's too late to implement it for both sync- and
> async-generators.  Implementing it only for async-generators will only add
> cognitive overhead.  Even implementing this only for async-generators will
> (and should!) delay 3.6 release significantly.

I certainly don't want to delay 3.6. I'm not as convinced as you that
the async-generator code alone is so complicated that it would force a
delay, but if it is then 3.6.1 is also an option worth considering.

> 7. To conclude: I'm not convinced that this proposal fully solves the issue
> of non-deterministic GC of iterators.  It cripples iteration protocols to
> partially solve the problem for 'for' and 'async for' statements, leaving
> manual iteration unresolved.  It will make it harder to write *correct*
> (async-) iterators.  It introduces some *implicit* context management to
> 'for' and 'async for' statements -- something that IMO should be done by
> user with an explicit 'with' or 'async with'.

The goal isn't to "fully solve the problem of non-deterministic GC of
iterators". That would require magic :-). The goal is to provide tools
so that when users run into this problem, they have viable options to
solve it. Right now, we don't have those tools, as evidenced by the
fact that I've basically never seen code that does this "correctly".
We can tell people that they should be using explicit 'with' on every
generator that might contain cleanup code, but they don't and they
won't, and as a result their code quality is suffering on several axes
(portability across Python implementations, 'with' blocks inside
generators that don't actually do anything except spuriously hide
ResourceWarnings, etc.).
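
For the record, doing it correctly today requires boilerplate like
this (reusing the read_lines generator from the earlier example):

from contextlib import closing

with closing(read_lines("some-file.txt")) as lines:
    for line in lines:
        break
# lines.close() runs here, unwinding the generator's 'with' block and
# closing the file deterministically, on any Python implementation.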

Adding __(a)iterclose__ to (async) for loops makes it easy and
convenient to do the right thing in common cases; and in the
less-usual case where you want to do manual iteration, then you can
and should use a manual 'with' block too. The proposal is not trying
to replace 'with' blocks :-).

As for implicitness, eh. If 'for' is defined to mean 'iterate and then
close', then that's what 'for' means. If we make the change then there
won't be anything more implicit about 'for' calling __iterclose__ than
there is about 'for' calling __iter__ or __next__. Definitely this
will take some adjustment for those who are used to the old system,
but sometimes that's the price of progress ;-).

-n

-- 
Nathaniel J. Smith -- https://vorpus.org

