[Python-ideas] Deterministic iterator cleanup

Yury Selivanov yselivanov.ml at gmail.com
Wed Oct 19 17:52:34 EDT 2016


Nathaniel,

On 2016-10-19 5:02 PM, Nathaniel Smith wrote:

> Hi Yury,
>
> Thanks for the detailed comments! Replies inline below.

NP!

>
> On Wed, Oct 19, 2016 at 8:51 AM, Yury Selivanov <yselivanov.ml at gmail.com> wrote:
>> I'm -1 on the idea.  Here's why:
>>
>>
>> 1. Python is a very dynamic language with GC and that is one of its
>> fundamental properties.  This proposal might make GC of iterators more
>> deterministic, but that is only one case.
>>
>> For instance, in some places in asyncio source code we have statements like
>> this: "self = None".  Why?  When an exception occurs and we want to save it
>> (for instance to log it), it holds a reference to the Traceback object.
>> Which in turn references frame objects.  Which means that a lot of objects
>> in those frames will be alive while the exception object is alive.  So in
>> asyncio we go to great lengths to avoid unnecessary runs of GC, but this is
>> an exception!  Most Python code out there today doesn't do these sorts
>> of tricks.
>>
>> And this is just one example of how you can have cycles that require a run
>> of GC.  It is not possible to have deterministic GC in real life Python
>> applications.  This proposal addresses only *one* use case, leaving 100s of
>> others unresolved.
> Maybe I'm misunderstanding, but I think those 100s of other cases
> where you need deterministic cleanup are why 'with' blocks were
> invented, and in my experience they work great for that. Once you get
> in the habit, it's very easy and idiomatic to attach a 'with' to each
> file handle, socket, etc., at the point where you create it. So from
> where I stand, it seems like those 100s of unresolved cases actually
> are resolved?

Not all code can be written with 'with' statements; see my example with 
'self = None' in asyncio.  Python code can be quite complex, involving 
classes with __del__ methods that do cleanup, etc.  Fundamentally, you 
cannot make GC of such objects deterministic.
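
The 'self = None' trick above can be shown in a few lines (a minimal
sketch with illustrative names, not actual asyncio code):

```python
class Big:
    # Stand-in for a large object we want freed promptly.
    instances = 0
    def __init__(self):
        Big.instances += 1
    def __del__(self):
        Big.instances -= 1

saved = []

def handler(break_cycle):
    big = Big()
    try:
        raise RuntimeError("boom")
    except RuntimeError as exc:
        saved.append(exc)        # keep the exception, e.g. to log later
        # exc.__traceback__ references this frame, and the frame
        # references 'big', so 'big' stays alive for as long as the
        # saved exception does -- unless we drop the local explicitly:
        if break_cycle:
            big = None           # the "self = None" style trick

handler(break_cycle=False)
assert Big.instances == 1        # still alive via the saved traceback
saved.clear()
assert Big.instances == 0        # freed once the exception is dropped
```

(The assertions rely on CPython's reference counting; on other
implementations the collection point is not deterministic, which is
exactly the issue being discussed.)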

IOW I'm not convinced that if we implement your proposal we'll fix 90% 
(or even 30%) of cases where non-deterministic and postponed cleanup is 
harmful.
> The problem is that 'with' blocks are great, and generators are great,
> but when you put them together into the same language there's this
> weird interaction that emerges, where 'with' blocks inside generators
> don't really work for their intended purpose unless you're very
> careful and willing to write boilerplate.
>
> Adding deterministic cleanup to generators plugs this gap. Beyond
> that, I do think it's a nice bonus that other iterables can take
> advantage of the feature, but this isn't just a random "hey let's
> smush two constructs together to save a line of code" thing --
> iteration is special because it's where generator call stacks and
> regular call stacks meet.
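
(For reference, the interaction in question can be reproduced in a few
lines -- a minimal sketch:)

```python
log = []

def reader(chunks):
    # A 'with'-style cleanup inside a generator: the finally block only
    # runs when the generator is exhausted, closed, or garbage-collected,
    # *not* when the caller simply stops iterating.
    try:
        for chunk in chunks:
            yield chunk
    finally:
        log.append("cleaned up")

gen = reader(["a", "b", "c"])
assert next(gen) == "a"      # partial iteration...
assert log == []             # ...and the cleanup has not run yet
gen.close()                  # only an explicit close (or GC) triggers it
assert log == ["cleaned up"]
```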

Yes, I understand that your proposal really improves some things. OTOH 
it undeniably complicates the iteration protocol and requires a long 
period of deprecations, teaching users and library authors new 
semantics, etc.

We're only now beginning to see Python 3 gain traction.  I don't want us 
to harm that by introducing another set of things in Python 3 that are 
significantly different from Python 2.  DeprecationWarnings and 
__future__ imports don't excite users either.

>> IMO, while GC-related issues can be annoying to debug sometimes, it's not
>> worth it to change the behaviour of iteration in Python only to slightly
>> improve on this.
>>
>> 2. This proposal will make writing iterators significantly harder. Consider
>> 'itertools.chain'.  We will have to rewrite it to add the proposed
>> __iterclose__ method.  The Chain iterator object will have to track all of
>> its iterators, call __iterclose__ on them when it's necessary (there are a
>> few corner cases).  Given that this object is implemented in C, it's quite a
>> bit of work.  And we'll have a lot of objects to fix.
> When you say "make writing iterators significantly harder", is it fair
> to say that you're thinking mostly of what I'm calling "iterator
> wrappers"? For most day-to-day iterators, it's pretty trivial to
> either add a close method or not; the tricky cases are when you're
> trying to manage a collection of sub-iterators.

Yes, mainly iterator wrappers.  You'll also need to educate users to 
refactor their __del__ methods to __(a)iterclose__ in 3.6 (more on that 
below).
>
> itertools.chain is a great challenge / test case here, because I think
> it's about as hard as this gets :-). It took me a bit to wrap my head
> around, but I think I've got it, and that it's not so bad actually.

Now imagine that applied throughout the stdlib, with some of it 
implemented in C.  I'm not saying it's impossible; I'm saying it will 
require additional effort from CPython and the ecosystem.
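
For concreteness, here is a rough sketch of what such a rewrite could
look like in pure Python.  The __iterclose__ protocol is the proposal's,
not real Python, and the error-handling corner cases Yury mentions are
glossed over:

```python
class Chain:
    # Sketch of itertools.chain under the proposed __iterclose__
    # protocol: the wrapper must close the active sub-iterator and any
    # not-yet-started iterables when it is itself closed.
    def __init__(self, *iterables):
        self._pending = iter(iterables)
        self._current = None

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            if self._current is None:
                # StopIteration from the next line ends the whole chain.
                self._current = iter(next(self._pending))
            try:
                return next(self._current)
            except StopIteration:
                self._current = None

    def __iterclose__(self):
        try:
            if self._current is not None:
                self._close(self._current)
            for iterable in self._pending:
                self._close(iter(iterable))
        finally:
            self._current = None

    @staticmethod
    def _close(it):
        # Close only iterators that opt into the proposed protocol.
        close = getattr(it, "__iterclose__", None)
        if close is not None:
            close()

assert list(Chain([1, 2], [3])) == [1, 2, 3]
```

Note that even this sketch has to decide what happens if closing one
sub-iterator raises while others are still open -- one of the corner
cases a real C implementation would need to handle.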

[..]
>
>> 3. This proposal changes the behaviour of 'for' and 'async for' statements
>> significantly.  To do partial iteration you will have to use a special
>> builtin function to guard the iterator from being closed.  This is
>> completely non-obvious to any existing Python user and will be hard to
>> explain to newcomers.
> It's true that it's non-obvious to existing users, but that's true of
> literally every change that we could ever make :-). That's why we have
> release notes, deprecation warnings, enthusiastic blog posts, etc.

We don't often change the behavior of basic statements like 'for', if ever.
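
For reference, the guard in question would look something like this.
Both the preserve() name and the __iterclose__ protocol are from the
draft proposal, not real Python, so under today's semantics this wrapper
changes nothing:

```python
def preserve(iterable):
    # Sketch of the guard needed for partial iteration under the
    # proposal: forward iteration, but make the for-loop's implicit
    # close a no-op so the underlying iterator survives the loop.
    it = iter(iterable)

    class _Preserved:
        def __iter__(self):
            return self
        def __next__(self):
            return next(it)
        def __iterclose__(self):
            pass  # deliberately keep the wrapped iterator open

    return _Preserved()

nums = iter(range(5))
for x in preserve(nums):
    if x == 2:
        break
assert next(nums) == 3       # 'nums' survives and can be resumed
```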

>
> For newcomers... well, it's always difficult for those of us with more
> experience to put ourselves back in the mindset, but I don't see why
> this would be particularly difficult to explain? for loops consume
> their iterator; if you don't want that then here's how you avoid it.
> That's no more difficult to explain than what an iterator is in the
> first place, I don't think, and for me at least it's a lot easier to
> wrap my head around than the semantics of else blocks on for loops
> :-). (I always forget how those work.)

A lot of code that you find on StackOverflow etc. will be broken.  
Porting code from Python 2 (or Python < 3.6) will be challenging.  
People are still struggling to understand 'dict.keys()'-like views in 
Python 3.

>
>> 4. This proposal only addresses iteration with 'for' and 'async for'
>> statements.  If you iterate using a 'while' loop and 'next()' function, this
>> proposal wouldn't help you.  Also see the point #2 about third-party code.
> True. If you're doing manual iteration, then you are still responsible
> for manual cleanup (if that's what you want), just like today. This
> seems fine to me -- I'm not sure why it's an objection to this
> proposal :-).

Right now we can implement the __del__ method to clean up iterators.  
It works both for partial iteration and for cases where people forget 
to close the iterator explicitly.

With your proposal, to achieve the same (and make the code compatible 
with the new for-loop semantics), users will have to implement both 
__iterclose__ and __del__.
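
A sketch of that double implementation (names and the __iterclose__
protocol are illustrative only):

```python
class LineSource:
    # Under the proposal, an iterator that owns a resource needs
    # __iterclose__ (for for-loops) *and* __del__ (as a safety net for
    # manual next()-style iteration).
    def __init__(self, lines):
        self._lines = iter(lines)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            raise StopIteration
        return next(self._lines)

    def _release(self):
        self.closed = True   # stand-in for releasing a real resource

    def __iterclose__(self):  # would be called implicitly by for-loops
        self._release()

    def __del__(self):        # still needed for the manual-iteration case
        self._release()

src = LineSource(["a", "b"])
assert next(src) == "a"
src.__iterclose__()
assert src.closed
```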

>
>> 5. Asynchronous generators (AG) introduced by PEP 525 are finalized in a
>> very similar fashion to synchronous generators.  There is an API that
>> lets Python call into the event loop to finalize AGs.  asyncio in 3.6
>> (and other event loops in the near future) already uses this API to
>> ensure that *all AGs in a long-running program are properly finalized*
>> while it is being run.
>>
>> There is an extra loop method (`loop.shutdown_asyncgens`) that should be
>> called right before stopping the loop (exiting the program) to make sure
>> that all AGs are finalized, but if you forget to call it the world won't
>> end.  The process will end and the interpreter will shutdown, maybe issuing
>> a couple of ResourceWarnings.
> There is no law that says that the interpreter always shuts down after
> the event loop exits. We're talking about a fundamental language
> feature here, it shouldn't be dependent on the details of libraries
> and application shutdown tendencies :-(.

It's not about shutting down the interpreter or exiting the process.  
The majority of async applications just run the loop until they exit.  
The point of PEP 525, and of how finalization is handled in asyncio, is 
that AGs will be properly cleaned up the absolute majority of the time 
(i.e. while the loop is running).
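
For reference, the intended shutdown pattern, as a minimal runnable
sketch using the real asyncio API (loop.shutdown_asyncgens is available
since Python 3.6):

```python
import asyncio

log = []

async def ticker():
    try:
        while True:
            yield "tick"
    finally:
        log.append("finalized")   # runs when the loop finalizes us

agen = ticker()

async def main():
    # Partial iteration: the generator is left suspended at a yield.
    assert await agen.__anext__() == "tick"

loop = asyncio.new_event_loop()
loop.run_until_complete(main())
# The PEP 525 shutdown step: finalize all asynchronous generators that
# were started on this loop before closing it.
loop.run_until_complete(loop.shutdown_asyncgens())
loop.close()
assert log == ["finalized"]
```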

[..]
>> And if some AG isn't properly finalized a warning will be issued.
> This actually isn't true of the code currently in asyncio master -- if
> the loop is already closed (either manually by the user or by its
> __del__ being called) when the AG finalizer executes, then the AG is
> silently discarded:
>    https://github.com/python/asyncio/blob/e3fed68754002000be665ad1a379a747ad9247b6/asyncio/base_events.py#L352
>
> This isn't really an argument against the mechanism though, just a bug
> you should probably fix :-).

I don't think it's a bug.  When the loop is closed, the hook will do 
nothing, so the asynchronous generator will be cleaned up by the 
interpreter.  If it has an 'await' expression in its 'finally' block, 
the interpreter will issue a warning.

I'll add a comment explaining this.

>
> I guess it does point to my main dissatisfaction with the whole GC
> hook machinery, though. At this point I have spent many, many hours
> tracing through the details of this catching edge cases -- first
> during the initial PEP process, where there were a few rounds of
> revision, then again the last few days when I first thought I found a
> bunch of bugs that turned out to be spurious because I'd missed one
> line in the PEP, plus one real bug that you already know about (the
> finalizer-called-from-wrong-thread issue), and then I spent another
> hour carefully reading through the code again with PEP 442 open
> alongside once I realized how subtle the resurrection and cyclic
> reference issues are here, and now here's another minor bug for you.

Yes, I agree it's not an easy thing to digest.  The good thing is that 
asyncio has a reference implementation of PEP 525 support, so people 
can learn from it.  I'll definitely add more comments to make the code 
easier to read.

>
> At this point I'm about 85% confident that it does actually function
> as described, or that we'll at least be able to shake out any
> remaining weird edge cases over the next 6-12 months as people use it.
> But -- and I realize this is an aesthetic reaction as much as anything
> else -- this all feels *really* unpythonic to me. Looking at the Zen,
> the phrases that come to mind are "complicated", and "If the
> implementation is hard to explain, ...".
>
> The __(a)iterclose__ proposal definitely has its complexity as well,
> but it's a very different kind. The core is incredibly
> straightforward: "there is this method, for loops always call it".
> That's it. When you look at a for loop, you can be extremely confident
> about what's going to happen and when. Of course then there's the
> question of defining this method on all the diverse iterators that we
> have floating around -- I'm not saying it's trivial. But you can take
> them one at a time, and each individual case is pretty
> straightforward.

The __(a)iterclose__ semantics is clear.  What's not clear is how much 
harm changing the semantics of for-loops will do (and how to quantify 
the amount of good :))

[..]

>> 7. To conclude: I'm not convinced that this proposal fully solves the issue
>> of non-deterministic GC of iterators.  It cripples iteration protocols to
>> partially solve the problem for 'for' and 'async for' statements, leaving
>> manual iteration unresolved.  It will make it harder to write *correct*
>> (async-) iterators.  It introduces some *implicit* context management to
>> 'for' and 'async for' statements -- something that IMO should be done by
>> user with an explicit 'with' or 'async with'.
> The goal isn't to "fully solve the problem of non-deterministic GC of
> iterators". That would require magic :-). The goal is to provide tools
> so that when users run into this problem, they have viable options to
> solve it. Right now, we don't have those tools, as evidenced by the
> fact that I've basically never seen code that does this "correctly".
> We can tell people that they should be using explicit 'with' on every
> generator that might contain cleanup code, but they don't and they
> won't, and as a result their code quality is suffering on several axes
> (portability across Python implementations, 'with' blocks inside
> generators that don't actually do anything except spuriously hide
> ResourceWarnings, etc.).

Perhaps we should focus on teaching people that using 'with' statements 
inside (async-) generators is a bad idea.  Instead, the 'with' statement 
should wrap the code that uses the generator.
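
That pattern, as a minimal sketch (contextlib.closing simply calls 
.close() on exit, which generators already support today):

```python
from contextlib import closing

log = []

def records(lines):
    try:
        for line in lines:
            yield line.strip()
    finally:
        log.append("closed")     # stand-in for real cleanup

# The 'with' goes around the *use* of the generator, so cleanup is
# deterministic even if iteration stops early.
with closing(records([" a ", " stop ", " b "])) as recs:
    for rec in recs:
        if rec == "stop":
            break

assert log == ["closed"]
```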

Yury

