[Python-Dev] PEP 525, third round, better finalization

Nick Coghlan ncoghlan at gmail.com
Sat Sep 3 11:42:41 EDT 2016


On 2 September 2016 at 19:13, Nathaniel Smith <njs at pobox.com> wrote:
> This works OK on CPython because the reference-counting gc will call
> handle.__del__() at the end of the scope (so on CPython it's at level
> 2), but it famously causes huge problems when porting to PyPy with
> its much faster and more sophisticated gc that only runs when
> triggered by memory pressure. (Or for "PyPy" you can substitute
> "Jython", "IronPython", whatever.) Technically this code doesn't
> actually "leak" file descriptors on PyPy, because handle.__del__()
> will get called *eventually* (this code is at level 1, not level 0),
> but by the time "eventually" arrives your server process has probably
> run out of file descriptors and crashed. Level 1 isn't good enough. So
> now we have all learned to instead write
>
>  # good modern Python style:
>  def get_file_contents(path):
>       with open(path) as handle:
>           return handle.read()

This only works if the file fits in memory - otherwise you just have
to accept the fact that you need to leave the file handle open until
you're "done with the iterator", which means deferring the resource
management to the caller.
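
Roughly, that hand-off looks like the sketch below (the helper name is
just for illustration): the caller owns the file handle, and the
generator merely consumes it.

    def iter_lines(handle):
        # The generator never owns the file, it only reads from it
        for line in handle:
            yield line

    with open(path) as handle:   # the caller controls the handle's lifetime
        for line in iter_lines(handle):
            ...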

> and we have fancy tools like the ResourceWarning machinery to help us
> catch these bugs.
>
> Here's the analogous example for async generators. This is a useful,
> realistic async generator, that lets us incrementally read from a TCP
> connection that streams newline-separated JSON documents:
>
>   async def read_json_lines_from_server(host, port):
>       async for line in (await asyncio.open_connection(host, port))[0]:
>           yield json.loads(line)
>
> You would expect to use this like:
>
>   async for data in read_json_lines_from_server(host, port):
>       ...

The actual synchronous equivalent to this would look more like:

    def read_data_from_file(path):
        with open(path) as f:
            for line in f:
                yield line

(Assume we're doing something interesting to each line, rather than
reproducing normal file iteration behaviour)

And that has the same problem as your asynchronous example: the caller
needs to worry about resource management on the generator and do:


    from contextlib import closing

    with closing(read_data_from_file(path)) as itr:
        for line in itr:
            ...

Which means the problem causing your concern doesn't arise from the
generator being asynchronous - it comes from the fact that the
generator actually *needs* to hold the FD open in order to work as
intended (if it didn't, then the code wouldn't need to be
asynchronous).
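
The same "let the caller own the resource" structure works in the
asynchronous case too. A rough sketch, assuming the asyncio streams
API from your example (the helper names are only illustrative):

    import asyncio
    import json

    async def json_lines(reader):
        # Only consumes the stream, never owns the connection
        while True:
            line = await reader.readline()
            if not line:
                break
            yield json.loads(line)

    async def main(host, port):
        reader, writer = await asyncio.open_connection(host, port)
        try:
            async for data in json_lines(reader):
                ...  # process each decoded document
        finally:
            writer.close()   # the caller closes the connection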

> BUT, with the current PEP 525 proposal, trying to use this generator
> in this way is exactly analogous to the open(path).read() case: on
> CPython it will work fine -- the generator object will leave scope at
> the end of the 'async for' loop, cleanup methods will be called, etc.
> But on PyPy, the weakref callback will not be triggered until some
> arbitrary time later, you will "leak" file descriptors, and your
> server will crash.

That suggests the PyPy GC should probably be tracking pressure on more
resources than just memory when deciding whether or not to trigger a
GC run.

> For correct operation, you have to replace the
> simple 'async for' loop with this lovely construct:
>
>   async with aclosing(read_json_lines_from_server(host, port)) as ait:
>       async for data in ait:
>           ...
>
> Of course, you only have to do this on loops whose iterator might
> potentially hold resources like file descriptors, either currently or
> in the future. So... uh... basically that's all loops, I guess? If you
> want to be a good defensive programmer?

At that level of defensiveness in asynchronous code, you need to start
treating all external resources (including file descriptors) as a
managed pool, just as the standard library provides process and thread
pools, and many database and networking libraries offer connection
pooling. It limits your per-process concurrency, but that limit exists
anyway at the operating system level - modelling it explicitly just
lets you manage how the application handles it.
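
A minimal sketch of what such a pool can look like on top of asyncio
(the class and default size are illustrative assumptions, not an
existing API):

    import asyncio

    class ConnectionPool:
        def __init__(self, host, port, size=10):
            self._host, self._port = host, port
            # The semaphore caps how many connections are open at once
            self._sem = asyncio.Semaphore(size)

        async def acquire(self):
            await self._sem.acquire()
            return await asyncio.open_connection(self._host, self._port)

        def release(self, writer):
            writer.close()
            self._sem.release()

Hitting the cap then shows up as a task waiting in acquire() rather
than as the process running out of file descriptors.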

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

