[Async-sig] APIs for high-bandwidth large I/O?

Wed Oct 18 14:04:41 EDT 2017

Hi,

I am currently looking into ways to optimize large data transfers for a
distributed computing framework
(https://github.com/dask/distributed/).  We are using Tornado but the
question is more general, as it turns out that certain kinds of API are
an impediment to such optimizations.

In short, a couple of benchmarks are discussed here:
https://github.com/tornadoweb/tornado/issues/2147#issuecomment-337187960

- for Tornado, this benchmark:
https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0
- for asyncio, this benchmark:
https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc

Both implement a trivial form of framing using the "preferred" APIs of
each framework (IOStream for Tornado, Protocol for asyncio), and then
benchmark it over 100 MB frames using a simple echo client/server.

The results (on Python 3.6) are interesting:
- vanilla asyncio achieves 350 MB/s
- vanilla Tornado achieves 400 MB/s
- asyncio + uvloop achieves 600 MB/s
- an optimized Tornado IOStream with a more sophisticated buffering
  logic (https://github.com/tornadoweb/tornado/pull/2166)
  achieves 700 MB/s

The last result is especially interesting.  uvloop uses hand-crafted
Cython code plus the C libuv library; still, a pure-Python version of
Tornado does better thanks to improved buffering logic in the
streaming layer.

Even that Tornado result is not ideal.  When profiling, we see that
50% of the runtime is spent in actual IO calls (socket.send and
socket.recv), but the rest is still overhead.  In particular, buffering
on the read side still incurs costly memory copies (b''.join calls take
22% of the time!).
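
To illustrate where such copies come from, here is a minimal sketch of
chunk-accumulating read-side buffering for a length-prefixed frame.  It
is not the code of Tornado or of the benchmark gists: the class name
FrameAccumulator, the 8-byte big-endian length header and the `frames`
list are assumptions made for the example.

```python
import struct

HEADER = struct.Struct(">Q")  # assumed 8-byte big-endian length prefix


class FrameAccumulator:
    """Collects incoming chunks and joins them once enough data is buffered."""

    def __init__(self):
        self._chunks = []
        self._size = 0
        self._need = HEADER.size   # bytes required for the next item
        self._in_header = True     # next item is a header, not a payload
        self.frames = []           # completed payloads

    def data_received(self, data):
        # `data` is already a freshly allocated bytes object.
        self._chunks.append(data)
        self._size += len(data)
        while self._size >= self._need:
            buf = b"".join(self._chunks)          # copies all buffered bytes
            item = buf[:self._need]               # slicing usually copies again
            rest = buf[self._need:]
            self._chunks = [rest] if rest else []
            self._size = len(rest)
            if self._in_header:
                (self._need,) = HEADER.unpack(item)
                self._in_header = False
            else:
                self.frames.append(item)
                self._need = HEADER.size
                self._in_header = True
```

Every payload byte is copied at least once by the join, on top of the
allocation already made for each incoming chunk.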

For a framed layer, you shouldn't need so many copies.  Once you've
read the frame length, you can allocate the frame upfront and read
directly into it.  This is at odds, however, with the API exposed by
asyncio's Protocol: data_received() hands you a new bytes object as
soon as data arrives.  By then it's already too late: a spurious memory
copy will have to occur.
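
The "allocate upfront and read into it" idea can be sketched with a
plain blocking socket and socket.recv_into().  This is only an
illustration under assumptions: the helper names recv_exactly_into and
read_frame are hypothetical, and the 8-byte big-endian length header is
an assumed wire format, not necessarily the one used in the benchmarks.

```python
import socket
import struct

HEADER = struct.Struct(">Q")  # assumed 8-byte big-endian length prefix


def recv_exactly_into(sock, view):
    """Fill the writable memoryview `view` completely from `sock`."""
    pos = 0
    while pos < len(view):
        # recv_into() writes straight into our buffer: no intermediate bytes.
        n = sock.recv_into(view[pos:])
        if n == 0:
            raise ConnectionError("peer closed before the frame was complete")
        pos += n


def read_frame(sock):
    """Read one length-prefixed frame into a buffer allocated once."""
    header = bytearray(HEADER.size)
    recv_exactly_into(sock, memoryview(header))
    (length,) = HEADER.unpack(header)
    frame = bytearray(length)  # allocated upfront, once the length is known
    recv_exactly_into(sock, memoryview(frame))
    return frame
```

For example, given a connected pair from socket.socketpair(), sending
HEADER.pack(5) + b"hello" on one end and calling read_frame() on the
other returns a bytearray holding the payload.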

Tornado's IOStream is less constrained, but it supports too many read
schemes (including several types of callbacks).  So I crafted a limited
version of IOStream (*) that supports little functionality, but is able
to use socket.recv_into() when asked for a given number of bytes.  When
benchmarked, this version achieves 950 MB/s. This is still without C
code!

(*) see
https://github.com/tornadoweb/tornado/compare/master...pitrou:stream_readinto?expand=1

When profiling that limited version of IOStream, we see that 68% of the
runtime is actual IO calls (socket.send and socket.recv_into).
Still, 21% of the total runtime is spent allocating a 100 MB buffer for
each frame!  That's about two thirds of the non-IO overhead (21 of the
remaining 32 percentage points)!  Whether or not there are smart ways
to reuse that writable buffer depends on how the application intends to
use the data: does it throw it away before the next read or not?  It
doesn't sound easily doable in the general case.
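
To make the trade-off concrete, here is a minimal sketch of one way to
reuse the writable buffer.  The class name ReusableFrameBuffer and its
get() method are hypothetical; the sketch assumes the application is
done with a frame before it asks for the next one.

```python
class ReusableFrameBuffer:
    """Hands out views on one bytearray instead of a new buffer per frame."""

    def __init__(self):
        self._buf = bytearray()

    def get(self, nbytes):
        """Return a writable memoryview of exactly `nbytes` bytes."""
        if len(self._buf) < nbytes:
            # Grow only when the current buffer is too small.
            self._buf = bytearray(nbytes)
        # Slicing a memoryview does not copy the underlying memory.
        return memoryview(self._buf)[:nbytes]
```

The returned view can be passed to socket.recv_into().  The catch is
the one described above: a view obtained from an earlier get() call
shares memory with later ones, so a frame that the application still
holds is overwritten by the next read unless it was copied first.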

So I'm wondering what kind of APIs async libraries could expose to
make these use cases faster.  I know curio and trio have socket objects
which would probably fit the bill.  I don't know if there are
higher-level concepts that would be equally adequate for achieving the
highest performance.

Also, since asyncio is the de facto standard now, I wonder if asyncio
might grow such a new API.  That may be troublesome: asyncio already
has Protocols and Streams, and people often complain about its
extensive API surface that's difficult for beginners :-)

Addendum: asyncio streams
-------------------------

I didn't think asyncio streams would be a good solution, but I still
wrote a benchmark variant for them out of curiosity, and it turns out I
was right.  The results:
- vanilla asyncio streams achieve 300 MB/s
- asyncio + uvloop streams achieve 550 MB/s

The benchmark script is at
https://gist.github.com/pitrou/202221ca9c9c74c0b48373ac89e15fd7

Regards

Antoine.