High performance IO on non-blocking sockets

Dave Brueck dave at pythonapocrypha.com
Fri Mar 14 12:45:32 EST 2003


On Fri, 14 Mar 2003, Troels Walsted Hansen wrote:

> I'm trying to do IO on non-blocking sockets (within the asyncore framework),
> and it seems to me that Python lacks a few primitives that would make this
> more efficient.
>
> Let's begin with socket writes. Assume that I have a string called self.data
> that I want to send on a non-blocking socket. This string can be anywhere
> from 1 to hundreds of megabytes.

Yes, although with a little analysis you can generally know whether your
write is closer to 1 byte or hundreds of megabytes, and take different
approaches accordingly - see below.

> This is a classic approach, seen in many Python examples:
>
>   sent = self.socket.send(self.data)
>   self.data = self.data[sent:]
>
> This approach is bad because self.data gets reallocated for every socket
> send. Worst case, 1 byte is sent each time and the realloc+copy cost goes
> through the roof.

Yes - to achieve truly high performance (BTW - how high do you need?) you
need to pay attention to what sort of writing you're doing. Is it
HTTP-like traffic or a custom protocol? How many simultaneous connections
do you need to support? Are they likely to be LAN-speed connections,
DSL-speed, modem, or some mix?

At the company I work for we have several different custom HTTP servers,
and we saw huge performance gains when we started grouping the types of
I/O according to size and acting on them differently. For example, in the
hundreds of megabytes (or even half a megabyte) cases, it's likely that
the data you're writing is coming off the disk. Our servers primarily run
on Linux, so we created a tiny C extension module that calls the sendfile
API; in the cases where a large chunk of data is coming off disk, we call
sendfile so that the data never even makes it into Python (or our
process's memory space, for that matter). On platforms without a sendfile
C API, the call gets routed to a simulated sendfile (all Python) instead.
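
A rough sketch of that dispatch (hypothetical names; os.sendfile is how
modern Python exposes the call directly, and the fallback here is my own
simplification of a simulated sendfile):

    import os

    def send_file_chunk(sock, f, offset, count):
        # Use the kernel's sendfile when the platform exposes it, so
        # the file data never enters our process's memory.
        if hasattr(os, 'sendfile'):
            return os.sendfile(sock.fileno(), f.fileno(), offset, count)
        # Simulated sendfile, all Python: read one chunk and send it.
        f.seek(offset)
        return sock.send(f.read(min(count, 64 * 1024)))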

Anyway, with sendfile we hit some crazy performance levels - a PII (<500
MHz) easily sustains 300 Mbps of throughput for hundreds of simultaneous
DSL-like connections, for example, and a PIII (~900 MHz) has passed 1.5
Gbps over the loopback adapter.

For our work, it's quite unlikely that we _ever_ send out 1 byte of
anything, but we do see lots of cases (like building HTTP response
headers) where there are many little chunks. In those situations we build
up a list of little strings, ''.join() them, and send them out as one
chunk.
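
Something along these lines, in other words (a sketch with made-up
values, written in modern bytes spelling; sock is assumed to be a
connected socket):

    content_length = 1024  # made-up value for illustration
    pieces = [b'HTTP/1.0 200 OK\r\n',
              b'Content-Length: %d\r\n' % content_length,
              b'\r\n']
    sock.send(b''.join(pieces))  # one send() instead of three tiny ones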

One idea we've considered but not pursued is using buffer() objects to
avoid the send-a-piece-then-copy-the-substring problem you identified.
We haven't gone down that path very far yet because sendfile has helped
immensely, and because we maintain our outgoing queues as lists of
strings: we keep each queue as a list until right before sending, at
which point we combine just enough strings to create a chunk large
enough to fill the output buffer of the socket, but (hopefully) not too
much more.
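
A sketch of that queue handling (hypothetical names, modern bytes
spelling):

    SEND_TARGET = 64 * 1024  # assumed rough size of the socket send buffer

    def send_from_queue(sock, outq):
        # Join just enough queued chunks to fill one output buffer.
        batch, size = [], 0
        while outq and size < SEND_TARGET:
            chunk = outq.pop(0)
            batch.append(chunk)
            size += len(chunk)
        data = b''.join(batch)
        sent = sock.send(data)
        if sent < len(data):
            # Put the unsent tail back at the head of the queue. A
            # buffer()/memoryview window over data would avoid even
            # this one copy - that's the unpursued idea above.
            outq.insert(0, data[sent:])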

One other thing: we stopped using asyncore/asynchat early on because it
was too low-level and too slow for what we needed (although it works fine
for many other uses). We rolled our own async socket framework, and in
the process got to build something that works particularly well for HTTP
traffic.

> Now for recv operations on non-blocking sockets.

The recv side of things has always been slow (relatively speaking) for us,
second only to proxying (which, of course, relies directly on our recv
code), so I'd be really interested in any insights you have here!

> Assume that I want to read a known number of bytes (total_recv_size)
> from a socket and assemble the result as a Python string called
> self.data (again, think anywhere from 1 to hundreds of megabytes of
> data).

Again, though, how you read the data can benefit from hints about what
you'll be doing with it.

For example, when we're proxying between two sockets we leave the data in
a list of chunks because our sending code can use it in that form anyway.
When we're receiving an upload, we don't really want a buffer the size of
the entire upload in memory, because we're just going to be tossing the
data to disk.
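
So the read handler for an upload can be as simple as this (a sketch;
handle_close and upload_file are hypothetical names):

    def handle_upload_read(self):
        chunk = self.socket.recv(64 * 1024)
        if not chunk:
            self.handle_close()  # peer is done sending
            return
        # Each chunk goes straight to disk; the whole upload is never
        # held in memory at once.
        self.upload_file.write(chunk)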

Still, though, I do wish there were a better way to do the receives:
even with the data left in a chunked list, our proxying is slow, and the
primary bottleneck appears to be the recv side of things.

>   self.data = []
> ...
>   # following code runs when socket is read-ready
>   recv_size = 64*1024 # for example
>   data = self.socket.recv(recv_size)
>   self.data.append(data)
> ...
>   self.data = ''.join(self.data)

This is the approach we use, except that we never do the final ''.join
(well, our framework doesn't; the application might if it makes sense),
because as a list the data is already in a suitable form for writing to
disk or handing off to the send code.
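
Roughly like this, that is (a sketch; peer and outq are hypothetical
names, and the send side is the queue-coalescing code sketched earlier):

    def handle_proxy_read(self):
        chunk = self.socket.recv(64 * 1024)
        if chunk:
            # No join: the peer connection's outgoing queue takes the
            # chunks as-is and combines them only at send time.
            self.peer.outq.append(chunk)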

> Have I overlooked any better approaches?

Please let me know when you discover the magic solution. I'd sure like to
know what it is. :)

-Dave
