Flush stdin

Mon Oct 20 20:30:13 EDT 2014

On Mon, Oct 20, 2014 at 4:18 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Dan Stromberg <drsalists at gmail.com>:
>> ...then everything acts line buffered, or perhaps even character
>> buffered [...]
>>
>> That, or we're using two different versions of netcat (there are at
>> least two available).
>
> Let's unconfuse the issue a bit. I'll take line buffering, netcat and
> the OS out of the picture.
>
> Here's a character generator (test.sh):
> ========================================================================
> while : ; do
>     echo -n x
>     sleep 1
> done
> ========================================================================
>
> and here's a character sink (test.py):
> ========================================================================
> import sys
> while True:
>     c = sys.stdin.read(1)
>     if not c:
>         break
>     print(ord(c[0]))
> ========================================================================
>
> Then, I run:
> ========================================================================
> $ bash ./test.sh | python3 ./test.py
> 120
> 120
> 120
> 120
> ========================================================================
>
> The lines are output at one-second intervals.
>
> That demonstrates that sys.stdin.read(1) does not block for more than
> one character. IOW, there is no buffering whatsoever.

Aren't character-buffered and unbuffered synonymous?

Often with TCP protocols, line buffered is preferred to character
buffered, both for performance and for simplicity: it doesn't suffer
from tinygrams (as much), and telnet becomes a useful test client.

Also, it's a straightforward way of framing your data, to avoid
getting messed up by Nagle or fragmentation.  One might find
http://stromberg.dnsalias.org/~strombrg/bufsock.html worth a glance.
It's buffered, but it keeps things framed, and doesn't fall prey to
tinygrams nearly as much as character buffering.
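
To make the framing point concrete, here's a minimal sketch of newline
framing over a stream socket.  It is not the bufsock code; read_lines()
and demo() are names I made up for illustration, and socket.socketpair()
merely stands in for a real TCP connection.

```python
import socket

def read_lines(sock, bufsize=4096):
    """Yield newline-terminated messages from sock, one per line."""
    pending = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:              # peer closed the connection
            break
        pending += chunk
        # A single recv() may hold zero, one, or several complete lines.
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:                    # trailing data without a final newline
        yield pending

def demo():
    a, b = socket.socketpair()
    # Send two messages cut at arbitrary points, as the network might.
    for piece in (b"hel", b"lo\nwor", b"ld\n"):
        a.sendall(piece)
    a.close()
    result = list(read_lines(b))
    b.close()
    return result

if __name__ == "__main__":
    print(demo())
```

However the bytes happen to be chopped up in transit, the receiver only
ever hands back whole lines, which is the property that matters here.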

> If I change the sink a bit: "c = sys.stdin.read(5)", I get the same
> output but at five-second intervals indicating that sys.stdin.read()
> calls the underlying os.read() function five times before returning. In
> fact, that conclusion is made explicit by running:
>
> ========================================================================
> $ bash ./test.sh | strace python3 ./test.py
> ...
> read(0, "x", 4096)                      = 1
> read(0, "x", 4096)                      = 1
> read(0, "x", 4096)                      = 1
> read(0, "x", 4096)                      = 1
> read(0, "x", 4096)                      = 1
> fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3143bab000
> write(1, "120\n", 4120
> )                    = 4
> ...
> ========================================================================

This is tremendously inefficient.  It demands a system call, and
usually a context switch, for every character.
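
To put a rough number on that, here's a sketch that counts os.read()
calls when draining a pipe.  count_reads() is a hypothetical helper, and
it assumes the data is already sitting in the pipe, which is not the
case for the one-character-per-second generator above.

```python
import os

def count_reads(payload, request_size):
    """Return how many os.read() calls it takes to drain payload from a pipe."""
    r, w = os.pipe()
    os.write(w, payload)       # payload is small enough to fit in the pipe
    os.close(w)                # so the reader eventually sees EOF
    calls = 0
    while True:
        data = os.read(r, request_size)
        calls += 1             # also counts the final read that returns b""
        if not data:
            break
    os.close(r)
    return calls

if __name__ == "__main__":
    payload = b"x" * 1000
    print("1 byte per read:   ", count_reads(payload, 1))
    print("4096 bytes per read:", count_reads(payload, 4096))
```

With 1000 bytes waiting, the two counts should come out as 1001 and 2,
the extra call in each case being the read that reports EOF.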

> If I modify test.py to call os.read():
> ========================================================================
> import os
> while True:
>     c = os.read(0, 5)
>     if not c:
>         break
>     print(ord(c[0]))
> ========================================================================
>
> The output is again printed at one-second intervals: no buffering.
>
> Thus, we are back at my suggestion: use os.read() if you don't want
> Python to buffer stdin for you.

It's true that Python won't buffer (or will be character-buffered)
then, but that takes some potentially-salient elements out of the
picture.  IOW, I don't think Python reading unbuffered is necessarily
the whole issue, and may even be going too far.
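
One wrinkle with the os.read() sink quoted above: on Python 3, os.read()
returns bytes, so c[0] is already an int and ord() raises TypeError.  A
sketch of a variant that copes with that follows; sink() is my own name
for it, and the fd and size defaults are just the values from the quoted
example.

```python
import os

def sink(fd=0, size=5):
    """Read fd without Python-level buffering; yield each byte value seen."""
    while True:
        chunk = os.read(fd, size)   # returns as soon as any data is available
        if not chunk:               # b"" means end of file
            break
        for value in chunk:         # iterating bytes yields ints on Python 3
            yield value
```

Looping over sink() and calling print(value, flush=True) for each item
gives the same sort of output as the quoted test.py.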

I have a habit of saying "necessary, but not necessarily sufficient",
but in this case I believe it's more of a "not necessarily necessary,
and not necessarily sufficient".  A lot depends on the other pieces of
the puzzle that you've chosen to "unconfuse" away.  Yes, you can make
Python unbuffered/character-buffered, but that's not the whole story.