[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Mon Sep 5 12:41:22 EDT 2016

On 5 September 2016 at 14:36, Steve Dower <steve.dower at python.org> wrote:
> The best fix is to use a buffered reader, which will read all the available
> bytes and then let you .read(1), even if it happens to be an incomplete
> character.

But this is sys.stdin.buffer.raw we're talking about. People can't
really layer anything on top of that. The problem here is precisely
that they are trying to *bypass* the existing layering, which doesn't
work the way they need it to because it blocks.

> We could theoretically add buffering to the raw reader to handle one character,
> which would allow very small reads from raw, but that severely complicates
> things and the advice to use a buffered reader is good advice anyway.

Can you provide an example of how I'd rewrite the code that I quoted
previously to follow this advice? Note that this is not theoretical: I
expect to have to provide a PR to fix exactly this code should this
change go in. At the moment I can't find a way that doesn't impact the
(currently working and not expected to need any change) Unix version
of the code. Most likely I'll have to add buffering of 4-byte reads
(which, as you say, is complex).
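
For concreteness, the kind of application-level buffering I mean might
look roughly like the sketch below. It is only an illustration, not the
code I quoted: MinReadBuffer is a hypothetical name, the 4-byte minimum
is taken from the discussion above, io.BytesIO stands in for
sys.stdin.buffer.raw, and a blocking stream is assumed.

```python
import io


class MinReadBuffer:
    """Serve small reads from a raw stream that needs reads of >= min_read bytes.

    Hypothetical helper, for illustration only.
    """

    def __init__(self, raw, min_read=4):
        self._raw = raw
        self._min_read = min_read
        self._pending = b""

    def read(self, size=1):
        # Only go to the raw stream when nothing is buffered, and never
        # ask it for fewer than min_read bytes.
        if not self._pending:
            data = self._raw.read(max(size, self._min_read))
            if not data:
                return b""  # EOF (a blocking stream is assumed)
            self._pending = data
        result, self._pending = self._pending[:size], self._pending[size:]
        return result


# io.BytesIO stands in for the raw console stream here.
reader = MinReadBuffer(io.BytesIO("é!".encode("utf-8")))
first = reader.read(1)   # lead byte of the two-byte sequence for "é"
second = reader.read(1)  # its continuation byte, served from the buffer
third = reader.read(1)   # the ASCII "!"
```

Even this small wrapper has to be threaded through every place that
currently calls read() on the raw stream, which is the maintenance cost
I'm worried about.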

The problem I have is that we're forcing application code to do the
buffering to cater for Windows (where you're proposing that the raw IO
layer doesn't handle it and will potentially fail reads of <4 bytes).
Code written for POSIX doesn't need to do that, and the additional
maintenance overhead is potentially large enough to put POSIX
developers off adding the necessary code. That is in direct contrast
to the proposal to make fsencoding UTF-8, which is meant to make it
easier for POSIX-compatible code to "just work" on Windows.

If the goals are to handle Unicode correctly for stdin, and to work in
a way that POSIX-compatible code works without special effort on
Windows, then as far as I can see we have to handle the buffering of
partial reads of UTF-8 code sequences (because POSIX does so). If, on
the other hand, we just want Unicode to work on Windows, and we're not
looking for POSIX code to work without change, then the proposed
behaviour is OK (although I still maintain it needs to be flagged, as
it's very close to being a compatibility break in practice, even if
it's technically within the rules).

Paul

PS I'm not 100% sure that under POSIX read() will return partial UTF-8
byte sequences. I think it must, because otherwise a lot of code I've
seen would be broken, but if a POSIX expert can confirm or deny my
assumption, that would be great.
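
One way to check this is a small experiment like the following. It is a
sketch under assumptions: it uses os.read (a thin wrapper over the
underlying read call) on a pipe rather than on a terminal, so it shows
the byte-oriented behaviour of read() in general and not anything
specific to console input.

```python
import os

# Write the two-byte UTF-8 encoding of "é" to a pipe, then read it back
# one byte at a time.
read_fd, write_fd = os.pipe()
os.write(write_fd, "é".encode("utf-8"))
os.close(write_fd)

first = os.read(read_fd, 1)  # lead byte only: an incomplete sequence
rest = os.read(read_fd, 1)   # the continuation byte
os.close(read_fd)
```

Here the first read hands back a lone lead byte, which is not valid
UTF-8 on its own, and the second read returns the continuation byte.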