[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Tue Sep 6 06:34:01 EDT 2016

On 5 September 2016 at 21:40, eryk sun <eryksun at gmail.com> wrote:
> On Mon, Sep 5, 2016 at 7:54 PM, Steve Dower <steve.dower at python.org> wrote:
>> On 05Sep2016 1234, eryk sun wrote:
>>> It would probably be simpler to use UTF-16 in the main pipeline and
>>> implement Martin's suggestion to mix in a UTF-8 buffer. The UTF-16
>>> buffer could be renamed as "wbuffer", for expert use. However, if
>>> you're fully committed to transcoding in the raw layer, I'm certain
>>> that these problems can be addressed with small buffers and using
>>> Python's codec machinery for a flexible mix of "surrogatepass" and
>>> "replace" error handling.
>>
>> I don't think it actually makes things simpler. Having two buffers is
>> generally a bad idea unless they are perfectly synced, which would be
>> impossible here without data corruption (if you read half a utf-8 character
>> sequence and then read the wide buffer, do you get that character or not?).
>
> Martin's idea, as I understand it, is a UTF-8 buffer that reads from
> and writes to the text wrapper.

Yes, that was basically it. Though I had only thought as far as simple
encodings like ASCII, where one byte corresponds to one character. I
wonder if you really need UTF-8 support. Are the encoding values
currently encountered for Windows consoles all single-byte encodings
or are they more complicated?
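
One rough way to check, as a sketch only: encode a sample character in a
few code pages and count the bytes. The code pages and characters below
are illustrative picks on my part, not a survey of what consoles
actually report; cp932 (Japanese) is included as an example of a code
page that is not single-byte.

```python
# Illustrative sample only: the code pages and characters are assumptions,
# not a list of what Windows consoles actually report.
samples = [("cp437", "é"), ("cp1252", "é"), ("cp932", "あ")]

# Number of bytes each sample character occupies in its code page.
widths = {cp: len(ch.encode(cp)) for cp, ch in samples}

for cp, n in widths.items():
    print(f"{cp}: {n} byte(s) per sample character")
```

If any console code page needs more than one byte per character, the
one-byte-per-character shortcut does not hold in general.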

> It necessarily consumes at least one
> character and buffers it to allow reading per byte. Likewise for
> writing, it buffers bytes until it can write a character to the text
> wrapper. ISTM, it has to look for incomplete lead-continuation byte
> sequences at the tail end, to hold them until the sequence is
> complete, at which time it either decodes to a valid character or the
> U+FFFD replacement character.

This buffering behaviour would be necessary for multi-byte encodings
like UTF-8.
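
A minimal sketch of the write side, assuming the standard library's
incremental decoder is an acceptable way to hold back incomplete
sequences. The ByteToTextBuffer class and its write_text callback are
hypothetical names for illustration, not anything in the proposed
implementation.

```python
import codecs


class ByteToTextBuffer:
    """Hypothetical helper: accepts bytes, forwards only complete characters."""

    def __init__(self, write_text, encoding="utf-8", errors="replace"):
        # write_text stands in for the text wrapper's write() method.
        self._write_text = write_text
        self._decoder = codecs.getincrementaldecoder(encoding)(errors=errors)

    def write(self, data):
        # The incremental decoder keeps any incomplete trailing sequence
        # internally until later bytes complete it.
        text = self._decoder.decode(data)
        if text:
            self._write_text(text)
        return len(data)

    def flush(self):
        # final=True forces out whatever is still pending; with
        # errors="replace" an unfinished sequence becomes U+FFFD.
        text = self._decoder.decode(b"", final=True)
        if text:
            self._write_text(text)


out = []
buf = ByteToTextBuffer(out.append)
buf.write(b"caf\xc3")    # lead byte of a two-byte sequence is held back
buf.write(b"\xa9")       # continuation byte completes the character
buf.write(b"\xe2\x82")   # incomplete three-byte sequence, held back
buf.flush()              # nothing more is coming, so it is replaced
```

For a single-byte encoding the decoder never has anything to hold back,
so the same wrapper degenerates to a plain pass-through.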