[Python-ideas] Fix default encodings on Windows

eryk sun eryksun at gmail.com
Fri Aug 12 22:44:00 EDT 2016


On Fri, Aug 12, 2016 at 2:20 PM, Random832 <random832 at fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> * force the console encoding to UTF-8 on initialize and revert on
>> finalize
>>
>> So what are your concerns? Suggestions?
>
> As far as I know, the single biggest problem caused by the status quo
> for console encoding is "some string containing characters not in the
> console codepage is printed out; unhandled UnicodeEncodeError". Is there
> any particular reason not to use errors='replace'?

If that's all you want then you can set PYTHONIOENCODING=:replace.
Prepare to be inundated with question marks.
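
To make the trade-off concrete, here is a minimal sketch of what the
'replace' error handler does to a text stream. It does not touch the
real console; the helper name encode_like_console and the choice of
cp437 are assumptions made purely for illustration.

```python
import io

def encode_like_console(text, encoding='cp437', errors='strict'):
    """Write text through a TextIOWrapper and return the bytes produced.

    Hypothetical helper: it stands in for sys.stdout with a given
    encoding and error handler, using an in-memory buffer.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                              newline='')
    stream.write(text)
    stream.flush()
    stream.detach()  # keep the BytesIO open after the wrapper goes away
    return raw.getvalue()

# Neither character exists in cp437, so each one becomes a question mark.
print(encode_like_console('Ā ж', errors='replace'))
```

With errors='strict' the same call raises UnicodeEncodeError, which is
the status quo being discussed.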

Python's 'cp*' encodings are cross-platform, so they don't call
Windows NLS APIs. If you want a best-fit encoding, then 'mbcs' is the
only choice. Use chcp.com to switch to your system's ANSI codepage and
set PYTHONIOENCODING=mbcs:replace.

An 'oem' encoding could be added, but I'm no fan of these best-fit
encodings. Writing question marks at least hints that the output is
wrong.

> Is there any particular reason for the REPL, when printing the repr of a
> returned object, not to replace characters not in the stdout encoding
> with backslash sequences?

sys.displayhook already does this. It falls back on
sys_displayhook_unencodable if printing the repr raises a
UnicodeEncodeError.
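
The idea behind that fallback can be sketched in pure Python. This is
not the CPython implementation, only an illustration of the same
approach; repr_for_stream is a made-up name and the encoding is passed
in explicitly instead of being read from sys.stdout.

```python
def repr_for_stream(value, encoding):
    """Return repr(value), escaping whatever the encoding cannot represent.

    Hypothetical helper mirroring the fallback idea: try the plain repr
    first, and only on failure re-encode with 'backslashreplace'.
    """
    text = repr(value)
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        # Unencodable characters become \xNN, \uNNNN or \UNNNNNNNN escapes.
        return text.encode(encoding, 'backslashreplace').decode(encoding)

print(repr_for_stream('Ā', 'cp437'))
```

The result is always encodable in the target encoding, so writing it
to the stream cannot raise a second UnicodeEncodeError.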

> Does Python provide any mechanism to access the built-in "best fit"
> mappings for windows codepages (which mostly consist of removing accents
> from latin letters)?

As mentioned above, for output this is only available with 'mbcs'. For
reading input via ReadFile or ReadConsoleA (and thus also C _read,
fread, and fgets), the console already encodes its UTF-16 input buffer
using a best-fit encoding to the input codepage. So there's no error
in the following example, even though the result is wrong:

    >>> sys.stdin.encoding
    'cp437'
    >>> s = 'Ā'
    >>> s, ord(s)
    ('A', 65)

Jumping back to the codepage 65001 discussion, here's a function to
simulate the bad output that Windows Vista and 7 users see:

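    def write(text):
        writes = []
        # The console sees CRLF line endings encoded as UTF-8 bytes.
        buffer = text.replace('\n', '\r\n').encode('utf-8')
        while buffer:
            decoded = buffer.decode('utf-8', 'replace')
            # The console reports characters written, not bytes, so the
            # caller resumes too early and re-sends the tail of the buffer.
            buffer = buffer[len(decoded):]
            writes.append(decoded.replace('\r', '\n'))
        return ''.join(writes)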

For example:

    >>> greek = 'αβγδεζηθι\n'
    >>> write(greek)
    'αβγδεζηθι\n\n�ηθι\n\n�\n\n'

It gets worse with characters that require 3 bytes in UTF-8:

    >>> devanagari = 'ऄअआइईउऊऋऌ\n'
    >>> write(devanagari)
    'ऄअआइईउऊऋऌ\n\n�ईउऊऋऌ\n\n��ऋऌ\n\n��\n\n'

This problem doesn't exist in Windows 8+ because the old LPC-based
communication with the console (LPC is an undocumented protocol that's
used extensively for IPC between Windows subsystems) was rewritten to
use a kernel driver (condrv.sys). Now it works like any other device
by calling NtReadFile, NtWriteFile, and NtDeviceIoControlFile.
Apparently in the rewrite someone fixed the fact that the conhost code
that handles WriteFile and WriteConsoleA was incorrectly returning the
number of UTF-16 code units written instead of the number of bytes.
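
The effect of that mis-reported count can be traced step by step. The
sketch below is only a model of the behavior described here, not of
the actual conhost code; trace_writes is a hypothetical name, and it
assumes the caller retries from the offset implied by the reported
count.

```python
def trace_writes(text):
    """Return (offset, bytes_sent, count_reported) for each simulated write."""
    data = text.replace('\n', '\r\n').encode('utf-8')
    offset = 0
    trace = []
    while offset < len(data):
        chunk = data[offset:]
        decoded = chunk.decode('utf-8', 'replace')
        # Assumption: the old console reports UTF-16 code units, not bytes.
        reported = len(decoded.encode('utf-16-le')) // 2
        trace.append((offset, len(chunk), reported))
        # The caller believes only `reported` bytes were written and retries.
        offset += reported
    return trace

for step in trace_writes('αβγδεζηθι\n'):
    print(step)
```

For the Greek string this gives three writes: 20 bytes sent with 11
reported, then 9 bytes with 6 reported, then 3 bytes with 3 reported.
The second and third writes start in the middle of a two-byte
sequence, which is where the replacement characters come from.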

Unfortunately the rewrite also broke Ctrl+C handling because ReadFile
no longer sets the last error to ERROR_OPERATION_ABORTED when a
console read is interrupted by Ctrl+C. I'm surprised so few Windows
users have noticed or cared that Ctrl+C kills the REPL and misbehaves
with input() in the Windows 8/10 console. The source of the Ctrl+C bug
is an incorrect NTSTATUS code STATUS_ALERTED, which should be
STATUS_CANCELLED. The console has always done this wrong, but before
the rewrite there was common code for ReadFile and ReadConsole that
handled STATUS_ALERTED specially. It's still there in ReadConsole, so
Ctrl+C handling works fine in Unicode programs that use ReadConsoleW
(e.g. cmd.exe, powershell.exe). It also works fine if
win_unicode_console is enabled.

Finally, here's a ctypes example in Windows 10.0.10586 that shows the
unsolvable problem with non-ASCII input when using codepage 65001:

    import ctypes, msvcrt
    conin = open(r'\\.\CONIN$', 'r+')
    hConin = msvcrt.get_osfhandle(conin.fileno())
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    nread = (ctypes.c_uint * 1)()

ASCII-only input works:

    >>> buf = (ctypes.c_char * 100)()
    >>> kernel32.ReadFile(hConin, buf, 100, nread, None)
    spam
    1
    >>> nread[0], buf.value
    (6, b'spam\r\n')

But it returns EOF if "a" is replaced by Greek "α":

    >>> buf = (ctypes.c_char * 100)()
    >>> kernel32.ReadFile(hConin, buf, 100, nread, None)
    spαm
    1
    >>> nread[0], buf.value
    (0, b'')

Notice that the read is successful but nread is 0. That signifies EOF.
So the REPL will just silently quit as if you entered Ctrl+Z, and
input() will raise EOFError. This can't be worked around. The problem
is in conhost.exe, which assumes a request for N bytes wants N UTF-16
code units from the input buffer. With UTF-8 that only works for ASCII,
where each character is exactly one byte.
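
The mismatch is easy to see by counting both ways for the two lines
typed above. This is plain arithmetic on the strings and does not
reproduce what conhost.exe does internally; unit_and_byte_counts is a
name invented for this sketch.

```python
def unit_and_byte_counts(line):
    """Return (UTF-16 code units, UTF-8 bytes) for a line of input."""
    utf16_units = len(line.encode('utf-16-le')) // 2
    utf8_bytes = len(line.encode('utf-8'))
    return utf16_units, utf8_bytes

print(unit_and_byte_counts('spam\r\n'))   # counts agree for ASCII
print(unit_and_byte_counts('spαm\r\n'))   # one extra byte for "α"
```

The two numbers agree only while every character is ASCII, so treating
a byte count as a code-unit count breaks as soon as one non-ASCII
character is typed.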

