[issue27179] subprocess uses wrong encoding on Windows
Eryk Sun
report at bugs.python.org
Sat Jun 4 12:07:16 EDT 2016
Eryk Sun added the comment:
>> so ANSI is the natural default for a detached process
>
> To clarify - ANSI is the natural default *for programs that
> don't support Unicode*.
By natural, I meant in the context of using GetConsoleOutputCP(), since WideCharToMultiByte(0, ...) encodes text as ANSI. Clearly UTF-16LE is preferred for IPC on Windows. It's the native Unicode format down to the lowest levels of the kernel. But we're talking about old-school IPC using standard I/O pipelines, for which I think UTF-8 is a better fit.
> Forcing the use of UTF-8 as the code page is the easiest way
> for us to support it.
The console's behavior for codepage 65001 is too buggy. The showstopper is that it limits input to ASCII. The console allocates a temporary buffer for the encoded text that's sized assuming 1 ANSI/OEM byte per UTF-16 code. So if you enter non-ASCII characters, WideCharToMultiByte fails in conhost.exe. But the console reports that the operation succeeded and read 0 bytes. Python's REPL and input() see this as EOF.
For example:
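import sys, ctypes, msvcrt
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
# Open the console input buffer directly and get its Windows handle.
conin = open(r'\\.\CONIN$', 'r+')
h = msvcrt.get_osfhandle(conin.fileno())
# 15-byte read buffer and a DWORD that receives the number of bytes read.
buf = (ctypes.c_char * 15)()
n = (ctypes.c_ulong * 1)()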
>>> sys.stdin.encoding
'cp65001'
ReadFile test in Windows 10:
>>> kernel32.ReadFile(h, buf, 15, n, None)
Test!
1
>>> n[0], buf[:]
(7, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')
>>> kernel32.ReadFile(h, buf, 15, n, None)
¡Prueba!
1
>>> n[0], buf[:]
(0, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')
The second call obviously fails, even though it returns 1. The input contains the non-ASCII character "¡", which requires 2 bytes in UTF-8, b'\xc2\xa1'. This causes the failure in conhost.exe that I described above.
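The size mismatch can be checked without a console. This is only a minimal sketch in plain Python, assuming the buffer is sized at 1 byte per UTF-16 code as described above. The helper name sizes() is made up for illustration and is not part of any API.

```python
def sizes(text):
    """Return (UTF-16 code count, UTF-8 byte count) for text."""
    utf16_codes = len(text.encode('utf-16-le')) // 2
    utf8_bytes = len(text.encode('utf-8'))
    return utf16_codes, utf8_bytes

for line in ('Test!\r\n', '¡Prueba!\r\n'):
    codes, nbytes = sizes(line)
    # Assumption: the temporary buffer holds `codes` bytes, so the
    # encoded text only fits when it needs no more bytes than that.
    verdict = 'fits' if nbytes <= codes else 'too small'
    print(repr(line), codes, nbytes, verdict)
```

The ASCII line needs exactly as many bytes as UTF-16 codes, while the line starting with "¡" needs one byte more than the assumed buffer size.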
ReadConsoleA has the same problem:
>>> kernel32.ReadConsoleA(h, buf, 15, n, None)
Hello World!
1
>>> n[0], buf[:]
(14, b'Hello World!\r\n\x00')
>>> kernel32.ReadConsoleA(h, buf, 15, n, None)
¡Hola Mundo!
1
>>> n[0], buf[:]
(0, b'Hello World!\r\n\x00')
UTF-8 output is also buggy prior to Windows 8. The problem is that WriteFile returns the number of UTF-16 codes written instead of the number of bytes. For non-ASCII characters in the BMP, 1 UTF-16 code is 2 or 3 UTF-8 bytes. So it looks like a partial write. A buffered writer will loop multiple times to write what appears to be the remaining bytes, leaving a trail of junk lines in proportion to the number of non-ASCII characters written.
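To illustrate the apparent partial write, here is a rough simulation in plain Python. It doesn't touch the real console: buggy_write() is a hypothetical stand-in for the WriteFile behavior described above, and write_all() is a generic retry loop of the kind a buffered writer uses, not Python's actual implementation.

```python
def buggy_write(data, sink):
    """Simulated console write: all of data is written, but the return
    value is the number of UTF-16 codes instead of the number of bytes."""
    text = data.decode('utf-8', errors='replace')
    sink.append(text)
    return len(text.encode('utf-16-le')) // 2

def write_all(data, write):
    """Keep calling write() until every byte is reported as written."""
    calls = 0
    while data:
        n = write(data)
        data = data[n:]  # looks like unwritten bytes, but is junk
        calls += 1
    return calls

sink = []
calls = write_all('¡Hola Mundo!\n'.encode('utf-8'),
                  lambda b: buggy_write(b, sink))
print(calls, sink)
```

Here the 14-byte buffer is reported as 13 written, so the loop writes the final newline a second time. Each additional non-ASCII character adds more trailing junk.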
Python could work around this by decoding the buffer to get the corresponding number of UTF-16 codes written in the console, but child processes may also be subject to this bug. The only general solution on Windows 7 is to use something like ANSICON, which uses DLL injection to hook and wrap WriteFile and WriteConsoleA.
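A sketch of what that decoding workaround might look like, in plain Python for clarity. The function names are hypothetical, and it assumes the buffer is valid UTF-8 and that the console counts whole characters.

```python
def utf16_code_count(data):
    """Number of UTF-16 codes that the UTF-8 bytes in data decode to."""
    return len(data.decode('utf-8').encode('utf-16-le')) // 2

def bytes_written(data, reported):
    """Map a count reported in UTF-16 codes back to UTF-8 bytes."""
    codes = 0
    nbytes = 0
    for ch in data.decode('utf-8'):
        width = 2 if ord(ch) > 0xFFFF else 1  # surrogate pair outside the BMP
        if codes + width > reported:
            break
        codes += width
        nbytes += len(ch.encode('utf-8'))
    return nbytes

data = '¡Hola Mundo!\n'.encode('utf-8')
# A write is complete when the reported count matches the code count.
print(utf16_code_count(data), bytes_written(data, 13), len(data))
```

This only helps the process that applies it. A child process writing to the same console would still see the raw count.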
There's also a UTF-8 related bug in ulib.dll. This bug affects programs that do console codepage conversions, such as more.com. This in turn affects Python's interactive help(). I looked at this in issue 19914. The ulib bug is fixed in Windows 10. I don't know whether it's fixed in Windows 8, but it's there in Windows 7 (supported until 2020).
> This would make Python's implementation much more
> complicated, as well as breaking some scripts and
> existing packages.
Unless you're talking about major breakage, I think switching to the wide-character API is worth it, as the only viable path to supporting Unicode in the console. The implementation probably should transcode between UTF-16LE and UTF-8, so pure Python never sees UTF-16 byte strings. sys.std*.encoding would be 'utf-8'. os.read and os.write would be implemented as _Py_read and _Py_write (which already exist). For console handles these could delegate to _Py_console_read and _Py_console_write, to convert between UTF-8 and UTF-16LE and call ReadConsoleW and WriteConsoleW.
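The transcoding step might look roughly like the following, sketched in Python rather than C. The names console_write and console_read are illustrative only. The write_wide and read_wide callables are hypothetical stand-ins for WriteConsoleW and ReadConsoleW, so the sketch runs anywhere, and it ignores details such as a multibyte sequence split across calls.

```python
def console_write(data, write_wide):
    """Take UTF-8 bytes, pass UTF-16LE to write_wide, and report the
    number of UTF-8 bytes consumed, which is what the caller expects."""
    text = data.decode('utf-8')
    write_wide(text.encode('utf-16-le'))
    return len(data)

def console_read(read_wide):
    """Fetch UTF-16LE bytes from read_wide and return them as UTF-8."""
    return read_wide().decode('utf-16-le').encode('utf-8')

wide = []
written = console_write('¡Hola!'.encode('utf-8'), wide.append)
line = console_read(lambda: '¡Prueba!\r\n'.encode('utf-16-le'))
print(written, wide, line)
```

The point is that the caller only ever deals in UTF-8 byte counts, while the console side only sees UTF-16LE.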
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue27179>
_______________________________________