[Python-ideas] Fix default encodings on Windows

Wed Aug 10 16:16:57 EDT 2016

On Wed, Aug 10, 2016 at 6:10 PM, Steve Dower <steve.dower at python.org> wrote:
> Similarly, locale.getpreferredencoding() on Windows returns a legacy value -
> the user's active code page - which should generally not be used for any
> reason. The one exception is as a default encoding for opening files when no
> other information is available (e.g. a Unicode BOM or explicit encoding
> argument). BOMs are very common on Windows, since the default assumption is
> nearly always a bad idea.

The CRT doesn't allow UTF-8 as a locale encoding because Windows
itself doesn't allow this. So locale.getpreferredencoding() can't
change, but in practice it can be ignored.
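
To illustrate what "ignored" can look like in user code, here is a minimal sketch that passes an explicit encoding so the result of locale.getpreferredencoding() never comes into play. The helper name write_and_read_utf8 is made up for this example.

```python
import locale
import os
import tempfile


def write_and_read_utf8(text: str) -> str:
    """Round-trip text through a file using an explicit UTF-8 encoding.

    Hypothetical helper: the point is only that encoding="utf-8" is passed
    on every open(), so the locale's preferred encoding is never consulted.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


# Shows the legacy default that open() would otherwise fall back to;
# the value depends on the system.
print(locale.getpreferredencoding(False))
```

The same text round-trips regardless of the active code page, which is the behavior you don't get when the encoding argument is omitted.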

Speaking of locale, Windows Python should call setlocale(LC_CTYPE, "")
in pylifecycle.c in order to work around an inconsistency between
LC_TIME and LC_CTYPE in the default "C" locale. The former is ANSI
while the latter is effectively Latin-1, which leads to mojibake in
time.tzname and elsewhere. Calling setlocale(LC_CTYPE, "") is already
done on most Unix systems, so this would actually improve
cross-platform consistency.
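
Until something like that happens in pylifecycle.c, a script can make the equivalent call itself. This is only a sketch of the Python-level call, not the proposed C change, and adopt_user_ctype is a name I made up for it.

```python
import locale
import time


def adopt_user_ctype():
    """Switch LC_CTYPE from the default "C" locale to the user's locale.

    Returns the new locale string, or None if the environment names a
    locale that the C runtime rejects.
    """
    try:
        return locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return None


before = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
after = adopt_user_ctype()
# On Windows, time.tzname is one of the places where the mismatch shows up.
print(before, after, time.tzname)
```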

> Finally, the encoding of stdin, stdout and stderr are currently (correctly)
> inferred from the encoding of the console window that Python is attached to.
> However, this is typically a codepage that is different from the system
> codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users
> are starting Python from a console, they can use "chcp 65001" first to
> switch to UTF-8, and then *most* functionality works (input() has some
> issues, but those can be fixed with a slight rewrite and possibly breaking
> readline hooks).

Using codepage 65001 for output is broken prior to Windows 8 because
WriteFile/WriteConsoleA returns (as an output parameter) the number of
decoded UTF-16 code units instead of the number of bytes written,
which makes a buffered writer repeatedly write garbage at the end of
each write in proportion to the number of non-ASCII characters. This
can be worked around by decoding to get the UTF-16 size before each
write, or by just blindly assuming that a console write always
succeeds in writing the entire buffer. In this case the console should
be detected by GetConsoleMode(). isatty() isn't right for this since
it's true for all character devices, which includes NUL among others.
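
Here's a rough sketch of both points. The first part simulates the miscounted return value in pure Python (buggy_console_write is a stand-in, not the real WriteFile) so the repeated tail is visible without an old Windows box. The second part is one way to do the GetConsoleMode() check via ctypes; is_console is a hypothetical helper and simply returns False off Windows.

```python
import sys


def utf16_units(data: bytes) -> int:
    """Number of UTF-16 code units that the UTF-8 bytes decode to."""
    text = data.decode("utf-8", errors="replace")
    return len(text.encode("utf-16-le")) // 2


def buggy_console_write(data: bytes, sink: list) -> int:
    """Simulated write: all bytes are written, but the reported count is in
    UTF-16 code units rather than bytes."""
    sink.append(data)
    return utf16_units(data)


def naive_write_all(data: bytes, sink: list) -> None:
    """What a buffered writer does: retry whatever it believes is unwritten."""
    while data:
        written = buggy_console_write(data, sink)
        data = data[written:]


def is_console(fd: int) -> bool:
    """True only if fd refers to a Windows console, per GetConsoleMode()."""
    if sys.platform != "win32":
        return False
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.LPDWORD]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    try:
        handle = msvcrt.get_osfhandle(fd)
    except OSError:
        return False
    mode = wintypes.DWORD()
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))


chunks = []
naive_write_all("\u00e9\u00e9".encode("utf-8"), chunks)
print(chunks)  # the full buffer, followed by re-written tail fragments
```

With the check in place, a writer could treat a console write as complete when is_console() is true, and trust the returned count otherwise.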

Codepage 65001 is broken for non-ASCII input (via
ReadFile/ReadConsoleA) in all versions of Windows that I've tested,
including Windows 10. By attaching a debugger to conhost.exe you can
see how it fails in WideCharToMultiByte because it assumes one byte
per character. If you try to read 10 bytes, it assumes you're trying
to read 10 UTF-16 'characters' into a 10 byte buffer, which fails for
UTF-8 when even a single non-ASCII character is read. The
ReadFile/ReadConsoleA call returns that it successfully read 0 bytes,
which is interpreted as EOF. This cannot be worked around. The only
way to read the full range of Unicode from the console is via the
wide-character APIs ReadConsoleW and ReadConsoleInputW.

IMO, Python needs a C implementation of the win_unicode_console
module, using the wide-character APIs ReadConsoleW and WriteConsoleW.
Note that win_unicode_console sets sys.std*.encoding to UTF-8 and
transcodes, so Python code never has to work directly with UTF-16
encoded text.
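
The transcoding idea can be sketched roughly as follows. This is not win_unicode_console's actual code: TranscodingConsoleReader is a made-up class, and read_wide is a hypothetical callable standing in for ReadConsoleW that returns UTF-16-LE bytes for up to the requested number of characters.

```python
import codecs
import io


class TranscodingConsoleReader(io.RawIOBase):
    """Raw stream that reads UTF-16-LE from a wide-character source and
    hands UTF-8 bytes to the layers above it."""

    def __init__(self, read_wide):
        # read_wide(nchars) -> bytes is an assumption of this sketch; a real
        # implementation would call ReadConsoleW here.
        self._read_wide = read_wide
        # Incremental decoder, so a surrogate pair split across two reads
        # is still decoded correctly.
        self._decoder = codecs.getincrementaldecoder("utf-16-le")()
        self._pending = b""

    def readable(self):
        return True

    def readinto(self, buf):
        while not self._pending:
            raw = self._read_wide(len(buf))
            if not raw:
                return 0  # genuine end of input
            self._pending = self._decoder.decode(raw).encode("utf-8")
        n = min(len(buf), len(self._pending))
        buf[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


# Usage with an in-memory stand-in for the console:
source = io.BytesIO("h\u00e9llo \u20ac\n".encode("utf-16-le"))
stdin_like = io.TextIOWrapper(
    io.BufferedReader(TranscodingConsoleReader(lambda n: source.read(2 * n))),
    encoding="utf-8",
)
print(stdin_like.encoding)
```

Everything above the raw layer sees an ordinary UTF-8 stream, which is why the reported encoding can be UTF-8 even though the console API deals in UTF-16.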