[Python-ideas] Fix default encodings on Windows

Wed Aug 10 14:10:53 EDT 2016

I suspect there's a lot of discussion to be had around this topic, so I 
want to get it started. There are some fairly drastic ideas here and I 
need help figuring out whether the impact outweighs the value.

Some background: within the Windows API, the preferred encoding is 
UTF-16. This is a 16-bit format that is typed as wchar_t in the APIs 
that use it. These APIs are generally referred to as the *W APIs 
(because they have a W suffix).

There are also (broadly deprecated) APIs that use an 8-bit format 
(char), where the encoding is assumed to be "the user's active code 
page". These are *A APIs. AFAIK, there are no cases where a *A API 
should be preferred over a *W API, and many newer APIs are *W only.

In general, Python passes byte strings into the *A APIs and text strings 
into the *W APIs.

Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which 
translates to "the system's active code page". As this encoding 
generally cannot represent all paths on Windows, it is deprecated and 
Unicode strings are recommended instead. This, however, means you need 
to write significantly different code between POSIX (use bytes) and 
Windows (use text).
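
To make the bytes/text split concrete, here is a minimal sketch using only 
the standard library. The file name "example.txt" is just an illustration. 
The same directory is listed once with a text path and once with a bytes 
path, and the result type follows the argument type:

```python
import os
import sys
import tempfile

# The value discussed above: "mbcs" on Windows today, typically "utf-8" on POSIX.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the directory listing is not empty.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    # A text path in gives text names out.
    text_names = os.listdir(tmp)

    # A bytes path in gives bytes names out.
    bytes_names = os.listdir(os.fsencode(tmp))

print(text_names, bytes_names)
```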

ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and 
updating path_converter() (Python/posixmodule.c; likely similar code in 
other places) to decode incoming byte strings would allow us to 
undeprecate byte strings and add the requirement that they *must* be 
encoded with sys.getfilesystemencoding(). I assume that this would allow 
cross-platform code to handle paths similarly by encoding to whatever 
the sys module says they should and using bytes consistently (starting 
this thread is meant to validate/refute my assumption).
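
As a rough sketch of what "encode to whatever the sys module says" could 
look like in cross-platform code: the helper name to_bytes_path is 
hypothetical, and it does essentially what os.fsencode() already does. It 
assumes Python 3.6+, where sys.getfilesystemencodeerrors() is available.

```python
import os
import sys


def to_bytes_path(path):
    """Return `path` as bytes, encoded with the filesystem encoding.

    Hypothetical helper for illustration; os.fsencode() is the stdlib
    equivalent. Bytes input is passed through unchanged.
    """
    if isinstance(path, bytes):
        return path
    return path.encode(sys.getfilesystemencoding(),
                       sys.getfilesystemencodeerrors())


# Round trip: text -> bytes -> text gives back the original name.
encoded = to_bytes_path("data.txt")
decoded = os.fsdecode(encoded)
print(encoded, decoded)
```

Under the proposal, the bytes produced this way would be UTF-8 on Windows 
as well, so the same code path would work on both platforms.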

(Yes, I know that people on POSIX should just change to using Unicode 
and surrogateescape. Unfortunately, rather than doing that they complain 
about Windows and drop support for the platform. If you want to keep 
hitting them with the stick, go ahead, but I'm inclined to think the 
carrot is more valuable here.)

Similarly, locale.getpreferredencoding() on Windows returns a legacy 
value - the user's active code page - which should generally not be used 
for any reason. The one exception is as a default encoding for opening 
files when no other information is available (e.g. a Unicode BOM or 
explicit encoding argument). BOMs are very common on Windows, since 
assuming the active code page is nearly always a bad idea.

Making open() check for a BOM by default before falling back to 
locale.getpreferredencoding() would resolve many issues, but I'm also 
inclined towards making the fallback utf-8, leaving 
locale.getpreferredencoding() solely as a way to get the active system 
code page (with suitable warnings about it only being useful for 
back-compat). This would match the behavior that the .NET Framework has 
used for many years - effectively, utf_8_sig on read and utf_8 on write.
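
A minimal sketch of the BOM check, written as ordinary user code rather 
than as a change to open() itself. The names detect_encoding and open_text 
are hypothetical, and the fallback parameter is an assumption standing in 
for whichever default ends up being chosen:

```python
import codecs
import os
import tempfile

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(head, fallback="utf-8"):
    """Pick an encoding from a leading BOM, or return `fallback`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return fallback


def open_text(path, fallback="utf-8"):
    """Open `path` for reading, sniffing a BOM first (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    return open(path, "r", encoding=detect_encoding(head, fallback))


# Usage: a UTF-8 file with a BOM is read back without the BOM character.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "h\u00e9llo".encode("utf-8"))
    with open_text(path) as f:
        content = f.read()
```

Passing fallback=locale.getpreferredencoding(False) instead of "utf-8" 
would give the more conservative variant that only adds the BOM check.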

Finally, the encodings of stdin, stdout and stderr are currently 
(correctly) inferred from the encoding of the console window that Python 
is attached to. However, this is typically a code page that is different 
from the system code page (i.e. it's not mbcs) and is almost certainly 
not Unicode. If users are starting Python from a console, they can use 
"chcp 65001" first to switch to UTF-8, and then *most* functionality 
works (input() has some issues, but those can be fixed with a slight 
rewrite and possibly breaking readline hooks).

It is also possible for Python to change the current console encoding to 
be UTF-8 on initialize and change it back on finalize. (This would leave 
the console in an unexpected state if Python segfaults, but console 
encoding is probably the least of anyone's worries at that point.) So 
I'm proposing actively changing the current console to be Unicode while 
Python is running, and hence sys.std[in|out|err] will default to utf-8.
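
For anyone who wants to experiment with the switch-and-restore idea from 
user code, here is a rough sketch using ctypes. The context manager name 
utf8_console is hypothetical, and this is not how the interpreter itself 
would do it. It calls the Win32 functions GetConsoleCP, SetConsoleCP, 
GetConsoleOutputCP and SetConsoleOutputCP, and does nothing on other 
platforms or when no console is attached:

```python
import contextlib
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8 (as in "chcp 65001")


@contextlib.contextmanager
def utf8_console():
    """Switch the console to UTF-8 for the duration of the block.

    Yields True if the code pages were changed, False otherwise.
    """
    if sys.platform != "win32":
        yield False
        return

    import ctypes
    kernel32 = ctypes.windll.kernel32

    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    if not old_in or not old_out:
        # Both calls return 0 on failure, e.g. when no console is attached.
        yield False
        return

    kernel32.SetConsoleCP(CP_UTF8)
    kernel32.SetConsoleOutputCP(CP_UTF8)
    try:
        yield True
    finally:
        # Restore the previous code pages, mirroring "change it back on
        # finalize". This does not run if the process is killed.
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)


with utf8_console() as changed:
    print("console switched to UTF-8:", changed)
```

Note that sys.stdout and friends pick their encoding at startup, so a 
switch made this late does not by itself change sys.stdout.encoding; the 
proposal would make the change during initialization instead.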

So that's a broad range of changes, and I have little hope of figuring 
out all the possible issues, back-compat risks, and flow-on effects on 
my own. Please let me know (either on-list or off-list) how a change 
like this would affect your projects, either positively or negatively, 
and whether you have any specific experience with these changes/fixes 
and think they should be approached differently.

To summarise the proposals (remembering that these would only affect 
Python 3.6 on Windows):

* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else 
use locale.getpreferredencoding()
* force the console encoding to UTF-8 on initialize and revert on finalize

So what are your concerns? Suggestions?

Thanks,
Steve