[Python-ideas] Fix default encodings on Windows

Victor Stinner victor.stinner at gmail.com
Tue Aug 16 06:28:58 EDT 2016


2016-08-16 8:06 GMT+02:00 eryk sun <eryksun at gmail.com>:
> My proposal was to use the wide-character APIs, but transcoding CP_ACP
> without best-fit characters and raising a warning whenever the default
> character is used (e.g. substituting Katakana middle dot when creating
> a file using a bytes path that has an invalid sequence in CP932).

A problem with all these proposals is that they *add* new code to the
CPython code base, code specific to Windows. There are very few core
developers (1 or 2?) who work on this code specific to Windows.


I would prefer to *drop* code specific to Windows rather than *add*
(or change) code specific to Windows, just to make the CPython code
base simpler to maintain.

It's already annoying enough. It's common that a Python function has
one implementation for all platforms except Windows, and a second
implementation specific to Windows.

An example: os.listdir()

* ~150 lines of C code for the Windows implementation
* ~100 lines of C code for the UNIX/BSD implementation
* The Windows implementation is split into two parts: Unicode and
bytes, so the code is basically duplicated (2 versions)

If you remove the bytes support, the Windows function is reduced to
100 lines (-50).


I'm not sure that modifying the API using bytes would solve any issue
on Windows, and there is an obvious risk of regression (mojibake when
you concatenate strings encoded to UTF-8 and to the ANSI code page).
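
As a rough illustration of that mojibake risk, here is a minimal Python 3
sketch. It assumes cp1252 stands in for the ANSI code page (the real one
depends on the system locale), and the names are purely illustrative:

```python
# Assumption: cp1252 plays the role of the ANSI code page here.
name = "caf\u00e9"                      # 'café'

utf8_part = name.encode("utf-8")        # b'caf\xc3\xa9'
ansi_part = name.encode("cp1252")       # b'caf\xe9'

# Concatenating bytes produced with two different encodings.
mixed = utf8_part + b"/" + ansi_part

# Decoded as the ANSI code page, the UTF-8 half turns into mojibake.
as_ansi = mixed.decode("cp1252")        # 'cafÃ©/café'

# Decoded as UTF-8, the ANSI half is not valid at all.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Neither decoding recovers both halves, which is why mixing the two
encodings in one bytes path is dangerous.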

I'm in favor of forcing developers to use Unicode on Windows, which is
the correct way to use the Windows API. The side effect is that such
code works perfectly well on UNIX/BSD ;-) To be clear: drop the
deprecated code to support bytes on Windows.
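
A small sketch of what "use Unicode" means in practice: pass str paths
and get str names back. The file name below is only an example, and the
sketch assumes the filesystem can store it:

```python
import os
import tempfile

# tempfile.TemporaryDirectory() yields a str path, so every call below
# stays in the Unicode (str) API on all platforms.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "\u0434\u0430\u043d\u043d\u044b\u0435.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")

    # A str argument makes os.listdir() return str names.
    names = os.listdir(tmp)
    all_str = all(isinstance(n, str) for n in names)
```

The same code runs unchanged on Windows and on UNIX/BSD; no bytes path
is ever built by hand.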

I already proposed to drop bytes support on Windows and most answers
were "please keep them", so another option is to keep the "broken
code" as the status quo...

I really hate APIs using bytes on Windows because they use
WideCharToMultiByte() (encode unicode to bytes) in a mode which is
likely to lead to mojibake: unencodable characters are replaced with
"best fit characters" or "?".
https://unicodebook.readthedocs.io/operating_systems.html#encode-and-decode-functions
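
The "?" replacement can be approximated from pure Python. This sketch
uses the cp1252 codec with errors="replace" as a stand-in for the ANSI
code page; it does not call WideCharToMultiByte() and does not reproduce
the "best fit" mapping, only the lossy "?" case:

```python
# Assumption: cp1252 stands in for the ANSI code page.
cyrillic = "\u0434\u0430\u043d\u043d\u044b\u0435.txt"   # 6 Cyrillic letters
chinese = "\u6587\u4ef6.txt"                            # 2 CJK characters

# Unencodable characters are silently replaced with "?".
lossy = cyrillic.encode("cp1252", errors="replace")     # b'??????.txt'

# Two different names can collapse to the same bytes.
collision = (
    "\u0430\u0431.txt".encode("cp1252", errors="replace")
    == chinese.encode("cp1252", errors="replace")
)

# The default strict mode reports the problem instead of hiding it.
try:
    cyrillic.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Once the name has been replaced by "?" characters, the original file
name cannot be recovered from the bytes.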


In a perfect world, I would also propose to deprecate bytes filenames
on UNIX, but I expect an insane flamewar on the definition of "UNIX",
history of UNIX, etc. (non technical discussion, since Unicode works
very well on Python 3...).

Victor

