[Python-Dev] File system path encoding on Windows

Victor Stinner victor.stinner at gmail.com
Mon Aug 29 18:38:31 EDT 2016


Hi,

tl;dr: just drop bytes support and help developers use Unicode in
their applications!


As you may already know, I dislike your whole project. But first of
all, IMHO you should open a separate thread to discuss the changes
related to the Windows console. I consider them unrelated, well
defined, and something that makes sense to everyone, no?


It would help the discussion to estimate how much code is going to
break on Windows if Python 3 drops bytes support. Maybe try some
common and major applications like Django and Twisted? My expectation
is that very little code explicitly uses bytes paths, and that most
applications already use Unicode, simply because it's much easier to
use Unicode than bytes in Python 3.

Mercurial does its best to keep filenames as bytes, unchanged. I once
tried to port Mercurial to Python 3, and I recall that handling
filenames as bytes was very painful. A simple example:
print(filename) raises a BytesWarning (or displays a quoted string
prefixed with 'b', which is not really the expected result).
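
Here is a minimal sketch of that pain (a toy example of mine, not
Mercurial code):

    import os

    # A filename kept as bytes, as Mercurial does (UTF-8 bytes here).
    filename = b'caf\xc3\xa9.txt'

    # Under "python -b" this emits a BytesWarning; without the flag it
    # just prints the repr, b'caf\xc3\xa9.txt', not the readable name.
    print(filename)

    # To get readable output you have to decode explicitly, for
    # example with the filesystem encoding:
    print(os.fsdecode(filename))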


You always use the same example, os.listdir(bytes). Ok, this one is a
mess on Windows. But modifying the value of
sys.getfilesystemencoding() has a broad effect: it changes *all*
functions in Python (ok, not really "all", but I'm trying to say that
a lot of functions use it). It's not only filenames: hostnames
(sockets), input/output data from other applications (subprocess
pipes), file content, command line arguments, environment variables,
etc. also use the ANSI code page.
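
To make the "broad effect" a bit more concrete, here is a small sketch
(the 'mbcs' value is what Python 3.5 reports on Windows):

    import os
    import sys

    # On Windows with Python 3.5 this returns 'mbcs', i.e. the ANSI
    # code page.
    print(sys.getfilesystemencoding())

    # os.fsencode()/os.fsdecode() are built on that single value, so
    # changing what it reports affects every str<->bytes conversion
    # that goes through them, well beyond os.listdir(bytes).
    raw = os.fsencode('héllo.txt')   # str -> bytes, filesystem encoding
    print(os.fsdecode(raw))          # bytes -> str, back again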

Well, I know UNIX well, where everything is stored as bytes. But
fortunately, Windows has nice functions that provide data directly as
Unicode, avoiding any mojibake issue, at least for command line
arguments and environment variables. But while your application may be
careful with Unicode, in practice "data" is transferred over many
channels and suffers from conversions.

There is no single, well defined channel. For example, "pipes" are
a pain on Windows. There is no "Unicode pipe" to exchange *text*
between two processes. I only know of "cmd /u" (which seems specific
and restricted to cmd.exe) or _setmode(), which has its own set of
issues: http://bugs.python.org/issue16587

See also this old article
http://archives.miloush.net/michkap/archive/2008/03/18/8306597.html
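
A sketch of what reading from a pipe looks like today (the code page
names below are just guesses, and that is precisely the problem):

    import subprocess

    # The child process writes bytes; there is no "Unicode pipe".
    # Console programs typically use the OEM code page (cp850, cp437,
    # ...), GUI-style tools often use the ANSI code page, and some
    # write UTF-8 or UTF-16.
    out = subprocess.check_output(['cmd', '/c', 'dir'])

    # We have to guess. A wrong guess means mojibake or a
    # UnicodeDecodeError.
    text = out.decode('cp850', errors='replace')
    print(text)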


If Python 3.6 is going to speak UTF-8 only, there is a high risk that
it will produce data unreadable by other applications. For example,
write a filename into a file. I expect the filename to be stored as
bytes, and so as UTF-8. If another application tries to decode the
file content from the ANSI code page, it's likely that you will get a
"fatal" Unicode decoding error at the first non-ASCII character.
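
A concrete sketch of that failure mode (cp1252 is only an example of
an ANSI code page):

    # A UTF-8-only Python writes a Japanese filename into a text file:
    filename = 'あいうえお.txt'
    data = filename.encode('utf-8')

    # Another application (or Python 2) reads the file back and
    # decodes it with its ANSI code page:
    try:
        print(data.decode('cp1252'))   # mojibake at best...
    except UnicodeDecodeError as exc:
        print('fatal:', exc)           # ...here it fails: 0x81 is
                                       # undefined in cp1252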

The status quo is not better: you cannot store Japanese filenames in a
file if your ANSI code page is French. But if you decide to produce a
UTF-8 file, the effect is limited to that file, not to *all*
input and output data! For example, distutils was slowly upgraded
(from "encoding not specified by any spec, but most likely the locale
encoding") to use UTF-8 everywhere ;-)


More generally, I understand that you propose to explicitly break
interoperability with other applications. Python 2 can serve as an
example of an "application" that expects data encoded with the ANSI
code page.

Yeah, I expect that storing text as UTF-8 inside the same process will
reduce the number of Unicode errors *inside the same process*. It's
likely that you will no longer get Unicode errors if you control all
data inside the same process.

But to me, interoperability is more important than the benefits of
your proposed changes.


Another way to express my concern: I don't understand how you plan to
"transcode" data from/to other applications. Will subprocess guess the
input and output encoding and transcode between the guessed encoding
and UTF-8?

Even today with Python 3.5, it's already easy to get mojibake between
the ANSI code page, the OEM code page, and whatever other encodings an
application uses. In short, you would introduce yet another encoding
which is rare on Windows and incompatible with the others (well,
except for ASCII, which is a subset of many encodings; most encodings
are incompatible with all other encodings).
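
A tiny example of the existing mess, using the Western European code
pages (cp1252 is ANSI, cp850 is OEM):

    # The same character already has different byte values in the ANSI
    # and OEM code pages:
    print('é'.encode('cp1252'))    # b'\xe9'  (ANSI)
    print('é'.encode('cp850'))     # b'\x82'  (OEM)

    # So decoding with the wrong one already gives mojibake today:
    print(b'\x82'.decode('cp1252'))    # '‚'
    print(b'\xe9'.decode('cp850'))     # 'Ú'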


To me, you are building a giant patch to hide the mess, whereas this
problem was already solved in 1991 with the release of Unicode 1.0.
I'm not saying that it's a mistake to use UTF-8 internally. It's more
that I don't think it's worth it. There is a risk that it adds
extra work to "support UTF-8", whereas this energy would be better
spent supporting Unicode, no?


I suggest either fixing the problem by really dropping bytes support,
as "announced" in Python 3.3 with the deprecation, or doing nothing
(and continuing to suffer).

I'm a supporter of the former option :-) Force developers to fix their
applications and learn Unicode the hard way :-)

From what I heard, Unicode is not really a pain in Python 3. The pain
comes from porting Python 2 code to Python 3, because not only do you
have to do the boring refactoring work to fix Python 3 syntax issues,
you also have to keep Python 2 compatibility *and* fix *all* Unicode
issues *at once*. From what I heard, text handling in Python 3 itself
is a real pleasure, since the "encoding issue" (Unicode?) is "solved".

We only solved the issue by forcing developers to use Unicode
everywhere. I don't think we can keep making timid compromises
on this topic...


Ok. Maybe I'm just wrong and using UTF-8 internally will be super cool :-)

Victor

