[Python-Dev] File system path encoding on Windows

Fri Aug 19 14:59:32 EDT 2016

Hi python-dev

About a week ago I proposed on python-ideas making some changes to how 
Python deals with encodings on Windows, specifically in relation to how 
Python interacts with the operating system.

Changes to the console were uncontroversial, and I have posted patches 
at http://bugs.python.org/issue1602 and 
http://bugs.python.org/issue17620 to enable the full range of Unicode 
input to be used at interactive stdin/stdout.

However, changes to sys.getfilesystemencoding(), which determines how 
the os module (and most filesystem functions in general) interpret bytes 
parameters, were more heatedly discussed. I've summarised the discussion 
in this email.

I'll declare up front that my preferred change is to treat bytes as 
utf-8 in Python 3.6, and I've posted a patch to do that at 
http://bugs.python.org/issue27781. Hopefully I haven't been too biased 
in my presentation of the alternatives, but this is so you at least know 
which way I'm biased.

I'm looking for some agreement on the answers to the questions I pose in 
the summary.

There is much more detail about them presented after that, as there are 
a number of non-obvious issues at play here. I suspect this will 
eventually become a PEP, but it's presented here as a summary of a 
discussion and not a PEP.

Cheers,
Steve

Summary
=======

Representing file system paths on Windows as bytes may result in data 
loss due to the way Windows encodes/decodes strings via its bytes API.

We can mitigate this by only using the Windows Unicode API and doing our 
own encoding and decoding (i.e. within posixmodule.c's path converter). 
Invalid characters could cause encoding exceptions rather than data loss.

We can go further to fix this by declaring the encoding of bytes paths 
on Windows must be utf-8, which would also prevent encoding exceptions, 
as utf-8 can fully represent all paths on Windows (natively utf-16-le).

Even though using bytes for paths on Windows has been deprecated for 
three releases, this is not widely known and it may be too soon to 
change the behaviour.

Questions:
* should we always use the Windows Unicode APIs instead of switching 
between bytes/Unicode based on parameter type?
* should we allow users to pass bytes and interpret them as utf-8 rather 
than letting Windows do the decoding?
* should we do it in 3.6, 3.7 or 3.8?

Background
==========

File system paths are almost universally represented as text in some 
encoding determined by the file system. In Python, we expose these paths 
via a number of interfaces, such as the os and io modules. Paths may be 
passed either direction across these interfaces, that is, from the 
filesystem to the application (for example, os.listdir()), or from the 
application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they 
are either passed through as a bytes blob or converted to/from str using 
sys.getfilesystemencoding(). The result of encoding a string with 
sys.getfilesystemencoding() is a blob of bytes in the native format for 
the default file system.
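
As a minimal sketch of that conversion (my illustration, not part of the 
proposal): os.fsencode() and os.fsdecode() apply 
sys.getfilesystemencoding() for us. The file name is made up, and it is 
kept ASCII-only so the result does not depend on which encoding is 
reported.

```python
import os
import sys

# Hypothetical file name, ASCII-only so every filesystem encoding agrees on it.
name = 'example.txt'

encoding = sys.getfilesystemencoding()   # platform dependent, e.g. 'utf-8' or 'mbcs'
as_bytes = os.fsencode(name)             # str -> bytes using that encoding
as_text = os.fsdecode(as_bytes)          # bytes -> str using the same encoding

print(encoding, as_bytes, as_text)
```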

On Windows, the native format for the filesystem is utf-16-le. The 
recommended platform APIs for accessing the filesystem all accept and 
return text encoded in this format. However, prior to Windows NT (and 
possibly further back), the native format was a configurable machine 
option and a separate set of APIs existed to accept this format. The 
option (the "active code page") and these APIs (the "*A functions") 
still exist in recent versions of Windows for backwards compatibility, 
though new functionality often only has a utf-16-le API (the "*W 
functions").

In Python, we recommend using str as the default format because (with 
the surrogateescape handling on POSIX), it can correctly round-trip all 
characters used in paths. On Windows this is strongly recommended 
because the legacy OS support for bytes cannot round-trip all characters 
used in paths. Our support for bytes explicitly uses the *A functions 
and hence the encoding for the bytes is "whatever the active code page 
is". Since the active code page cannot represent all Unicode characters, 
the conversion of a path into bytes can lose information without warning 
(and we can't get a warning from the OS here - more on this later).

As a demonstration of this:

>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']

The Unicode character in the second call to glob has been replaced by a 
'?', which means passing the path back into the filesystem will result 
in a FileNotFoundError (though ironically, passing it back into glob() 
will find the file again, since '?' is a single-character wildcard). You 
can observe the same results in os.listdir() or any function that 
matches the return type to the parameter type.
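
The replacement itself can be reproduced without Windows. This sketch 
assumes cp1252 as a stand-in for the active code page and uses 
errors='replace' to mimic the behaviour of the *A functions; neither is 
literally what the OS runs.

```python
name = 'test\uab00.txt'

# cp1252 stands in for the active code page (an assumption for illustration).
lossy = name.encode('cp1252', errors='replace')
print(lossy)              # the unencodable character has been replaced by b'?'

# Decoding again cannot recover the original name, so the file is not found.
recovered = lossy.decode('cp1252')
print(recovered == name)
```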

Why is this a problem?
======================

While the obvious and correct answer is to just use str everywhere, in 
general on POSIX systems there is no possibility of confusion when using 
bytes exclusively. Even if the encoding is "incorrect" by some standard, 
the file system can still map the bytes back to the file. Making use of 
this avoids the cost of decoding and reencoding, such that 
(theoretically, and only on POSIX), code like below is faster because of 
the use of `b'.'`:

>>> import os
>>> for f in os.listdir(b'.'):
...     os.stat(f)
...

On Windows, if a filename exists that cannot be encoded with the active 
code page, you will receive an error from the above code. These errors 
are why in Python 3.3 the use of bytes paths on Windows was deprecated 
(listed in the What's New, but not clearly obvious in the documentation 
- more on this later). The above code produces multiple deprecation 
warnings in 3.3, 3.4 and 3.5 on Windows.

However, we still keep seeing libraries use bytes paths, which can cause 
unexpected issues on Windows (well, all platforms, but less and less 
common on POSIX as systems move to utf-8 - Windows long ago decided to 
move to utf-16 for the same reason, but Python's bytes interface did not 
keep up). Given the current approach of not-very-aggressively 
recommending that library developers either write their code twice (once 
for bytes and once for str) or use str exclusively is not working, we 
should consider alternative mitigations.

Proposals
=========

There are two dimensions here - the fix and the timing. We can basically 
choose any fix and any timing.

The main differences between the fixes are the balance between incorrect 
behaviour and backwards-incompatible behaviour. The main issue with 
respect to timing is whether or not we believe using bytes as paths on 
Windows was correctly deprecated in 3.3 and sufficiently advertised 
since to allow us to change the behaviour in 3.6.

Fixes
-----

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a 
meta-encoder that uses the active code page. However, when bytes are 
passed to the filesystem they go through the *A APIs and the operating 
system handles encoding. In this case, paths are always encoded using 
the equivalent of 'mbcs:replace' - we have no ability to change this 
(though there is a user/machine configuration option to change the 
encoding from CP_ACP to CP_OEM, so it won't necessarily always match 
mbcs...)

This proposal would remove all use of the *A APIs and only ever call the 
*W APIs. When Windows returns paths to Python as str, they will be 
decoded from utf-16-le and returned as text. When paths are to be 
returned as bytes, we would transcode from utf-16-le to utf-8 using 
surrogatepass (as Windows does not validate surrogate pairs, so it is 
possible to have invalid surrogates in filenames). Equally, when paths 
are provided as bytes, they are transcoded from utf-8 into utf-16-le and 
passed to the *W APIs.
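
To illustrate why surrogatepass matters, here is a rough sketch in pure 
Python (the real conversion would live in C in the path converter, and 
the file name is invented). A lone surrogate cannot be encoded as strict 
utf-8, but round-trips once surrogatepass is used on both sides.

```python
# A made-up name containing a lone surrogate, which Windows does not reject.
name = 'test\ud800.txt'

# Roughly what the *W APIs deal in: utf-16-le code units, surrogate included.
wide = name.encode('utf-16-le', 'surrogatepass')

# Strict utf-8 refuses lone surrogates...
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but surrogatepass lets them through, so wide -> narrow -> str round-trips.
narrow = wide.decode('utf-16-le', 'surrogatepass').encode('utf-8', 'surrogatepass')
back = narrow.decode('utf-8', 'surrogatepass')

print(strict_ok, narrow, back == name)
```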

The use of utf-8 will not be configurable, with the possible exception 
of a "legacy mode" environment variable or -X option.

surrogateescape does not apply here, as we are not concerned about 
keeping arbitrary bytes in the path. Any bytes path returned from the 
operating system will be valid; any bytes path created by the user may 
raise a decoding error (currently it would raise a file not found or 
similar OSError).

The choice of utf-8 (as opposed to returning utf-16-le bytes) is to 
ensure the ability to round-trip, while also allowing basic manipulation 
of paths - essentially just slicing and concatenating at '\' characters. 
Applications doing this have to ensure that their encoding matches 
sys.getfilesystemencoding(), or just use str everywhere.
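
One way to see the difference, as a sketch with an invented path: in 
utf-8 the byte 0x5C only ever means '\', so splitting bytes at separators 
is safe, whereas in utf-16-le the same byte can be half of another 
character (here U+015C) and a naive split produces a bogus extra piece.

```python
# Hypothetical path; the file name starts with U+015C, whose utf-16-le
# encoding is the byte pair 5C 01.
path = 'C:\\data\\\u015c.txt'

# utf-8: the byte 0x5C never occurs inside a multi-byte sequence.
utf8_parts = path.encode('utf-8').split(b'\\')
decoded_parts = [p.decode('utf-8') for p in utf8_parts]

# utf-16-le: a naive split on the single byte 0x5C also cuts U+015C in half.
utf16_parts = path.encode('utf-16-le').split(b'\\')

print(decoded_parts == path.split('\\'))
print(len(utf8_parts), len(utf16_parts))
```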

It is debated, but I believe this is not a backwards compatibility issue 
because:
* byte paths in Python are specified as being encoded by 
sys.getfilesystemencoding()
* byte paths on Windows have been deprecated for three versions

Unfortunately, the deprecation is not explicitly called out anywhere in 
the docs apart from the What's New page, so there is an argument that it 
shouldn't be counted despite the warnings in the interpreter. However, 
this is more directly addressed in the discussion of timing below.

Equally, sys.getfilesystemencoding() documents the specific return 
values for various platforms, as well as that it is part of the protocol 
for using bytes to represent filesystem strings.

I believe both of these arguments are invalid, that the only code that 
will break as a result of this change is relying on deprecated 
functionality and incorrect encoding, and that the (probably noisy) 
breakage that will occur is less bad than the silent breakage that 
currently exists.

As far as implementation goes, there is already a patch for this at 
http://bugs.python.org/issue27781. In short, we update the path 
converter to decode bytes (path->narrow) to Unicode (path->wide) and 
remove all the code that would call *A APIs. In my patch I've changed 
path->narrow to a flag that indicates whether to convert back to bytes 
on return, and also to prevent compilation of code that tries to use 
->narrow as a string on Windows (maybe that will get too annoying for 
contributors? good discussion for the tracker IMHO).

Fix #2: Do the mbcs decoding ourselves

This is essentially the same as fix #1, but instead of changing to utf-8 
we keep mbcs as the encoding.

This approach will allow us to utilise new functionality that is only 
available as *W APIs, and also lets us be more strict about 
encoding/decoding to bytes. For example, rather than silently replacing 
Unicode characters with '?', we could warn or fail the operation, 
potentially modifying that behaviour with an environment variable or flag.

Compared to fix #1, this will enable some new functionality but will not 
fix any of the problems immediately. New runtime errors may cause some 
problems to be more obvious and lead to fixes, provided library 
maintainers are interested in supporting Windows and adding a separate 
code path to treat filesystem paths as strings.

This is a middle-ground proposal. On the positive side, it significantly 
reduces the code we have to maintain in CPython (e.g. posixmodule.c), as 
we won't require separate code paths to call the *A APIs. However, it 
doesn't really improve things for users apart from giving more 
exceptions, which are likely unexpected (people probably handle OSError 
but not UnicodeDecodeError when accessing the file system).
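
A small sketch of that last point, again assuming cp1252 as a stand-in 
for the active code page. It shows the encoding direction (a path being 
returned as bytes); the error raised by strict conversion is a 
UnicodeError, which derives from ValueError, so code that only handles 
OSError will not catch it.

```python
name = 'test\uab00.txt'

try:
    name.encode('cp1252')            # 'strict' is the default error handler
    failed = None
except UnicodeEncodeError as exc:
    failed = exc

print(type(failed).__name__)
print(isinstance(failed, OSError))   # an `except OSError:` block would miss it
print(isinstance(failed, ValueError))
```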

Fix #3: Make bytes paths on Windows an error

By preventing the use of bytes paths on Windows completely we prevent 
users from hitting encoding issues. However, we do this at the expense 
of usability. Obviously the deprecation concerns also play a big role in 
whether this is feasible.

I don't have numbers on how many libraries will simply fail on Windows if 
this "fix" is made, but given I've already had people directly email me 
and tell me about their problems we can safely assume it's non-zero.

I'm really not a fan of this fix, because it doesn't actually make 
things better in a practical way, despite being more "pure".

Timing
------

Timing #1: Change it in 3.6

This timing assumes that we believe the deprecation of using bytes for 
paths in Python 3.3 was sufficiently well advertised that we can freely 
make changes in 3.6. A typical deprecation cycle would be two versions 
before removal (though we also often leave things in forever when they 
aren't fundamentally broken), so we have passed that point and 
theoretically can remove or change the functionality without breaking it.

In this case, we would announce in 3.6 that using bytes as paths on 
Windows is no longer deprecated, and that the encoding used is whatever 
is returned by sys.getfilesystemencoding().

Timing #2: Change it in 3.7

This timing assumes that the deprecation in 3.3 was valid, but 
acknowledges that it was not well publicised. For 3.6, we aggressively 
make it known that only strings should be used to represent paths on 
Windows and bytes are invalid and going to change in 3.7. (It has been 
suggested that I could use a keynote at PyCon to publicise this, and 
while I'd totally accept a keynote, I'd hate to subject a crowd to just 
this issue for an hour :) ).

My concern with this approach is that there is no benefit to the change 
at all. If we aggressively publicise the fact that libraries that don't 
handle Unicode paths on Windows properly are using deprecated 
functionality and need to be fixed by 3.7 in order to avoid breaking 
(more precisely - continuing to be broken, but with a different error 
message), then we will alienate non-Windows developers further from the 
platform (net loss for the ecosystem) and convince some to switch to str 
everywhere (net gain for the ecosystem). It doesn't

For those who listen and change to str, it removes the need to make any 
change in 3.7 at all, so we would really just be making noise about 
something that some people may not have noticed without necessarily 
going in and fixing anything. For those who don't listen, the change in 
3.7 is going to break them just as much as if we made the change in 3.6.

Timing #3: Change it in 3.8

This timing assumes that the deprecation in 3.3 was not sufficient and 
we need to start a new deprecation cycle. This is strengthened by the 
fact that the deprecation announcement does not explicitly include the 
io module or the builtin open() function, and so some developers may 
believe that using bytes for paths with these is okay despite the os 
module being deprecated.

The one upside to this approach is that it would also allow us to change 
locale.getpreferredencoding() to utf-8 on Windows (to affect the default 
behaviour of open(..., 'r') ), which I don't believe is going to be 
possible without a new deprecation cycle. There is a strong argument 
that the following code should also round-trip regardless of platform:

>>> with open('list.txt', 'w') as f:
...     for i in os.listdir('.'):
...         print(i, file=f)
...
>>> with open('list.txt', 'r') as f:
...     files = list(f)
...

Currently, the default encoding for open() cannot represent all 
filenames that may be returned from listdir(). This may affect makefiles 
and configuration files that contain paths. Currently they will work 
correctly for paths that can be represented in the machine's active code 
page (though it should be noted that the *A APIs may be changed in a 
process by user/machine configuration to use the OEM code page rather 
than the active code page, which would potentially lead to encoding 
issues even for CP_ACP compatible names).
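
As a rough, platform-independent sketch of the problem: the helper below 
(a hypothetical name) writes a list of names to a text file and reads 
them back with an explicit encoding. cp1252 is assumed as a stand-in for 
the default that open() would pick on a Windows machine; the names are 
invented rather than taken from a real listdir() call.

```python
import os
import tempfile

def round_trip(names, encoding):
    """Write names one per line, then read them back with the same encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        list_path = os.path.join(tmp, 'list.txt')
        with open(list_path, 'w', encoding=encoding) as f:
            for n in names:
                print(n, file=f)
        with open(list_path, 'r', encoding=encoding) as f:
            return [line.rstrip('\n') for line in f]

names = ['plain.txt', 'test\uab00.txt']

# utf-8 can represent every name, so the list survives the round trip.
print(round_trip(names, 'utf-8') == names)

# A code page cannot: writing the second name fails outright.
try:
    round_trip(names, 'cp1252')
    code_page_ok = True
except UnicodeEncodeError:
    code_page_ok = False
print(code_page_ok)
```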

Possibly resolving both issues simultaneously is worth waiting for two 
more releases? I'm not convinced the change to getfilesystemencoding() 
needs to wait for getpreferredencoding() to also change, or that they 
necessarily need to match, but it would not be hugely surprising to see 
the changes bundled together.

I'll also note that there has been limited discussion about changing 
getpreferredencoding() so far, though there have been a number of "+1" 
votes alongside some "+1 with significant concerns" votes. Changing the 
default encoding of the contents of data files is pretty scary, so I'm 
not in any rush to force it in. On the other hand, changing the encoding 
for paths without changing the default encoding for text files may break 
"bytes in, bytes through, bytes out" for some files (especially 
makefiles and .ini files). Arguably this idea was already deprecated 
with Python 3's bytes/text separation anyway.

Acknowledgements
================

Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for 
their significant contributions and willingness to engage, and to 
everyone else on python-ideas for contributing to the discussion.