[Python-Dev] File system path encoding on Windows

Fri Aug 19 15:25:24 EDT 2016

#1 sounds like a great idea. I suppose surrogatepass solves approximately
the same problem as Rust's WTF-8, which is a way to round-trip bad UCS-2?
https://simonsapin.github.io/wtf-8/
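
A minimal sketch of the round-trip I have in mind, using only the standard
codecs. The sample name is made up for illustration: a lone surrogate is
rejected by strict utf-8 but survives a trip through surrogatepass.

```python
# A name containing a lone (unpaired) high surrogate, which Windows permits.
name = 'bad\ud800name'

# Strict utf-8 refuses to encode an unpaired surrogate.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass encodes it as a three-byte sequence and decodes it back.
raw = name.encode('utf-8', 'surrogatepass')
restored = raw.decode('utf-8', 'surrogatepass')
```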

#2 sounds like it would leave several problems, since mbcs is not the same
as a normal text encoding. IIUC it depends on the active code page, so if
your active code page is Russian you might not be able to encode Japanese
characters into MBCS.
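
To illustrate (a sketch only; cp1251 is my stand-in for a Russian active
code page, since the 'mbcs' codec itself exists only on Windows):

```python
# cp1251 is assumed here as an example of a Russian code page.
russian = '\u041f\u0440\u0438\u0432\u0435\u0442'   # "Privet" in Cyrillic
japanese = '\u65e5\u672c\u8a9e'                     # "Nihongo" in kanji

russian_bytes = russian.encode('cp1251')            # representable

try:
    japanese.encode('cp1251')                       # not representable
    japanese_ok = True
except UnicodeEncodeError:
    japanese_ok = False
```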

Solution #2a: Modify Windows so utf-8 is a valid value for the current MBCS
code page.

On Fri, Aug 19, 2016 at 3:01 PM Steve Dower <steve.dower at python.org> wrote:

> Hi python-dev
>
> About a week ago I proposed on python-ideas making some changes to how
> Python deals with encodings on Windows, specifically in relation to how
> Python interacts with the operating system.
>
> Changes to the console were uncontroversial, and I have posted patches
> at http://bugs.python.org/issue1602 and
> http://bugs.python.org/issue17620 to enable the full range of Unicode
> input to be used at interactive stdin/stdout.
>
> However, changes to sys.getfilesystemencoding(), which determines how
> the os module (and most filesystem functions in general) interpret bytes
> parameters, were more heatedly discussed. I've summarised the discussion
> in this email.
>
> I'll declare up front that my preferred change is to treat bytes as
> utf-8 in Python 3.6, and I've posted a patch to do that at
> http://bugs.python.org/issue27781. Hopefully I haven't been too biased
> in my presentation of the alternatives, but this is so you at least know
> which way I'm biased.
>
> I'm looking for some agreement on the answers to the questions I pose in
> the summary.
>
> There is much more detail about them presented after that, as there are
> a number of non-obvious issues at play here. I suspect this will
> eventually become a PEP, but it's presented here as a summary of a
> discussion and not a PEP.
>
> Cheers,
> Steve
>
> Summary
> =======
>
> Representing file system paths on Windows as bytes may result in data
> loss due to the way Windows encodes/decodes strings via its bytes API.
>
> We can mitigate this by only using Windows' Unicode API and doing our
> own encoding and decoding (i.e. within posixmodule.c's path converter).
> Invalid characters could cause encoding exceptions rather than data loss.
>
> We can go further to fix this by declaring the encoding of bytes paths
> on Windows must be utf-8, which would also prevent encoding exceptions,
> as utf-8 can fully represent all paths on Windows (natively utf-16-le).
>
> Even though using bytes for paths on Windows has been deprecated for
> three releases, this is not widely known and it may be too soon to
> change the behaviour.
>
> Questions:
> * should we always use Windows' Unicode APIs instead of switching
> between bytes/Unicode based on parameter type?
> * should we allow users to pass bytes and interpret them as utf-8 rather
> than letting Windows do the decoding?
> * should we do it in 3.6, 3.7 or 3.8?
>
> Background
> ==========
>
> File system paths are almost universally represented as text in some
> encoding determined by the file system. In Python, we expose these paths
> via a number of interfaces, such as the os and io modules. Paths may be
> passed either direction across these interfaces, that is, from the
> filesystem to the application (for example, os.listdir()), or from the
> application to the filesystem (for example, os.unlink()).
>
> When paths are passed between the filesystem and the application, they
> are either passed through as a bytes blob or converted to/from str using
> sys.getfilesystemencoding(). The result of encoding a string with
> sys.getfilesystemencoding() is a blob of bytes in the native format for
> the default file system.
>
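
For reference, the conversion described above is what os.fsencode() and
os.fsdecode() expose. A small sketch (the file name is arbitrary, and the
actual encoding reported depends on the platform):

```python
import os
import sys

# The encoding used to convert between str and bytes paths.
encoding = sys.getfilesystemencoding()

# str -> bytes and back, using that encoding and its error handler.
as_bytes = os.fsencode('example.txt')
as_text = os.fsdecode(as_bytes)
```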
> On Windows, the native format for the filesystem is utf-16-le. The
> recommended platform APIs for accessing the filesystem all accept and
> return text encoded in this format. However, prior to Windows NT (and
> possibly further back), the native format was a configurable machine
> option and a separate set of APIs existed to accept this format. The
> option (the "active code page") and these APIs (the "*A functions")
> still exist in recent versions of Windows for backwards compatibility,
> though new functionality often only has a utf-16-le API (the "*W
> functions").
>
> In Python, we recommend using str as the default format because (with
> the surrogateescape handling on POSIX), it can correctly round-trip all
> characters used in paths. On Windows this is strongly recommended
> because the legacy OS support for bytes cannot round-trip all characters
> used in paths. Our support for bytes explicitly uses the *A functions
> and hence the encoding for the bytes is "whatever the active code page
> is". Since the active code page cannot represent all Unicode characters,
> the conversion of a path into bytes can lose information without warning
> (and we can't get a warning from the OS here - more on this later).
>
> As a demonstration of this:
>
>  >>> open('test\uAB00.txt', 'wb').close()
>  >>> import glob
>  >>> glob.glob('test*')
> ['test\uab00.txt']
>  >>> glob.glob(b'test*')
> [b'test?.txt']
>
> The Unicode character in the second call to glob has been replaced by a
> '?', which means passing the path back into the filesystem will result
> in a FileNotFoundError (though ironically, passing it back into glob()
> will find the file again, since '?' is a single-character wildcard). You
> can observe the same results in os.listdir() or any function that
> matches the return type to the parameter type.
>
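
The same loss can be simulated on any platform. This is only a sketch:
cp1252 is assumed as the active code page, and the 'replace' handler
approximates what the *A functions do.

```python
import fnmatch

# cp1252 stands in for the active code page (an assumption for illustration).
name = 'test\uab00.txt'
lossy = name.encode('cp1252', 'replace')

# The lossy name no longer identifies the original file...
same_name = lossy.decode('cp1252') == name

# ...but used as a glob pattern, its '?' still matches the real name.
matches = fnmatch.fnmatchcase(name, lossy.decode('cp1252'))
```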
> Why is this a problem?
> ======================
>
> While the obvious and correct answer is to just use str everywhere, in
> general on POSIX systems there is no possibility of confusion when using
> bytes exclusively. Even if the encoding is "incorrect" by some standard,
> the file system can still map the bytes back to the file. Making use of
> this avoids the cost of decoding and reencoding, such that
> (theoretically, and only on POSIX), code like below is faster because of
> the use of `b'.'`:
>
>  >>> for f in os.listdir(b'.'):
> ...     os.stat(f)
> ...
>
> On Windows, if a filename exists that cannot be encoded with the active
> code page, you will receive an error from the above code. These errors
> are why in Python 3.3 the use of bytes paths on Windows was deprecated
> (listed in the What's New, but not clearly obvious in the documentation
> - more on this later). The above code produces multiple deprecation
> warnings in 3.3, 3.4 and 3.5 on Windows.
>
> However, we still keep seeing libraries use bytes paths, which can cause
> unexpected issues on Windows (well, all platforms, but less and less
> common on POSIX as systems move to utf-8 - Windows long ago decided to
> move to utf-16 for the same reason, but Python's bytes interface did not
> keep up). Given that the current approach of not-very-aggressively
> recommending that library developers either write their code twice (once
> for bytes and once for str) or use str exclusively is not working, we
> should consider alternative mitigations.
>
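
As an aside, the "use str exclusively" option can be fairly cheap for a
library. A sketch, where list_names is a hypothetical helper name and not
anything from the patch:

```python
import os

def list_names(directory='.'):
    """Return directory entries as str, whatever type was passed in."""
    # os.fsdecode() leaves str alone and decodes bytes with the
    # filesystem encoding, so the rest of the code only ever sees str.
    return os.listdir(os.fsdecode(directory))
```

Callers that pass bytes still work, but every name returned is a str.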
> Proposals
> =========
>
> There are two dimensions here - the fix and the timing. We can basically
> choose any fix and any timing.
>
> The main differences between the fixes are the balance between incorrect
> behaviour and backwards-incompatible behaviour. The main issue with
> respect to timing is whether or not we believe using bytes as paths on
> Windows was correctly deprecated in 3.3 and sufficiently advertised
> since to allow us to change the behaviour in 3.6.
>
> Fixes
> -----
>
> Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows
>
> Currently the default filesystem encoding is 'mbcs', which is a
> meta-encoder that uses the active code page. However, when bytes are
> passed to the filesystem they go through the *A APIs and the operating
> system handles encoding. In this case, paths are always encoded using
> the equivalent of 'mbcs:replace' - we have no ability to change this
> (though there is a user/machine configuration option to change the
> encoding from CP_ACP to CP_OEM, so it won't necessarily always match
> mbcs...)
>
> This proposal would remove all use of the *A APIs and only ever call the
> *W APIs. When Windows returns paths to Python as str, they will be
> decoded from utf-16-le and returned as text. When paths are to be
> returned as bytes, we would transcode from utf-16-le to utf-8 using
> surrogatepass (as Windows does not validate surrogate pairs, so it is
> possible to have invalid surrogates in filenames). Equally, when paths
> are provided as bytes, they are transcoded from utf-8 into utf-16-le and
> passed to the *W APIs.
>
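
My reading of that conversion, as a pure-Python sketch. Both helper names
are hypothetical; the real work would happen in C inside the path
converter, not through these functions.

```python
def wide_to_bytes_path(wide):
    """utf-16-le data from a *W API -> utf-8 bytes path (hypothetical)."""
    text = wide.decode('utf-16-le', 'surrogatepass')
    return text.encode('utf-8', 'surrogatepass')

def bytes_path_to_wide(path):
    """utf-8 bytes path -> utf-16-le data for a *W API (hypothetical)."""
    text = path.decode('utf-8', 'surrogatepass')
    return text.encode('utf-16-le', 'surrogatepass')

# 't', an unpaired surrogate (code unit D800), then '.', as utf-16-le.
wide_name = b't\x00\x00\xd8.\x00'
bytes_name = wide_to_bytes_path(wide_name)
```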
> The use of utf-8 will not be configurable, with the possible exception
> of a "legacy mode" environment variable or -X flag.
>
> surrogateescape does not apply here, as we are not concerned about
> keeping arbitrary bytes in the path. Any bytes path returned from the
> operating system will be valid; any bytes path created by the user may
> raise a decoding error (currently it would raise a file not found or
> similar OSError).
>
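
A sketch of the difference, with a made-up bytes path: surrogatepass only
helps with surrogates, so arbitrary invalid bytes still fail to decode,
whereas surrogateescape would carry them through.

```python
# Not valid utf-8: 0xff can never appear in a utf-8 sequence.
user_path = b'caf\xff.txt'

try:
    user_path.decode('utf-8', 'surrogatepass')
    decoded = True
except UnicodeDecodeError:
    decoded = False

# For contrast, surrogateescape (the POSIX handler) smuggles the byte
# through as a lone surrogate instead of failing.
escaped = user_path.decode('utf-8', 'surrogateescape')
```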
> The choice of utf-8 (as opposed to returning utf-16-le bytes) is to
> ensure the ability to round-trip, while also allowing basic manipulation
> of paths - essentially just slicing and concatenating at '\' characters.
> Applications doing this have to ensure that their encoding matches
> sys.getfilesystemencoding(), or just use str everywhere.
>
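
To make the slicing point concrete, here is a sketch with a made-up path.
U+5C5C is just a convenient CJK character whose utf-16-le bytes happen to
be 5C 5C, the byte value of an ASCII backslash.

```python
# A path whose file name starts with U+5C5C.
path = 'C:\\dir\\\u5c5c.txt'

utf8_parts = path.encode('utf-8').split(b'\\')
utf16_parts = path.encode('utf-16-le').split(b'\\')

# In utf-8 every piece is still decodable text; the utf-16-le pieces are
# cut in the middle of code units and inside the CJK character.
utf8_names = [p.decode('utf-8') for p in utf8_parts]
```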
> It is debated, but I believe this is not a backwards compatibility issue
> because:
> * byte paths in Python are specified as being encoded by
> sys.getfilesystemencoding()
> * byte paths on Windows have been deprecated for three versions
>
> Unfortunately, the deprecation is not explicitly called out anywhere in
> the docs apart from the What's New page, so there is an argument that it
> shouldn't be counted despite the warnings in the interpreter. However,
> this is more directly addressed in the discussion of timing below.
>
> Equally, sys.getfilesystemencoding() documents the specific return
> values for various platforms, as well as that it is part of the protocol
> for using bytes to represent filesystem strings.
>
> I believe both of these arguments are invalid, that the only code that
> will break as a result of this change is relying on deprecated
> functionality and incorrect encoding, and that the (probably noisy)
> breakage that will occur is less bad than the silent breakage that
> currently exists.
>
> As far as implementation goes, there is already a patch for this at
> http://bugs.python.org/issue27781. In short, we update the path
> converter to decode bytes (path->narrow) to Unicode (path->wide) and
> remove all the code that would call *A APIs. In my patch I've changed
> path->narrow to a flag that indicates whether to convert back to bytes
> on return, and also to prevent compilation of code that tries to use
> ->narrow as a string on Windows (maybe that will get too annoying for
> contributors? good discussion for the tracker IMHO).
>
>
> Fix #2: Do the mbcs decoding ourselves
>
> This is essentially the same as fix #1, but instead of changing to utf-8
> we keep mbcs as the encoding.
>
> This approach will allow us to utilise new functionality that is only
> available as *W APIs, and also lets us be more strict about
> encoding/decoding to bytes. For example, rather than silently replacing
> Unicode characters with '?', we could warn or fail the operation,
> potentially modifying that behaviour with an environment variable or flag.
>
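
Roughly what "warn or fail" could look like, as a sketch. encode_path is a
hypothetical helper, and cp1252 again stands in for the active code page.

```python
import warnings

def encode_path(path, codepage='cp1252', strict=True):
    """Encode a str path for a code page, failing or warning on data loss.

    Hypothetical helper; cp1252 stands in for the active code page.
    """
    try:
        return path.encode(codepage)
    except UnicodeEncodeError:
        if strict:
            raise
        # Non-strict mode: warn, then fall back to the lossy '?' behaviour.
        warnings.warn('path %r is not representable in %s' % (path, codepage))
        return path.encode(codepage, 'replace')
```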
> Compared to fix #1, this will enable some new functionality but will not
> fix any of the problems immediately. New runtime errors may cause some
> problems to be more obvious and lead to fixes, provided library
> maintainers are interested in supporting Windows and adding a separate
> code path to treat filesystem paths as strings.
>
> This is a middle-ground proposal. On the positive side, it significantly
> reduces the code we have to maintain in CPython (e.g. posixmodule.c), as
> we won't require separate code paths to call the *A APIs. However, it
> doesn't really improve things for users apart from giving more
> exceptions, which are likely unexpected (people probably handle OSError
> but not UnicodeDecodeError when accessing the file system).
>
>
> Fix #3: Make bytes paths on Windows an error
>
> By preventing the use of bytes paths on Windows completely we prevent
> users from hitting encoding issues. However, we do this at the expense
> of usability. Obviously the deprecation concerns also play a big role in
> whether this is feasible.
>
> I don't have numbers of libraries that will simply fail on Windows if
> this "fix" is made, but given I've already had people directly email me
> and tell me about their problems we can safely assume it's non-zero.
>
> I'm really not a fan of this fix, because it doesn't actually make
> things better in a practical way, despite being more "pure".
>
>
> Timing #1: Change it in 3.6
>
> This timing assumes that we believe the deprecation of using bytes for
> paths in Python 3.3 was sufficiently well advertised that we can freely
> make changes in 3.6. A typical deprecation cycle would be two versions
> before removal (though we also often leave things in forever when they
> aren't fundamentally broken), so we have passed that point and
> theoretically can remove or change the functionality without breaking it.
>
> In this case, we would announce in 3.6 that using bytes as paths on
> Windows is no longer deprecated, and that the encoding used is whatever
> is returned by sys.getfilesystemencoding().
>
>
> Timing #2: Change it in 3.7
>
> This timing assumes that the deprecation in 3.3 was valid, but
> acknowledges that it was not well publicised. For 3.6, we aggressively
> make it known that only strings should be used to represent paths on
> Windows and bytes are invalid and going to change in 3.7. (It has been
> suggested that I could use a keynote at PyCon to publicise this, and
> while I'd totally accept a keynote, I'd hate to subject a crowd to just
> this issue for an hour :) ).
>
> My concern with this approach is that there is no benefit to the change
> at all. If we aggressively publicise the fact that libraries that don't
> handle Unicode paths on Windows properly are using deprecated
> functionality and need to be fixed by 3.7 in order to avoid breaking
> (more precisely - continuing to be broken, but with a different error
> message), then we will alienate non-Windows developers further from the
> platform (net loss for the ecosystem) and convince some to switch to str
> everywhere (net gain for the ecosystem). It doesn't
>
> For those who listen and change to str, it removes the need to make any
> change in 3.7 at all, so we would really just be making noise about
> something that some people may not have noticed without necessarily
> going in and fixing anything. For those who don't listen, the change in
> 3.7 is going to break them just as much as if we made the change in 3.6.
>
>
> Timing #3: Change it in 3.8
>
> This timing assumes that the deprecation in 3.3 was not sufficient and
> we need to start a new deprecation cycle. This is strengthened by the
> fact that the deprecation announcement does not explicitly include the
> io module or the builtin open() function, and so some developers may
> believe that using bytes for paths with these is okay despite the os
> module being deprecated.
>
> The one upside to this approach is that it would also allow us to change
> locale.getpreferredencoding() to utf-8 on Windows (to affect the default
> behaviour of open(..., 'r') ), which I don't believe is going to be
> possible without a new deprecation cycle. There is a strong argument
> that the following code should also round-trip regardless of platform:
>
>  >>> with open('list.txt', 'w') as f:
> ...     for i in os.listdir('.'):
> ...         print(i, file=f)
> ...
>  >>> with open('list.txt', 'r') as f:
> ...     files = list(f)
> ...
>
> Currently, the default encoding for open() cannot represent all
> filenames that may be returned from listdir(). This may affect makefiles
> and configuration files that contain paths. Currently they will work
> correctly for paths that can be represented in the machine's active code
> page (though it should be noted that the *A APIs may be changed in a
> process by user/machine configuration to use the OEM code page rather
> than the active code page, which would potentially lead to encoding
> issues even for CP_ACP compatible names).
>
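
Until the default changes, the round-trip works today if the encoding is
passed explicitly. A sketch with made-up file names, written to a
temporary directory:

```python
import os
import tempfile

names = ['plain.txt', 'test\uab00.txt']

with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, 'list.txt')
    # An explicit encoding avoids depending on locale.getpreferredencoding().
    with open(list_path, 'w', encoding='utf-8') as f:
        for name in names:
            print(name, file=f)
    with open(list_path, 'r', encoding='utf-8') as f:
        files = [line.rstrip('\n') for line in f]
```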
> Possibly resolving both issues simultaneously is worth waiting for two
> more releases? I'm not convinced the change to getfilesystemencoding()
> needs to wait for getpreferredencoding() to also change, or that they
> necessarily need to match, but it would not be hugely surprising to see
> the changes bundled together.
>
> I'll also note that there has been limited discussion about changing
> getpreferredencoding() so far, though there have been a number of "+1"
> votes alongside some "+1 with significant concerns" votes. Changing the
> default encoding of the contents of data files is pretty scary, so I'm
> not in any rush to force it in. On the other hand, changing the encoding
> for paths without changing the default encoding for text files may break
> "bytes in, bytes through, bytes out" for some files (especially
> makefiles and .ini files). Arguably this idea was already deprecated
> with Python 3's bytes/text separation anyway.
>
> Acknowledgements
> ================
>
> Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for
> their significant contributions and willingness to engage, and to
> everyone else on python-ideas for contributing to the discussion.
>