[Python-ideas] Fix default encodings on Windows

Wed Aug 17 09:37:32 EDT 2016

On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull
<turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:
> BTW, why "surrogate pairs"?  Does Windows validate surrogates to
> ensure they come in pairs, but not necessarily in the right order (or
> perhaps sometimes they resolve to non-characters such as U+1FFFF)?

A program can pass the filesystem a name containing one or more
surrogate codes that aren't part of a valid UTF-16 surrogate pair (i.e.
a leading code in the range D800-DBFF followed by a trailing code in
the range DC00-DFFF). In the user-mode runtime library and kernel
executive, nothing up to the filesystem driver checks for a valid
UTF-16 string. Microsoft's filesystems remain compatible with UCS-2
from the 90s and don't care that the name isn't legal UTF-16. The same
goes for the in-memory filesystems used for named pipes (NPFS,
\\.\pipe) and mailslots (MSFS, \\.\mailslot). But non-Microsoft
filesystems don't necessarily store names as wide-character strings.
They may use UTF-8, in which case an invalid UTF-16 name will cause
the system call to fail because it's an invalid parameter.
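
To make the pairing rule concrete, here is a minimal sketch of the
check that nothing between user mode and the filesystem driver
performs. The function name is_valid_utf16 is hypothetical; it takes a
sequence of 16-bit code units rather than a Python str.

```python
def is_valid_utf16(units):
    """Return True if every surrogate code unit is part of a valid pair.

    `units` is a sequence of 16-bit code units (ints in 0..0xFFFF).
    This is an illustrative helper, not an existing API.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A leading code must be followed by a trailing code.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A trailing code with no leading code before it.
            return False
        i += 1
    return True


paired = [0x0041, 0xD83D, 0xDE00]    # "A" plus a proper pair
lone_lead = [0x0041, 0xD83D]         # leading code with nothing after it
swapped = [0xDE00, 0xD83D]           # trailing code first
```

Here is_valid_utf16(paired) is true, while lone_lead and swapped both
fail the check, which is the kind of name a UCS-2-compatible filesystem
will still accept.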

If the filesystem allows creating such a badly named file or
directory, it can still be accessed using a regular Unicode path,
which is how things stand currently. I see that Victor has suggested
using "surrogatepass" in issue 27781. That would allow seamless
operation. The downside is that the resulting invalid UTF-8 bytes have a
higher chance of leaking out of Python than strings created by
"surrogateescape" on Unix. But since the name isn't a proper Unicode
string on disk, at least nothing has changed substantively by
transcoding to "surrogatepass" UTF-8.
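
As a rough illustration of the trade-off, the snippet below round-trips
a name with a lone surrogate through the "surrogatepass" error handler.
The name itself is made up for the example; the codec and error handler
names are the standard ones.

```python
# A made-up name containing a lone trailing surrogate (U+DC00).
name = "bad\udc00name"

# Strict UTF-8 refuses to encode the lone surrogate.
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_ok = False
else:
    strict_encode_ok = True

# "surrogatepass" encodes it as the 3-byte sequence ED B0 80.
encoded = name.encode("utf-8", "surrogatepass")

# Decoding with the same handler restores the original string.
decoded = encoded.decode("utf-8", "surrogatepass")

# The bytes are not valid UTF-8, so a strict decoder elsewhere rejects
# them. This is the leak risk once the bytes leave Python.
try:
    encoded.decode("utf-8")
except UnicodeDecodeError:
    strict_decode_ok = False
else:
    strict_decode_ok = True
```

The round trip is lossless, but any consumer that decodes the bytes as
strict UTF-8 will fail on them.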