[Python-Dev] File system path encoding on Windows

Tue Aug 23 12:08:00 EDT 2016

I've trimmed fairly aggressively for the sake of not causing the rest of 
the list to mute our discussion (again :) ). Stephen - feel free to 
email me off list if I go too far or misrepresent you.

As a summary for people who don't want to read on (and Stephen will 
correct me if I misquote):

* we agree on removing use of the *A APIs within Python, which means 
Python will have to decode bytes before passing them to the operating system
* we agree on allowing users to switch the encoding between utf-8 and 
mbcs:replace (the current default)
* we agree on making utf-8 the default for 3.6.0b1 and closely 
monitoring the reaction
* Stephen sees "no reason not to change locale.getpreferredencoding()" 
(default encoding for open()) at the same time with the same switches, 
while I'm not quite as confident. Do users generally specify an encoding 
these days? I know I always put utf-8 there.

Does anyone else have concerns or questions?

On 22Aug2016 2121, Stephen J. Turnbull wrote:
> UTF-8 is absolutely not equivalent to UTF-16 from the point of view of
> developers. Passing it to Windows APIs requires decoding to UTF-16 (or
> from a Python developer's point of view, decoding to str and use of
> str APIs).  That fact is what got you started on this whole proposal!

As encoded bytes, that's true, but as far as correctly representing text 
goes, they are equivalent: both can encode every Unicode code point.
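
To illustrate that equivalence, here is a minimal sketch (the sample 
string and variable names are made up for illustration). The two 
encodings produce different bytes, but both decode back to the same text:

```python
# A sample string mixing Latin-1, CJK, and a non-BMP character.
text = "caf\u00e9 \u65e5\u672c\u8a9e \U0001F600"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# The byte sequences differ, which is Stephen's point ...
bytes_differ = utf8_bytes != utf16_bytes

# ... but both decode back to exactly the same str, which is mine.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == text
    and utf16_bytes.decode("utf-16-le") == text
)
```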

>  > All MSVC users have been pushed towards Unicode for many years.
>
> But that "push" is due to the use of UTF-16-based *W APIs and
> deprecation of ACP-based *A APIs, right?  The input to *W APIs must be
> decoded from all text/* content "out there", including UTF-8 content.
> I don't see evidence that users have been pushed toward *UTF-8* in that
> statement; they may be decoding from something else.  Unicode != UTF-8
> for our purposes!

Yes, the operating system pushes people towards *W APIs, and the 
languages commonly used on that operating system follow.

Windows has (for as long as it matters) always used UTF-16 for paths and 
bytes for content. Nowhere does the operating system tell you how to 
read your text file except as raw bytes, and content types are meant to 
provide the encoding information you need. Languages each determine how 
to read files in "text" mode, but that's not bound to or enforced by the 
operating system in any way.

>  > The .NET Framework has defaulted to UTF-8
>
> Default != enforce, though.  Do you know that almost nobody changes
> the default, and that behavior is fairly uniform across different
> classes of organization (specifically by language)?  Or did you mean
> "enforce"?

This will also not enforce anything that the operating system doesn't 
enforce. Windows uses Unicode to represent paths and requires them to be 
passed as UTF-16 encoded bytes. If you pass bytes to the *A APIs instead, 
it'll convert them for you using the active code page. My proposal is for 
Python to do the conversion instead.
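
As a rough sketch of what "Python does the conversion" means (this is 
not code from CPython; the helper name `to_wide` and its defaults are 
hypothetical, and the real implementation works with wchar_t strings in 
C rather than Python bytes):

```python
def to_wide(path, encoding="utf-8", errors="strict"):
    """Return UTF-16-LE bytes for a path given as str or bytes.

    Hypothetical helper: bytes are decoded with the chosen encoding
    first, so the operating system never sees the original bytes.
    """
    if isinstance(path, bytes):
        path = path.decode(encoding, errors)
    return path.encode("utf-16-le")


# The same file name, supplied as UTF-8 bytes and as str.
wide_from_bytes = to_wide(b"caf\xc3\xa9.txt")
wide_from_str = to_wide("caf\u00e9.txt")
```

Both calls produce the same UTF-16 data, so a bytes path and a str path 
would name the same file.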

(In .NET, users have to decode a byte array if they want to get a 
string. There aren't any APIs that take byte[] as if it were text, so 
it's basically the same separation between bytes/str that Python 3 
introduced, except without any allowance for bytes to still be used in 
places where text is needed.)

> To be clear: asking users who want backward-compatible behavior to set
> an environment variable does not count as a "screw" -- some will
> complain, but "the defaults always suck for somebody".  Reasonable
> people know that, and we can't do anything about the hysterics.

Good. Glad we agree on this.

> 1.  Organizations which behave like ".NET users" already have pure
>     UTF-8 environments.  They win from Python defaulting to UTF-8,
>     since Windows won't let them do it for themselves.  Now they can
>     plug in bytes-oriented code written for the POSIX environment
>     straight from upstream.
>
>     Is that correct?  Ie, without transcoding, they can't now use
>     bytes because their environment hands them UTF-8 but when Python
>     hands those bytes to Windows, it assumes anything else but UTF-8?

If you give Windows anything but UTF-16 as a path, it will convert to 
UTF-16. The change is to convert to UTF-16 ourselves, so Windows will 
never see the original bytes. To do that conversion, we need to know 
what encoding the incoming bytes are encoded with.
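
A small example of why the encoding has to be known, using cp1252 
purely as a stand-in for an ANSI code page (the actual code page depends 
on the machine, and the variable names are illustrative only):

```python
# UTF-8 bytes for a file name containing a non-ASCII character.
utf8_path = "caf\u00e9.txt".encode("utf-8")  # b'caf\xc3\xa9.txt'

# Decoding with the encoding the bytes were written in gives the name back.
right = utf8_path.decode("utf-8")

# Decoding the same bytes with a different encoding silently gives a
# different name: each byte of the two-byte sequence becomes its own
# character.
wrong = utf8_path.decode("cp1252")
```

Here `right` is the intended name, while `wrong` contains two 
characters where the original had one.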

Python users will either decode from bytes in encoding X to str, 
transcode from bytes in encoding X to bytes in UTF-8, or keep their 
bytes in UTF-8 if that's how they started.
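
Those three options look roughly like this, again with cp1252 standing 
in for "encoding X" (an assumption for the sake of the example):

```python
# Bytes that arrived in some legacy encoding X (cp1252 here).
legacy = "na\u00efve.txt".encode("cp1252")

# Option 1: decode to str and use the str APIs.
as_str = legacy.decode("cp1252")

# Option 2: transcode from encoding X to UTF-8 and keep using bytes.
as_utf8 = legacy.decode("cp1252").encode("utf-8")

# Option 3: the bytes were UTF-8 to begin with, so nothing to do.
already_utf8 = "na\u00efve.txt".encode("utf-8")
```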

(I feel like there's some other misunderstanding going on here, because 
I know you understand how encoding works, but I can't figure out what it 
is or what I need to say to trigger clarity. :( )

Windows does not support using UTF-8 encoded bytes as text. UTF-16 is 
the universal encoding. (Basically the only thing you can reliably do 
with UTF-8 bytes in the Windows API is convert them to UTF-16 - see the 
MultiByteToWideChar function. Everything else just treats it like a blob 
of meaningless data.)

> BTW, I wonder how those organizations manage to get pure UTF-8
> environments, given that Windows itself won't default to that.  Is it
> just that they live in .NET and other applications that default to
> producing UTF-8 text (in the rare(?) case that text is generated at
> all, vs some application/* medium), and so never get near applications
> that produce text in the active code page, and especially not near
> applications that embed file system names encoded in a non-UTF-8
> encoding in text/* media?

I doubt they're pure UTF-8, but they pay attention to what files are 
encoded with and explicitly decode into a common internal encoding.

> Overall, I think Nick's hybrid strategy is the way to go.  First, give
> users the choice of 'mbcs' or 'utf-8' for the Windows encoding.  I see
> no reason not to do this for locale.getpreferredencoding() at the same
> time, as long as it's an option.

The thing about this is that it's always been an option (the encoding 
argument to open() et al.), and specifically, an option that's needed 
on all platforms if you want predictable results. So I see one reason to 
not do it - users can (and do) override it in a cross-platform 
compatible way.
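
For example, a sketch of the cross-platform override (the file name and 
contents are invented for illustration):

```python
import locale
import os
import tempfile

# What open() would use if no encoding were given; varies by platform
# and configuration.
default_encoding = locale.getpreferredencoding(False)

text = "caf\u00e9"
with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, "example.txt")

    # Passing encoding= explicitly makes the result independent of the
    # platform default.
    with open(name, "w", encoding="utf-8") as f:
        f.write(text)
    with open(name, "r", encoding="utf-8") as f:
        read_back = f.read()

    # The bytes on disk are UTF-8 regardless of default_encoding.
    with open(name, "rb") as f:
        raw = f.read()
```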

The biggest difference from the file system encoding is that the 
encoding for file contents is entirely the business of the application 
(and whichever other applications it talks to), while the OS is the main 
recipient of file system encoded text and so it gets a say in the chosen 
encoding.

I'm happy for this to be on the table, but *I* need convincing 
that it's a good idea to do it now.

Cheers,
Steve