[Python-ideas] Fix default encodings on Windows

Steve Dower steve.dower at python.org
Wed Aug 10 15:22:20 EDT 2016


On 10Aug2016 1146, Random832 wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> To summarise the proposals (remembering that these would only affect
>> Python 3.6 on Windows):
>>
>> * change sys.getfilesystemencoding() to return 'utf-8'
>> * automatically decode byte paths assuming they are utf-8
>> * remove the deprecation warning on byte paths
>
> Why? What's the use case?

Allowing library developers who support POSIX and Windows to just use 
bytes everywhere to represent paths.
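
For example, something like this (a rough sketch of the kind of 
cross-platform code this enables; under the proposal the bytes paths 
would be decoded as UTF-8 on Windows instead of warning):

    import os

    # The same bytes-everywhere code works on POSIX and Windows:
    for name in os.listdir(b'.'):          # bytes in, bytes out
        path = os.path.join(b'.', name)
        if os.path.isfile(path):
            print(os.fsdecode(path))       # decode only for display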

>> * make the default open() encoding check for a BOM or else use utf-8
>> * [ALTERNATIVE] make the default open() encoding check for a BOM or else
>> use sys.getpreferredencoding()
>
> For reading, I assume. When opened for writing, it should probably be
> utf-8-sig [if it's not mbcs] to match what Notepad does. What about
> files opened for appending or updating? In theory it could ingest the
> whole file to see if it's valid UTF-8, but that has a time cost.

Writing out the BOM automatically makes your files effectively 
incompatible with other platforms, where tools rarely expect a BOM. 
By omitting it but still writing and reading UTF-8, we ensure that 
Python can handle its own files on any platform, at the cost of 
potentially upsetting some older applications on Windows, or on 
platforms that don't assume UTF-8 as a default.
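
A quick illustration of the difference (illustrative file names; 
'utf-8-sig' is the codec that writes and strips the BOM):

    with open('a.txt', 'w', encoding='utf-8') as f:
        f.write('hi')      # file contains b'hi' - no BOM
    with open('b.txt', 'w', encoding='utf-8-sig') as f:
        f.write('hi')      # file contains b'\xef\xbb\xbfhi'

    # Reading back with 'utf-8-sig' strips a BOM if present, so it
    # reads both files correctly; plain 'utf-8' would leave a stray
    # '\ufeff' at the start of the second one.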

> Notepad, if there's no BOM, checks the first 256 bytes of the file for
> whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
> and can get it wrong for certain very short files [i.e. the infamous
> "this app can break"]

Yeah, this is a pretty horrible idea :) I don't want to go there by 
default, but people can install chardet if they want the functionality.
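
Something like this, for anyone who wants the guessing (chardet is a 
third-party package, so pip install chardet first; 'mystery.txt' is 
just an illustrative name):

    import chardet

    with open('mystery.txt', 'rb') as f:
        raw = f.read(4096)        # a sample is usually enough
    guess = chardet.detect(raw)   # {'encoding': ..., 'confidence': ...}
    text = raw.decode(guess['encoding'] or 'utf-8')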

> What to do on opening a pipe or device? [Is os.fstat able to detect
> these cases?]

We should be able to detect them, but why treat them any differently 
from a file? Right now, if you aren't specifying 'b' or an encoding, 
they're just as broken as they will be after the change - probably 
more broken, since with UTF-8 as the default you'll at least get 
fewer encoding errors.
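
For what it's worth, the detection itself is easy enough (a sketch 
with a hypothetical helper name):

    import os
    import stat

    def is_pipe_or_device(fd):
        """Return True if fd refers to a pipe or character device."""
        mode = os.fstat(fd).st_mode
        return stat.S_ISFIFO(mode) or stat.S_ISCHR(mode)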

> Maybe the BOM detection phase should be deferred until the first read.
> What should encoding be at that point if this is done? Is there a
> "utf-any" encoding that can handle all five BOMs? If not, should there
> be? how are "utf-16" and "utf-32" files opened for appending or updating
> handled today?

Yes, I think it would be. I suspect we'd have to leave the encoding 
unknown until the first read, and perhaps force it to utf-8-sig if 
someone asks before we start. I don't *think* this is any less 
predictable than the current behaviour, given that it only applies 
when you haven't specified an encoding at all, but maybe it is.
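
Roughly what I have in mind for the detection, as a sketch with a 
hypothetical helper (the codecs module already has constants for all 
five BOMs, and the 'utf-16'/'utf-32'/'utf-8-sig' codecs each strip 
their own BOM on decode; UTF-32 has to be tested before UTF-16 
because the UTF-32-LE BOM starts with the UTF-16-LE one):

    import codecs

    _BOMS = [
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8,     'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]

    def sniff_encoding(prefix, default='utf-8'):
        """Pick an encoding from the first few bytes of a file."""
        for bom, name in _BOMS:
            if prefix.startswith(bom):
                return name
        return default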

It probably also entails opening the file descriptor in bytes mode, 
which might break programs that pass the fd directly to CRT functions. 
Personally I wish they wouldn't, but it's too late to stop them now.

>> * force the console encoding to UTF-8 on initialize and revert on
>> finalize
>
> Why not implement a true unicode console? What if sys.stdin/stdout are
> pipes (or non-console devices such as a serial port)?

Mostly because it's much more work. As I mentioned in my other post, 
an alternative would be to bring win_unicode_console into the stdlib 
and enable it by default. Considering the package was largely 
developed on bugs.python.org, that's probably okay, but we'd probably 
need to rewrite it in C, which is basically implementing a true 
Unicode console anyway.

You're right that changing the console encoding after launching 
Python is probably going to mess with pipes. We can detect whether 
the streams are interactive or not and adjust accordingly, but that's 
going to get messy if you're piping only one of stdin/stdout, since 
they'd end up with different encodings. I'll put some more thought 
into this part.
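
The detection part at least is straightforward (a sketch; the hard 
part is deciding what to do when only one side is interactive):

    import sys

    for name in ('stdin', 'stdout', 'stderr'):
        stream = getattr(sys, name)
        kind = 'console' if stream.isatty() else 'pipe/file'
        print(name, kind, stream.encoding)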

Thanks,
Steve


