[Python-ideas] Fix default encodings on Windows

Random832 random832 at fastmail.com
Wed Aug 10 16:09:05 EDT 2016


On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
> > Why? What's the use case? [byte paths]
> 
> Allowing library developers who support POSIX and Windows to just use 
> bytes everywhere to represent paths.

Okay, how is that use case impacted by it being mbcs instead of utf-8?

What about only issuing the deprecation warning if non-ASCII bytes are
present in the value?
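
Roughly what I have in mind, as a sketch (the helper name, message, and
stacklevel are placeholders, not a concrete proposal):

    import warnings

    def _warn_if_non_ascii(path_bytes):
        # Hypothetical helper: only warn when the bytes path actually
        # contains non-ASCII bytes, since pure-ASCII paths decode the
        # same way under mbcs and utf-8.
        if any(b > 0x7F for b in path_bytes):
            warnings.warn(
                "bytes paths containing non-ASCII bytes are deprecated "
                "on Windows",
                DeprecationWarning,
                stacklevel=3,
            )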

> > For reading, I assume. When opened for writing, it should probably be
> > utf-8-sig [if it's not mbcs] to match what Notepad does. What about
> > files opened for appending or updating? In theory it could ingest the
> > whole file to see if it's valid UTF-8, but that has a time cost.
> 
> Writing out the BOM automatically basically makes your files 
> incompatible with other platforms, which rarely expect a BOM.

Yes, but you're not running on other platforms; you're running on the
platform you're running on. If files need to be moved between platforms,
converting files with a BOM to files without one ought to be the
responsibility of the same tool that converts CRLF line endings to LF.

> By 
> omitting it but writing and reading UTF-8 we ensure that Python can 
> handle its own files on any platform, while potentially upsetting some 
> older applications on Windows or platforms that don't assume UTF-8 as a 
> default.

Okay, you haven't addressed updating and appending. I realized after
posting that updating should be in binary, but that leaves appending.
Should we detect BOMs and/or attempt to detect the encoding by other
means in those cases?
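
For appending, something like this sketch is what I'm picturing; the
fallback encoding is exactly the open question (the function name is made
up):

    import codecs

    def guess_append_encoding(path, fallback="utf-8"):
        # Hypothetical sketch: peek at the start of an existing file
        # before appending and pick the text encoding from any BOM.
        try:
            with open(path, "rb") as f:
                head = f.read(4)
        except FileNotFoundError:
            return fallback
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        return fallback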

> > Notepad, if there's no BOM, checks the first 256 bytes of the file for
> > whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
> > and can get it wrong for certain very short files [i.e. the infamous
> > "this app can break"]
> 
> Yeah, this is a pretty horrible idea :) 

Eh, maybe the UTF-16 part, because it can give some hilariously bad
results, but using a check like that to differentiate between utf-8 and
mbcs might not be so bad. But what do we do if all we see is ASCII?
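
By "might not be so bad" I mean something along these lines (just a
sniffing sketch, not a proposal for the exact rules):

    def sniff_encoding(sample: bytes):
        # Hypothetical sniff: a sample with non-ASCII bytes that decodes
        # as strict UTF-8 is almost certainly UTF-8; non-ASCII bytes that
        # don't decode suggest mbcs; all-ASCII is the ambiguous case.
        try:
            sample.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return "mbcs"
        if any(b > 0x7F for b in sample):
            return "utf-8"
        return None  # all ASCII: no way to tell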

> > What to do on opening a pipe or device? [Is os.fstat able to detect
> > these cases?]
> 
> We should be able to detect them, but why treat them any differently 
> from a file?

Eh, I was mainly concerned about what happens if the first few bytes
aren't a BOM, and about blocking while waiting for them. But if the check
is delayed until the first read, then it's fine.
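
For the detection itself, os.fstat plus the stat module should be enough,
something like:

    import os
    import stat

    def is_pipe_or_device(fd):
        # Sketch: distinguish pipes and character devices (e.g. CON)
        # from regular files, so any sniffing could be skipped for them.
        mode = os.fstat(fd).st_mode
        return stat.S_ISFIFO(mode) or stat.S_ISCHR(mode)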

> It probably also entails opening the file descriptor in bytes mode, 
> which might break programs that pass the fd directly to CRT functions. 
> Personally I wish they wouldn't, but it's too late to stop them now.

The only thing O_TEXT does differently from O_BINARY is convert CRLF line
endings (and maybe treat ^Z as EOF), and I don't think we even expose the
constants for the CRT's Unicode modes.
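
To illustrate what I mean about O_TEXT (Windows-only sketch, assuming the
CRT's usual text-mode translation; the file name is made up):

    import os

    def read_with_flag(path, flag):
        # Read raw bytes through the given CRT mode flag so the
        # difference in translation is visible.
        fd = os.open(path, os.O_RDONLY | flag)
        try:
            return os.read(fd, 1024)
        finally:
            os.close(fd)

    # with open("crlf.txt", "wb") as f:
    #     f.write(b"a\r\nb")
    # read_with_flag("crlf.txt", os.O_BINARY)  # -> b'a\r\nb'
    # read_with_flag("crlf.txt", os.O_TEXT)    # -> b'a\nb' (CRLF folded)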
