[Python-Dev] Windows: Remove support of bytes filenames in the os module?

Wed Feb 10 03:00:17 EST 2016

Executive summary:

Code pages and POSIX locales aren't solutions, they're the Original Sin.

Steve Dower writes:
 > On 09Feb2016 2017, Stephen J. Turnbull wrote:

 > >   > The problem here is the protocol that Python uses to return
 > >   > bytes paths, and that protocol is inconsistent between APIs
 > >   > and information is lost.
 > >
 > > No, the problem is that the necessary information simply isn't always
 > > available.
 > 
 > But if we return bytes paths and the user passes them back in unchanged, 
 > that should be irrelevant.

Yes.  That's pretty much exactly the semantics of using the latin-1
codec.  UTF-8 can't do that without surrogateescape, which Python 2 lacks.
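
To make the round-trip claim concrete, here is a minimal Python 3
sketch (the file name is made up, and none of this is how os itself
does it): latin-1 maps every byte to the code point of the same value,
so decode/encode is lossless, while strict UTF-8 rejects the bytes and
only round-trips with the surrogateescape handler.

```python
# A hypothetical file name that is not valid UTF-8.
raw = b"caf\xe9-\xff.txt"

# latin-1: every byte 0x00-0xFF decodes to the code point of the same
# value, so any byte string survives decode followed by encode.
via_latin1 = raw.decode("latin-1")
roundtrip_latin1 = via_latin1.encode("latin-1")

# Strict UTF-8 refuses the same bytes outright.
try:
    raw.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# surrogateescape (Python 3 only) carries each undecodable byte as a
# lone surrogate in U+DC80..U+DCFF and restores it on encode.
via_utf8 = raw.decode("utf-8", "surrogateescape")
roundtrip_utf8 = via_utf8.encode("utf-8", "surrogateescape")
```

Both round trips give back the original bytes; the difference is only
in what the intermediate str looks like.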

 > The earlier issue was that that doesn't work (e.g. a bytes path
 > from os.scandir couldn't be passed back into open()).

My purely-from-the-user-side take is that that's just a bug in
os.scandir that should be fixed.  The complexity that occasions such
bugs is an undesirable aspect of Python (v2) programming, but that
complexity is not itself a bug, because it *can't* be fixed -- you have
to fix the world, not Python.  Or switch to Python 3.

I don't know enough to have an opinion on whether "fixing" os.scandir
could cause other problems.

 > I meant with Python's calls into the API. Anywhere Python does the 
 > conversion from bytes to LPCWSTR (the UTF-16 type) there's a chance 
 > it'll be wrong.

Indeed.  That's why converting the bytes is often the wrong thing to
do *period*.  The reasons that Python might be wrong apply to every
agent that might decide the conversion -- except the user; the user is
never wrong about these things.
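
A small Python 3 illustration of why any agent other than the user is
guessing (the sample text and the choice of code pages are mine, not
anything Windows or Python does internally): the same bytes decode
without error under more than one code page, and nothing in the bytes
says which reading was intended.

```python
# Bytes as a Japanese user's tools would have written them (cp932).
intended = "テスト"
raw = intended.encode("cp932")

# The reading the user meant.
as_cp932 = raw.decode("cp932")

# The reading an agent assuming a Western European code page gets.
# No exception is raised; the result is simply different text.
as_cp1252 = raw.decode("cp1252")

# Converting back with the wrong guess still reproduces the bytes, so
# the mistake is invisible until a human looks at the screen.
back_from_wrong_guess = as_cp1252.encode("cp1252")
```

Note that the wrong decoding is reversible here only because cp1252
happens to define every byte involved; that is luck, not a guarantee.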

 > Microsoft's solution here is the user's active code page, much like 
 > *nix's solution as I understand it, except that where *nix will convert 
 > _to_ the encoding as a normalized form, Windows will convert _from_ the 
 > encoding to its UTF-16 "normalized" form.

Not quite accurate.  Unix by original design doesn't *have* a
normalized form.[1] Bytez-iz-bytez-R-Us, that's Unix.  Recently
everybody (except for a few nationalist lunatics and the unteachables
in some legislatures) has learned that some form of Unicode is the way
to go internally.  But that's "best practice", not POSIX requirement,
and tons of software continues to operate[2] based on the assumption
that users are monolingual with a canonical one-byte encoding, so it
doesn't matter as long as *no conversion is ever done*, and the input
methods and fonts are consistent with each other.  Code pages just try
to *enforce* that constraint (and as I already mentioned, that pissed
me off so much in 1990 that I'm still a Windows refusenik today).

 > Back-compat concerns have prevented any significant changes being
 > made here, otherwise there wouldn't be a 'bytes' interface at
 > all.

It's not just back-compat.  A bytes interface is absolutely necessary
in a code-page-based world, because you just can't be sure what
encoding your content is in until the user tells you that the crap
you've spewed on her screen might be Klingon, but it's not any of the
7 human languages she knows.
"Toto!  I don't think we're in Kansas any more...."  The fact is that
code-page-based content continues to be produced in significant
quantities, despite the universal availability and absolute
superiority (except for workstation reconfiguration costs) of Unicode.

Footnotes: 
[1]  The POSIX locale selects encodings for console input and output.
File I/O is just bytes, both the content and the file name.  The code
page also defines the file name encoding as I understand it.

[2]  I would hope that nobody is *writing* software like that any
more, but I live in Japan.  That hope is years in the future for me.