[Python-ideas] Fix default encodings on Windows

Sat Aug 20 04:18:18 EDT 2016

Chris Barker writes:

 > Sure -- but it's entirely unnecessary, yes? If you don't change
 > your code, you'll get py2(bytes) strings as paths in py2, and py3
 > (Unicode) strings as paths on py3. So different, yes. But wouldn't
 > it all work?

The difference is that if you happen to have a file name on Unix that
is *not* encoded in the default locale, bytes Just Works, while
Something Bad happens with unicode (mixing Python 3 and Python 2
terminology for clarity).  Also, in Python the C/POSIX default locale
implied a codec of 'ascii' which is quite risky nowadays, so using
unicode meant always being conscious of encodings.
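
To make that concrete, here is a minimal sketch. It assumes a hypothetical file name whose bytes are Latin-1 rather than UTF-8; the name and the try_decode helper are illustrative only, and explicit codecs stand in for whatever the locale would have selected.

```python
# Hypothetical file name as raw bytes: "café.txt" encoded in Latin-1,
# which is neither valid UTF-8 nor valid ASCII.
raw_name = b"caf\xe9.txt"


def try_decode(data, codec):
    """Return the decoded text, or the exception a strict decode raises."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return exc


# Bytes pass through untouched: no codec is ever consulted.
passed_through = bytes(raw_name)

# A strict decode to text fails under both a UTF-8 locale and a
# C/POSIX ('ascii') locale.
utf8_result = try_decode(raw_name, "utf-8")
ascii_result = try_decode(raw_name, "ascii")

print(passed_through)
print(type(utf8_result).__name__)
print(type(ascii_result).__name__)
```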

 > So folks are making an active choice to change their code to get some
 > perceived (real?) performance benefit???

No, they're making a passive choice to not fix whut ain't broke nohow,
but in Python 3 is spelled differently.  It's the same order of
change as "print stuff" (Python 2) to "print(stuff)" (Python 3),
except that it's not as automatic.  (Ie, where print is *always* a
function call in Python 3, often in a Python 2 -> 3 port you're better
off with str than bytes, especially before PEP 461 "% formatting for
bytes".)
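
For illustration, a small sketch of what PEP 461 restored in Python 3.5 and later. The header values are made up, and this only shows the spelling, not a recommendation either way.

```python
# Requires Python 3.5+ (PEP 461): %-interpolation directly on bytes.
header = b"Content-Length: %d\r\n" % 42
status = b"%s %d %s\r\n" % (b"HTTP/1.1", 200, b"OK")

# The str spelling of the same operation, for comparison.
text_header = "Content-Length: %d\r\n" % 42

print(header)
print(status)
```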

 > However, as I understand it, py3 string paths did NOT "just work"
 > in place of py2 paths before surrogate pairs were introduced (when
 > was that?)

I'm not sure what you're referring to.  Python 2 unicode and Python 3
str have been capable of representing (for values of "representing"
that require appropriate choice of I/O codecs) the entire repertoire
of Unicode since version 1.6 [sic!].  I suppose you mean PEP 383
(implemented in Python 3.1), which added a pseudo-encoding for
undecodable bytes, ie, the surrogateescape error handler.
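
A minimal sketch of what that handler does, assuming a hypothetical Latin-1 file name and using an explicit 'utf-8' codec in place of the filesystem encoding:

```python
raw_name = b"caf\xe9.txt"  # hypothetical non-UTF-8 file name

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9
# instead of raising UnicodeDecodeError.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")

# Without the handler, the escaped string cannot be encoded at all.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text_name))
print(round_tripped == raw_name)
```

The result is a str that can be passed around as text, but it is not conformant Unicode until the escaped bytes are dealt with.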

This was never a major consideration in practice, however, as you
could always get basically the same effect with the 'latin-1' codec.
That is, the surrogateescape handler is primarily of benefit to those
who are already convinced that fully conformant Unicode is the way to
go.  It doesn't make a difference to those who prefer bytes.
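
A sketch of the 'latin-1' trick, for comparison. The UTF-8 file name is hypothetical, and the second half is my own illustration of the usual caveat rather than anything specific to this thread.

```python
every_byte = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding never fails...
as_text = every_byte.decode("latin-1")
# ...and encoding gives back exactly the same bytes.
assert as_text.encode("latin-1") == every_byte

# The catch: the text is only "right" if the data really was Latin-1.
utf8_name = "caf\u00e9.txt".encode("utf-8")  # hypothetical UTF-8 name
mojibake = utf8_name.decode("latin-1")

print(ascii(mojibake))
print(mojibake.encode("latin-1") == utf8_name)
```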

 > What I'm getting at is whether there is anything other than inertia
 > that keeps folks using bytes paths in py3 code? Maybe it wouldn't
 > be THAT hard to get folks to make the switch: it's EASIER to port
 > your code to py3 this way!

It's not.  First, encoding awareness is real work.  If you try to
DTRT, you open yourself up to UnicodeErrors anywhere in your code
where there's a Python/rest-of-world boundary.  If you just use bytes,
you may be producing garbage, but your program doesn't stop running,
and you can always argue it's either your upstream's or your
downstream's fault.  I *personally* have always found the work to be
worthwhile, as my work always involves "real" text processing, and
frequently not in pure ASCII.

Second, there are a lot of low-level use cases where (1) efficiency
matters and (2) all the processing actually done involves switching on
byte values in the range 32-126.  It makes sense to do that work on
bytes, wouldn't you say?<wink/>  And to make the switch cases easier
to read, it's common practice to form (or contort) those bytes into
human words.

These cases include a lot of the familiar acronyms: SMTP, HTTP, DNS,
VCS, VM (as in "bytecode interpreter"), ... and the projects are
familiar: Twisted, Mercurial, ....
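
As a rough sketch of that style, here is a bytes-only parse of an HTTP-like request line. Both functions are hypothetical helpers written for this message, not code from any of the projects named above, and this is nowhere near a complete parser.

```python
def parse_request_line(line):
    """Split a bytes request line into (method, target, version)."""
    # Iterating over bytes yields ints in Python 3, so this checks each
    # byte value against the printable ASCII range 32-126.
    if any(not 32 <= b <= 126 for b in line):
        raise ValueError("non-printable or non-ASCII byte in request line")
    method, target, version = line.split(b" ")
    return method, target, version


def is_safe_method(method):
    # The "switch" compares bytes that happen to spell English words.
    return method in (b"GET", b"HEAD", b"OPTIONS")


method, target, version = parse_request_line(b"GET /index.html HTTP/1.1")
print(method, target, version, is_safe_method(method))
```

Nothing here ever decodes: the "words" are just convenient labels for byte sequences.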

Bottom line: I'm with you!  I think that "filenames are text" *should*
be the default mode for Python programmers.  But there are important
use cases where it's sometimes more work to make that work than to
make bytes work (on POSIX), and typically those cases also inherit
largish, battle-tested code bases that assume a "bytes in, bytes
through, bytes out" model.  We can't deprecate "filenames as bytes" on
POSIX yet, and if we want to encourage participation in projects that
use that model by Windows-based programmers, we can't deprecate it
completely on Windows, either.