[Python-Dev] Bytes path support

Sat Aug 23 15:41:22 CEST 2014

On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano <steve at pearwood.info> wrote:
> When I started this email, I originally began to say that the actual 
> problem was with byte file names that cannot be decoded into Unicode 
> using the system encoding (typically UTF-8 on Linux systems. But I've 
> actually had difficulty demonstrating that it actually is a problem. I 
> started with a byte sequence which is invalid UTF-8, namely:
> 
> b'ZZ\xdb\xdf\xfa\xff'
> 
> created a file with that name, and then tried listing it with 
> os.listdir. Even in Python 3.1 it worked fine. I was able to list the 
> directory and open the file, so I'm not entirely sure where the problem 
> lies exactly. Can somebody demonstrate the failure mode?

The "failure" happens only when you try to cross from the domain of
posix binary filenames into the domain of text streams (that is, a
stream with a consistent encoding).  If you stick with os interfaces
that handle filenames, Python3 handles posix bytes filenames just fine
(though there may be a few corner-case rough edges yet to be fixed, and
the standard streams was one of them).

The difficultly comes if you try to use a filename that contains
undecodable bytes in a non-os-interface text-context (such as writing it
to a text file that you have declared to be a utf-8 encoding): there you
will get an error...not completely unlike the old "your code works until
your user uses unicode" problem we had in python2, but in this case only
happening in a very narrow set of circumstances involving trying to
translate between one domain (posix binary filenames) and another domain
(io streams with a consistent declared encoding).  This is not a common
operation, but appears to be the one Oleg is concerned about. The old
unicode-blowup errors would happen almost any time someone with a
non-ascii language tried to use a program written by an ascii-only
programmer (which was most of us).

The same problem existed in python2 if your goal was to produce a stream
with a consistent encoding, but now python3 treats that as an error.  If
you really want a stream with an inconsistent encoding, open it as
binary and use the surrogate escape error handler to recover the bytes
in the filenames.  That is, *be explicit* about your intentions.

So yes, we've shifted a burden from those who want non-ascii text to
work consistently to those who wanted inconsistently encoded text to "just
work" (or rather *appear* to "just work").  The number of people who
benefit from the improved text model *greatly* outweighs the number of
people inconvenienced by the new strictness when the domain line (posix
binary filenames to consistently encoded text stream) are crossed.  And
the result is more *valid* programs, and fewer unexpected errors
overall, with no inconvenience unless that domain line is crossed,
and even then the inconvenience is limited to the open call that creates
the binary stream.

--David