[Python-Dev] Pathlib enhancements - acceptable inputs and outputs for __fspath__ and os.fspath()

Victor Stinner victor.stinner at gmail.com
Mon Apr 11 19:43:16 EDT 2016


On 11 Apr 2016 11:11 PM, "Ethan Furman" <ethan at stoneleaf.us> wrote:
> So my concern in such a case is what happens if we pass this SE string
> somewhere else: a UTF-8 file, or over a socket, or into a database? Does
> this have issues that we wouldn't face if we just used bytes?

"SE strings" (strings decoded with the surrogateescape error handler) are
already returned by os.listdir(str), os.walk(str), os.getenv(str),
sys.argv[int], ... since Python 3.3. Nothing new under the sun.

Trying to encode a surrogate to ASCII, Latin-1 or UTF-8 raises an encoding
error. A surrogate is created to store an undecodable byte in a filename.
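
A minimal sketch of that behaviour, using the "surrogateescape" error
handler directly rather than the filesystem encoding so the result does not
depend on the local configuration. The sample bytes and the helper name
try_encode are illustrative assumptions, not part of the original message.

```python
# A filename encoded as Latin-1; the byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")


def try_encode(text, encoding):
    """Return 'ok' if a strict encode succeeds, else the exception name."""
    try:
        text.encode(encoding)
        return "ok"
    except UnicodeEncodeError:
        return "UnicodeEncodeError"


# A strict encode to any of these codecs refuses the surrogate.
results = {enc: try_encode(name, enc) for enc in ("ascii", "latin-1", "utf-8")}

# Encoding with the same error handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")

print(repr(name), results, roundtrip)
```

The error only appears when the string leaves the "filename world" through a
strict codec; encoding with the same error handler is lossless.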

IMHO it's safer to get an encoding error than to get no error at all when
you concatenate two byte strings encoded with two different encodings
(mojibake).
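
To illustrate the silent failure mode, here is a small sketch that joins two
byte strings produced by different codecs. The sample character and variable
names are assumptions chosen for the example.

```python
# The same character encoded by two different codecs.
latin1_part = "é".encode("latin-1")  # b'\xe9'
utf8_part = "é".encode("utf-8")      # b'\xc3\xa9'

# Concatenating bytes never complains: bytes carry no encoding information.
mixed = latin1_part + utf8_part

# Decoding the result as Latin-1 "works" but yields mojibake.
as_latin1 = mixed.decode("latin-1")

# Decoding it as UTF-8 fails, but only later, far from the faulty concatenation.
try:
    mixed.decode("utf-8")
    utf8_outcome = "ok"
except UnicodeDecodeError:
    utf8_outcome = "UnicodeDecodeError"

print(mixed, repr(as_latin1), utf8_outcome)
```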

print(os.fspath(obj)) is more likely to do what you expect if os.fspath()
always returns str. I mean that print() will encode your filename to the
encoding of the terminal, which can differ from the filesystem encoding.

If fspath() can return bytes, you should write
print(os.fsdecode(os.fspath(obj))).
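
A sketch of that pattern, assuming Python 3.6 or later, where os.fspath()
exists and accepts an __fspath__ that returns either str or bytes. The
BytesPath class and the display_name helper are hypothetical names used only
for illustration.

```python
import os


class BytesPath:
    """Hypothetical path-like object whose __fspath__ returns bytes."""

    def __init__(self, raw):
        self._raw = raw

    def __fspath__(self):
        return self._raw


def display_name(obj):
    """Return a str suitable for print(), whatever os.fspath() gives back."""
    path = os.fspath(obj)      # may be str or bytes
    return os.fsdecode(path)   # bytes are decoded, str is returned unchanged


print(display_name(BytesPath(b"data.txt")))
print(display_name("data.txt"))
```

Without the os.fsdecode() call, printing the bytes result would show the
b'...' repr instead of the filename.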

--

On Linux, open(DirEntry) for a bytes entry (os.scandir(bytes)) would have
to first decode the bytes filename with os.fsdecode() and then encode it
back with os.fsencode().
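
The decode-then-encode round trip can be sketched as follows. This assumes a
POSIX system, where the filesystem codec uses the surrogateescape error
handler; the roundtrip helper and the sample name are illustrative only.

```python
import os


def roundtrip(raw):
    """Decode a bytes filename to str, then encode it back to bytes."""
    text = os.fsdecode(raw)   # undecodable bytes become lone surrogates
    return os.fsencode(text)  # surrogates are turned back into the same bytes


raw_name = b"caf\xe9.txt"  # not valid UTF-8
print(roundtrip(raw_name) == raw_name)
```

On POSIX the round trip is lossless, so the cost is only the two codec
passes, not any loss of information.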

Yeah, that's inefficient. But we now have super fast codecs (e.g. encoding
and decoding are almost a memcpy for pure ASCII). And filenames are usually
very short (less than 300 bytes). IMHO the interface matters more than
performance.

As I showed with my print example, filenames are not only used to access
the filesystem, you also want to display them. Using Unicode avoids bad
surprises (mojibake).

--

Well, the question is more why you want to get bytes in the first place.
Why not use only Unicode?

I understood that some people expect mojibake when using Unicode, whereas
using bytes cannot lead to mojibake. Well, in practice it's simply the
opposite :-)

Maybe devs read that Linux syscalls and C functions take bytes, so using
bytes gives access to any filename, including "invalid filenames". That's
true. But it's also true for Unicode if you use os.fsdecode().
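
As a rough demonstration, the sketch below creates a file whose name is not
valid UTF-8 and then finds and opens it through the str API only. It assumes
a POSIX system and a filesystem that accepts arbitrary bytes in names (some
filesystems reject them); list_and_read is a hypothetical helper.

```python
import os
import tempfile


def list_and_read(raw_name, payload):
    """Create a file with a bytes name, then read it back using str paths."""
    with tempfile.TemporaryDirectory() as tmp:
        # Create the file through the bytes API.
        with open(os.path.join(os.fsencode(tmp), raw_name), "wb") as f:
            f.write(payload)

        # List the directory with a str path: the names come back as str,
        # with undecodable bytes stored as surrogates.
        (name,) = os.listdir(tmp)

        # Open the file again using only the str name.
        with open(os.path.join(tmp, name), "rb") as f:
            return f.read()


print(list_and_read(b"caf\xe9.txt", b"hello"))
```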

Maybe devs don't understand Unicode, don't know it, and fear it :-)

My goal is more to educate users and help them to avoid mojibake.

Did I mention that you must not use bytes filenames on Windows? So using
Unicode everywhere helps to write really portable code. On Windows, using
Unicode is required to be able to open any file.

Victor