[Python-Dev] Pathlib enhancements - acceptable inputs and outputs for __fspath__ and os.fspath()

Mon Apr 18 15:25:16 EDT 2016

I don't disagree with the basic analysis, but there are a number of
issues with motivational statements.

Koos Zevenhoven writes:

 > (B) "str-based only"
 > *Accept*: str, provided via __fspath__ as well as plain str.
 > *Return*: str.
 > *Audience*: relatively low-level code that works exclusively with str
 > paths but accepts specialized path objects as input.

Why "low-level"?  All code that stores paths persistently is likely to
store them in text files or database strings or the like, rather than
as Path (read: specialized path objects, not necessarily
pathlib.Path).  But if there is any low-level manipulation of the
paths to be done before storing, it would be done as Path.  Thus
high-level code might also want to accept Path transparently.

 > (C) "bytes-based only"
 > *Accept*: bytes, provided via __fspath__ as well as plain bytes.
 > *Return*: bytes.
 > *Audience*: low-level code that explicitly deals with paths as bytes
 > (probably to deal with undefined/ill-defined encodings).

No.  If the point were to deal with encoding issues, we wouldn't
accept this; PEP 383 eliminates that concern.  We accept bytes to
support people who represent paths as bytes because they think it's a
good idea and that encoding doesn't matter in their application.

 > (D) "coerce to str"
 > *Accept*: str and bytes, provided via __fspath__ as well as plain str
 > and bytes instances.
 > *Return*: str (coerced / decoded if needed).
 > *Audience*: code that deals explicitly with str but wants to 'try'
 > supporting bytes-based path inputs too via implicit decoding (even if
 > it may result in surrogate escapes, which one cannot for instance
 > print(...).)

No.  As Nick points out with respect to fsencode/fsdecode, it's not
a question of supporting known bytes via implicit decoding (that's
what __fspath__ does for the types that support it), but rather
of supporting ambiguity.  Best practice is to convert explicitly at
the boundary, because data of an unexpected type is too likely to be
simply the wrong data.

Printing surrogates can be done with errors='backslashreplace', and if
you're using fsdecode, you should probably use that, 'namereplace', or
'xmlcharrefreplace'.
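
To make the printing point concrete, here is a minimal sketch.  The
file name is made up, and it decodes with an explicit UTF-8 codec plus
surrogateescape (the PEP 383 handler) rather than calling os.fsdecode,
so that the result does not depend on the platform's filesystem
encoding.

```python
# Hypothetical file name containing a byte that is not valid UTF-8.
raw = b"caf\xe9.txt"

# PEP 383: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# A strict encoder would raise on the surrogate; backslashreplace
# turns it into a visible escape sequence instead.
printable = name.encode("ascii", errors="backslashreplace").decode("ascii")
print(printable)
```

The round trip back to the original bytes still works with
name.encode("utf-8", errors="surrogateescape"), so nothing is lost by
keeping the str form internally.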

 > (E) "coerce to bytes"
 > *Accept*: str and bytes, provided via __fspath__ as well as plain str
 > and bytes instances.
 > *Return*: bytes (coerced / encoded if needed).
 > *Audience*: low-level code that explicitly deals with bytes paths but
 > wants to accept str-based path inputs too via implicit encoding.

Again, it's a question of ambiguity, or perhaps sloppy programming
(e.g., using str literals for paths in a bytes-oriented program).

Use cases D and E are basically "guessing when faced with ambiguity",
and fsencode and fsdecode are code smells because (as Nick claims)
they almost always conceal a situation where you don't know whether
you've got bytes or str (and it's way too much work to find out by
tracing them back to where they came from).
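
A small sketch of what "concealing" means here, using made-up ASCII
names so the result is the same on any platform.  It only illustrates
the behaviour of the existing os.fsdecode and os.fsencode functions;
it is not a recommendation.

```python
import os

# Hypothetical inputs: the same name once as text, once as bytes.
as_text = "spam.txt"
as_bytes = b"spam.txt"

# fsdecode returns str for both, so the result no longer reveals
# which type the caller actually passed in.
decoded = [os.fsdecode(p) for p in (as_text, as_bytes)]

# fsencode does the same in the other direction.
encoded = [os.fsencode(p) for p in (as_text, as_bytes)]
```

After either call, a caller that passed the wrong type by mistake is
indistinguishable from one that passed the right type.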

 > It seems to me we now "all" agree that __fspath__ should allow
 > str+bytes polymorphism.

I don't agree that we *should* allow polymorphism, because (purity)
paths are in the text domain[1] and (practicality) I don't believe that
use of os.fspath will be restricted to "low-level boundary code".  I
would be perfectly happy telling bytes users that the idiom is not
"os.fspath(maybe_direntry, allow_types=(bytes,))", but rather
"os.fsencode(os.fspath(maybe_direntry))", so that code in the text
domain can safely use os.fspath(maybe_direntry) without worrying that
it will raise because maybe_direntry.__fspath__() returns bytes.
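
As a sketch of that idiom: TextEntry below is a hypothetical stand-in
for an object whose __fspath__ returns str, and the example assumes an
os.fspath() that takes a single argument, as in Python 3.6 and later.
The allow_types parameter quoted above belongs to the proposal under
discussion and is not used here.

```python
import os

class TextEntry:
    """Hypothetical object whose __fspath__ returns str."""

    def __init__(self, path):
        self._path = path

    def __fspath__(self):
        return self._path

entry = TextEntry("data/report.txt")

# Text-domain code stops here and works with str.
text_path = os.fspath(entry)

# Bytes-domain code converts explicitly, at the point where it needs bytes.
bytes_path = os.fsencode(os.fspath(entry))
```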

This would allow pathlib.Path to handle arguments providing __fspath__
transparently.  With the current proposal, it would need to rule out
bytes before invoking os.fspath, or handle the exception, or leave the
exception to its caller.  None of these options are pleasant.
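
For illustration, the first of those options might look roughly like
the sketch below.  BytesEntry and text_fspath are hypothetical names,
and the sketch assumes a polymorphic os.fspath() that passes a bytes
result from __fspath__ straight through, which is how os.fspath
behaves in Python 3.6 and later.

```python
import os

class BytesEntry:
    """Hypothetical object whose __fspath__ returns bytes."""

    def __fspath__(self):
        return b"data/report.txt"

def text_fspath(obj):
    """Hypothetical helper: like os.fspath, but refuses bytes results."""
    path = os.fspath(obj)
    if isinstance(path, bytes):
        raise TypeError("expected a str path, got bytes from "
                        + type(obj).__name__)
    return path
```

Every text-domain consumer of the protocol would need some version of
this check, which is the unpleasantness being described.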

Unfortunately, as Nick points out, defining __fspath__ to return str
is very unpleasant because bytes applications will now have to guard
*everything* that might provide __fspath__ with that incantation
before passing to open and other APIs that store the path on the
object returned.  So we don't really have a choice about polymorphism
if we want to support both __fspath__ and bytes paths.

 > After all, we want something that's *almost* exclusively str.

But we don't want that, AFAICT.  Some clearly want this API to be
unbiased against bytes in the same way the os APIs are unbiased[2],
because that's what we've got in the current proposal.  Further, due
to the existing ambiguity in fsencode and fsdecode, we're extending
the field of ambiguity where bytes and str can mix indiscriminately.

If we are serious about "*almost* exclusively str" we should accept
that "exclusively str" is a very good approximation and much easier to
use correctly, and regretfully postpone inclusion of DirEntry in this
protocol to the future.  But that's not on the table, is it?

Footnotes: 
[1]  Representation on disk as (basically unconstrained) byte
sequences is an historical accident.

[2]  That doesn't mean the bytes variants will be used as often as the
str variants, just that the bytes variants are as easy to use. 
