[Python-Dev] pathlib - current status of discussions

Tue Apr 12 11:52:43 EDT 2016

Nick Coghlan writes:

 > One possible way to address this concern would be to have the
 > underlying protocol be bytes/str (since boundary code frequently
 > needs to handle the paths-are-bytes assumption in POSIX),

What "needs"?  As has been pointed out several times, with PEP 383 you
can deal with bytes losslessly by using an arbitrary codec and
errors=surrogateescape.  I know why *I* use bytes nevertheless:
because when I must guess the encoding, it just makes more sense to
read bytes and then iterate over codecs until the result looks like
words I know in some language.
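
A minimal sketch of both halves of that paragraph: the PEP 383 round trip, and the guess-the-codec loop. The helper name guess_decode and its codec list are hypothetical, and a strict decode stands in for the "looks like words I know" check, which really needs a human or a dictionary.

```python
# A file name as raw bytes: "café.txt" encoded as Latin-1, so not valid UTF-8.
raw = b"caf\xe9.txt"

# PEP 383: surrogateescape maps each undecodable byte to a lone surrogate
# instead of raising, so decoding with the "wrong" codec still succeeds.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")


def guess_decode(data, codecs=("utf-8", "latin-1")):
    """Hypothetical helper: try each codec strictly, return (codec, text).

    Returns (None, None) if no codec in the list decodes the data.
    """
    for codec in codecs:
        try:
            return codec, data.decode(codec)
        except UnicodeDecodeError:
            continue
    return None, None


codec, text = guess_decode(raw)
```

The round trip is lossless even though the intermediate str contains a lone surrogate; the guessing loop is only worth the trouble when the text is going to be shown to a human.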

I don't understand why people who mostly believe "bytes are text, too"
because almost all they ever see are bytes in the range 0x00-0x7f need
bytes.  For them, fsdecode and fsencode DTRT.
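
For the all-ASCII case described above, a short sketch of what fsdecode and fsencode do (the file name is made up for illustration):

```python
import os

ascii_name = b"README.txt"

# bytes in, str out, using the filesystem encoding.
as_text = os.fsdecode(ascii_name)

# str in, bytes out: the inverse operation.
as_bytes = os.fsencode(as_text)

# A str argument to fsdecode is passed through unchanged.
already_text = os.fsdecode("README.txt")
```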

If you want to claim "efficiency", I can't gainsay since I don't know
the applications, but if you're trying to manipulate file names
millions of times per second, I have to wonder what you're doing with
them that benefits so much from Path.

 > but offer an "os.fspathname" API that rejected bytes output from
 > os.fspath.

Either it's a YAGNI because I'm not going to get any bytes in the
first place, or it raises where I probably could have done something
useful with bytes if I were expecting them (see "pathological" below).
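
For concreteness, a sketch of what the proposed function might look like. This is an assumption throughout: fspathname is hypothetical, and it is written against os.fspath as it later shipped in Python 3.6, which may differ from what was under discussion here.

```python
import os
import pathlib


def fspathname(path):
    """Hypothetical str-only variant of os.fspath: reject bytes results."""
    result = os.fspath(path)
    if isinstance(result, bytes):
        raise TypeError("expected a str path, got bytes: %r" % (result,))
    return result


from_str = fspathname("spam/eggs")
from_path = fspathname(pathlib.PurePosixPath("spam/eggs"))

# bytes input is where the function raises.
try:
    fspathname(b"spam/eggs")
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
```

The last case is the one objected to above: the caller gets an exception at exactly the point where os.fsdecode could have produced something usable.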

 > That way folks that wanted the clean "must be str" signature

Er, I don't need no steenkin' "clean signature".  I need str, and if
I can't get it from __fspath__, there's always os.fsdecode.  But this
is serious cart-before-horse-putting, punishing those who do things
Python-3-ishly right.

 > The ambiguity in question here is inherent in the differences between
 > the way POSIX and Windows work,

Not with PEP 383, it's not.  And I don't do Windows, so my preference
for str has nothing to do with it mapping to native OS APIs well.

The ambiguity in question here is inherent in the differences between
the ways Python 2 and Python 3 programmers work on POSIX AFAICS.
Certainly, there will be times when fsdecode doesn't DTRT.  At those
times you have to use an explicit bytes.decode.  Note that when you
*do* care enough to do that, it's because the Path is *text* -- you're
going to display it to a human, or pass it out of the module.  If all
you're going to do is access the filesystem object denoted, fsdecode
does a sufficiently accurate job.

So if for some reason you're getting bytes at the boundary, I see no
reason why you can't have a convenience constructor

import os
import pathlib

def pathological(str_or_bytes_or_path_seq):
    # Decode bytes components; pass str and Path components through as-is.
    args = []
    for s_o_b in str_or_bytes_or_path_seq:
        args.append(os.fsdecode(s_o_b) if isinstance(s_o_b, bytes) else s_o_b)
    return pathlib.Path(*args)

for when that's good enough (maybe Antoine would even allow it into
pathlib?).

 > so there are limits to how far we can go in hiding it without
 > making things worse rather than better.

What "hide"?  Nobody is suggesting that the polymorphic os APIs should
go away.  Indeed, they are perfect TOOWTDI, giving the programmer
exactly the flexibility needed *and no more*, *at* the boundary.

The questions on my mind are:

(A) Why does anybody need bytes out of a pathlib.Path (or other
    __fspath__-toting, higher-level API) *inside* the boundary?  Note
    that the APIs in os (etc) *don't need* bytes because they are
    already polymorphic.

(B) If they do, why can't they just apply bytes() to the object?  I
    understand that that would offend Ethan's aesthetic sense, so it's
    worth looking for a nice way around it.  But allowing __fspath__
    to return bytes or str is hideous, because Paths are clearly on
    the application side of the boundary.
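
Both points can be illustrated with the stdlib as it stands; this is only a sketch, and the path names are made up. The os.path functions return the type they were given, and a pure path already supports bytes().

```python
import os
import pathlib

# (A) os.path is polymorphic: str in, str out; bytes in, bytes out.
str_base = os.path.basename("dir/file.txt")
bytes_base = os.path.basename(b"dir/file.txt")

# Mixing the two types in one call is rejected.
try:
    os.path.join("dir", b"file.txt")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# (B) bytes() on a path object goes through os.fsencode.
as_bytes = bytes(pathlib.PurePosixPath("dir/file.txt"))
```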

Note that bytes() may not have the serious problem that str() does of
being too catholic about its argument: nothing in __builtins__ has a
__bytes__!  Of course there are a few things that do work: ints, and
sequences of ints.
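
A quick sketch of that difference; the helper accepts and the sample values are only for illustration.

```python
# The few things bytes() does accept: an int (that many zero bytes)
# and a sequence of ints in range(256).
zeros = bytes(3)
from_ints = bytes([80, 69, 80])


def accepts(constructor, value):
    """Illustrative helper: True if constructor(value) does not raise TypeError."""
    try:
        constructor(value)
        return True
    except TypeError:
        return False


samples = [None, 3.5, object(), "text"]

# str() happily converts every sample; bytes() refuses all of them
# (a str needs an explicit encoding argument).
str_results = [accepts(str, v) for v in samples]
bytes_results = [accepts(bytes, v) for v in samples]
```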