[Python-Dev] file system path protocol PEP

Sat May 14 02:56:30 EDT 2016

Chris Angelico writes:

 > AFAICT, the compatibility layer would simply decode the bytes using
 > surrogateescape handling, which should round-trip anything.

By design.  See PEP 383.  Or rather, the OP should; he has not done
his homework and is confused by his own FUD.  This whole subthread is
really python-list territory.

Since a lot of people I respect seem uncertain about the facts, for
the record, let's lay out the (putative) issues remaining for
post-PEP-383 Python vs. str-y path objects.

(0) "Can't work with some POSIX (bytes) paths" is closed by PEP
    383, forget it.  os.fsdecode(bytespath) as soon as you get one,
    os.fsencode(strpath) just before you need one, done.  Surrogates
    embedded in strpath may need special handling depending on the
    application (see (1)).
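
For the record, a minimal sketch of the round trip in (0).  It assumes
a POSIX system, where os.fsdecode() and os.fsencode() use the
surrogateescape error handler; the file name is made up for
illustration.

```python
import os

# A made-up POSIX file name containing a byte (0xE9) that is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decode as soon as you get the bytes path.  On POSIX, undecodable bytes
# are smuggled through as lone surrogates instead of raising.
text = os.fsdecode(raw)

# Encode just before you need bytes again; the original bytes come back.
restored = os.fsencode(text)
print(restored == raw)
```

The str in between is safe to pass to open() and friends, which do the
same encoding internally.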

(1) str.encode(errors='strict') (the default) will blow up on embedded
    surrogates.  Yes, but that's a *good* thing if you're mixing str
    derived from filesystem paths with other text.  There's no way to
    avoid it.  If you're just passing it back to open(), it Just
    Works, done.
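
A small sketch of what (1) looks like in practice.  It decodes with an
explicit UTF-8 codec rather than os.fsdecode() so the result does not
depend on the locale; the file name is again made up.

```python
# Bytes that are not valid UTF-8, decoded the PEP 383 way.
raw = b"caf\xe9.txt"
name = raw.decode("utf-8", errors="surrogateescape")  # 'caf\udce9.txt'

# Strict encoding (the default) refuses the embedded surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same handler used for decoding restores the bytes.
roundtrip = name.encode("utf-8", errors="surrogateescape")
print(strict_ok, roundtrip == raw)
```

The strict failure is the signal that a path leaked into text that was
supposed to be clean.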

(2) You're using bytes as text a la 2.x for "efficiency's" sake, and
    you're worried that you might pass a str-y Path deep into bytes
    territory and it will explode there.

    I don't think there is any sympathy left for that use case on
    Python dev channels.  Define a clear boundary with well-defined
    entry and exit gates, and convert there.  Then you can get some
    sleep.  (How-to example: your "compatibility layer".)
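
One possible shape for such a gate, as a sketch only.  The function
name with_suffix_bytes is hypothetical, and the code assumes POSIX
semantics for os.fsdecode() and os.fsencode().

```python
import os
from pathlib import PurePosixPath


def with_suffix_bytes(path: bytes, suffix: bytes) -> bytes:
    """Bytes in, bytes out; str-based pathlib is used only in between."""
    # Entry gate: bytes -> str.
    text = os.fsdecode(path)
    # Path manipulation happens entirely in str territory.
    result = PurePosixPath(text).with_suffix(os.fsdecode(suffix))
    # Exit gate: str -> bytes.
    return os.fsencode(str(result))


print(with_suffix_bytes(b"photos/caf\xe9.JPG", b".png"))
```

Callers on the bytes side never see a str, and the pathlib code never
sees bytes.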

(3) You're worried about inefficiency of decoding/encoding to the same
    or trivially changed bytes (ie, you didn't need pathlib in the
    first place, but you got it anyway) -- this especially matters for
    2.7, but is significant for 3.x too, if you're using a bunch of
    paths in a tight loop.

    I don't have sympathy for that use case, but Brett and Guido do,
    and Brett's PEP handles it by making __fspath__ polymorphic in the
    usual os.path-y way, with Guido's modification.

    This is always a tradeoff.  If you know your JPEGs all have
    extension '.JPG' and

        png_path = jpeg_path[:-4] + b'.png'

    is readable enough for you, use that, not pathlib or Antipathy,
    and you get your efficiency.  (Doing jpeg_path.rindex(b'.') is left
    as an exercise for the reader.  Part (i): Is it really worth it?)

    If you want the readability of a rich path library and the
    efficiency of bytes, you *may* have the option of using Ethan's
    Antipathy (or whatever).

    If you can't use Antipathy, use bytes methods directly, or accept
    that it isn't *that* inefficient and use pathlib.  At this point,
    I think this subcase is just FUD: no real examples were presented
    where the efficiency hit of encoding/decoding gets in the way of
    getting work done using pathlib.

    If you need to stick to stdlib for some reason (eg, to use a
    higher-level library that uses pathlib), live with the
    "compatibility layer"'s inefficiency.  Decoding and encoding are
    actually rather low-cost operations at path lengths (PATH_MAX=256
    was common, not so long ago!).  Most high-level libraries will
    impose a lot more overhead elsewhere, and calling into pathlib by
    itself will add a certain amount of overhead as well.
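
For what it's worth, a sketch of the pure-bytes extension swap
mentioned in (3), without assuming a fixed-length extension.  The
helper name swap_extension is hypothetical, '/' is assumed to be the
only separator, and it uses rfind() rather than rindex() so a missing
dot does not raise.

```python
def swap_extension(path: bytes, new_ext: bytes) -> bytes:
    """Replace the extension of the last path component (POSIX '/' assumed)."""
    slash = path.rfind(b"/")  # -1 if there is no directory part
    dot = path.rfind(b".")    # -1 if there is no dot at all
    # No dot in the last component, or a leading-dot name such as
    # b".bashrc": there is no extension to replace, so append.
    if dot <= slash + 1:
        return path + new_ext
    return path[:dot] + new_ext


print(swap_extension(b"photos/img0001.JPG", b".png"))
```

Even this short version needs two edge cases spelled out, which is
roughly the readability cost being traded against the decode/encode
cost.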

(4) Lack of transparency/readability for "simple" operations.  If
    Antipathy is something you can use, I agree it's plausible that
    avoiding a few os.fsdecode and os.fsencode calls would look nicer,
    but this is really a style question.

    My take: I think of paths as human-readable, so presenting them as
    str (not bytes) is important to me, important enough that I
    advocate that position to other developers.  If you do the
    conversion at the boundary between a bytes-y module and pathlib
    ("compatibility layer") I don't see how it affects readability of
    the path manipulation code, while data marshaling at boundaries is
    an expected fact of software development.  YMMV.

(0) is thus a non-issue.  (1) is not something that can be addressed
by general principles, let alone language design.  (2)-(4) are all
real issues regardless of how I feel they should be resolved :-), but
they're all design trade-offs, not things that can completely block
you from getting some kinds of work done in your own style (eg, the
situation str-minded people were in before PEP 383).

Python 3 is an example of how language design can help alleviate
issues like (2), by discouraging that use case in various ways.
Brett's PEP is an example of how language design can help alleviate
issues like (3) and (4).  In particular, it helps us to interface
pathlib to open() and friends in a very natural, readable way, without
explicit conversions that should be unnecessary by the nature of the
operation and its arguments.  By contrast, the conversion of bytes to
str is important to do explicitly because they are different
representations of the same thing, and it's important that readers be
notified of that change of representation.
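
A sketch of the protocol side, written against os.fspath() as it later
shipped in Python 3.6; the draft under discussion here may have
differed in detail.  The class name MyPath is hypothetical.

```python
import os


class MyPath:
    """Toy path object; __fspath__ returns whatever type it was given."""

    def __init__(self, raw):
        self._raw = raw  # str or bytes

    def __fspath__(self):
        return self._raw


# os.fspath() returns str or bytes unchanged and asks other objects
# for their __fspath__() result.
as_str = os.fspath(MyPath("dir/file.txt"))
as_bytes = os.fspath(MyPath(b"dir/file.txt"))

# os.path functions accept such objects without an explicit conversion.
base = os.path.basename(MyPath("dir/file.txt"))
print(as_str, as_bytes, base)
```

Note that nothing here converts between bytes and str; the protocol
only unwraps the object, and the representation change stays explicit.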

 > Or am I wrong here somewhere?

Well, considering the length of this irrelevant-to-the-PEP subthread,
arguably you are feeding a successful troll.  I hope that having
posted the above, in the future there will be *one*, *short* reply to
such questions:

		    Not a problem.  Read PEP 383.

and the thread will end there.

Steve