[Python-Dev] Pathlib enhancements - acceptable inputs and outputs for __fspath__ and os.fspath()

Brett Cannon brett at python.org
Mon Apr 18 13:13:51 EDT 2016


On Sun, 17 Apr 2016 at 06:59 Koos Zevenhoven <k7hoven at gmail.com> wrote:

> On Sun, Apr 17, 2016 at 11:03 AM, Stephen J. Turnbull
> <stephen at xemacs.org> wrote:
> > Nick Coghlan writes:
> >
> >  > str and bytes aren't going to implement __fspath__ (since they're
> >  > only *sometimes* path objects), so asking people to call the
> >  > protocol method directly for any purpose would be a pain.
> >
> > It *should* be a pain.  People who need bytes should call fsencode,
> > people who need str should call fsdecode, and Ethan's antipathy checks
> > for bytes and str, then calls __fspath__ if needed.  Who's left?  Just
> > the bartender and the janitor, last call was hours ago.  OK, maybe
> > there are enough clients to make it worthwhile to provide the utility,
> > but it should be clearly marked as "double opt-in, for experts only
> > (consenting adults must show proof of insurance)".
>
> My doubts, expressed several times in these threads, about the need
> for a *public* os.fspath function to complement the __fspath__
> protocol, are now perhaps gone. I'll explain why (and how). The
> reasons for my doubts were that
>
> (1) The audience outside the stdlib for such a function should be
> small, because it is preferred to either use existing tools in
> os.path.* or pathlib (or similar) for manipulating paths.
>
> (2) There are just too many different possible versions of this
> function: rejecting str, rejecting bytes, coercion to str, coercion to
> bytes, and accepting both str and bytes. That's a total of 5 different
> cases. People also used to talk about versions that would not allow
> passing through objects that are already bytes or str. That would make
> it a total of 10 different versions!
> (in principle, there could be even more, but let's not go there :-).
> In other words, this argument was that it is probably best to
> implement whatever flavor is needed for the context, perhaps based on
> documented recipes.
>
>
> Regarding (2), we can first rule out half of the 10 cases---the ones
> that reject plain instances of bytes and/or str---because they would
> not be very useful as all the isinstance/hasattr checking etc. would
> be left to the caller. And here are the remaining five, explained
> based on what they accept as argument, what they return, and where
> they would be used:
>
> (A) "polymorphic"
> *Accept*: str and bytes, provided via __fspath__ as well as plain str
> and bytes instances.
> *Return*: str/bytes depending on input.
> *Audience*: the stdlib, including os.path.things, os.things,
> shutil.things, open, ... (some functions would need a C version).
> There may even be a small audience outside the stdlib.
>
> (B) "str-based only"
> *Accept*: str, provided via __fspath__ as well as plain str.
> *Return*: str.
> *Audience*: relatively low-level code that works exclusively with str
> paths but accepts specialized path objects as input.
>
> (C) "bytes-based only"
> *Accept*: bytes, provided via __fspath__ as well as plain bytes.
> *Return*: bytes.
> *Audience*: low-level code that explicitly deals with paths as bytes
> (probably to deal with undefined/ill-defined encodings).
>
> (D) "coerce to str"
> *Accept*: str and bytes, provided via __fspath__ as well as plain str
> and bytes instances.
> *Return*: str (coerced / decoded if needed).
> *Audience*: code that deals explicitly with str but wants to 'try'
> supporting bytes-based path inputs too via implicit decoding (even if
> it may result in surrogate escapes, which one cannot for instance
> print(...).)
>
> (E) "coerce to bytes"
> *Accept*: str and bytes, provided via __fspath__ as well as plain str
> and bytes instances.
> *Return*: bytes (coerced / encoded if needed).
> *Audience*: low-level code that explicitly deals with bytes paths but
> wants to accept str-based path inputs too via implicit encoding.
>
>
> Even if all options (A-E) probably have small audiences (compared to
> e.g. os.path.*), some of them have larger audiences than others. But
> all of them have at least *some* reasonable audience (as described
> above).
>
> Recently (well, a few days ago, but 'recently', considering the scale
> of these discussions anyway ;-), Nick pointed out something I hadn't
> realized---os.fsencode and os.fsdecode actually already implement
> coercion to bytes and str, respectively. With those two functions made
> compatible with the __fspath__ protocol [using (A) above], they would
> in fact *be* (D) and (E), respectively.
>
> Now, we only have options (A-C) left. They could all be implemented
> roughly as follows:
>
> def fspath(pathlike, *, output_types=(str,)):
>     if hasattr(pathlike, '__fspath__'):
>         # or pathlike.__fspath__ if it's not a method
>         ret = pathlike.__fspath__()
>     else:
>         ret = pathlike
>     if not isinstance(ret, output_types):
>         raise TypeError("argument is not and does not provide an "
>                         "acceptable pathname")
>     return ret
>
> With an implementation like the above, (A) would correspond to
> output_types = (str, bytes), (B) to the default, and (C) to
> output_types = (bytes,).
>
>
> So, with the above considerations as a counterargument, I consider
> argument (2) gone.
>
> What about argument (1), that the audience for the os.fspath(...)
> function (especially for one selected version of the 5 or 10
> variations!) is quite small, and we should not encourage manipulating
> pathnames by hand, but rather the use of os.path.* or pathlib?
>
> The counterargument for (1):
>
> It seems to me we now "all" agree that __fspath__ should allow
> str+bytes polymorphism. I could try to list who I mean by "all"
> (Ethan, Brett, Stephen T, Nick, ... ?), but obviously I won't be able
> to list all or speak for them so I won't even try :-). Anyway, for
> this argument, I'm assuming we agree on that. So, __fspath__ can
> provide either str or bytes, even if str is *highly preferred* in most
> places. Therefore, the os.fspath function, as part of the protocol,
> has the important role of *by default* rejecting bytes, so that the
> protocol effectively becomes str-only by default. With the fspath
> implementation like the one I drafted above, and
> os.fsencode+os.fsdecode, we in fact cover all cases (A-E).
>
> So, as a summary: With a str+bytes-polymorphic __fspath__, with the
> above argumentation and the rough implementation of os.fspath(...),
> the conclusion is that the os.fspath function should indeed be public,
> and that no further variations are needed.
>
> -Koos
>
> P.S. There is also the possibility of two dunder methods corresponding
> to str and bytes, leading to one being preferred over the other in
> some cases etc. I have gone through various aspects and possible
> versions of that approach, but concluded it's not worth it, as some of
> us may also have implied in earlier posts. After all, we want
> something that's *almost* exclusively str.
>

Just to add to the chorus of praise, thanks for the summary, Koos!

I just wanted to add a rephrasing of your overall conclusion that I reached
independently Friday night but couldn't post earlier, as I promised my wife
I wouldn't write or say the "P" word all weekend, which meant I didn't read
or respond to any python-dev email all weekend (if you think that's cruel
and unusual punishment, her Twitter is https://twitter.com/AndreaMcInnes21 ;).

If we continue with the "str is an encoding of file paths" view, you can then
build from "bytes is an encoding of str" to get a pyramid of file path
encodings: Path -> str -> bytes. I don't think this is in any way a
controversial view.

Now Stephen has been promoting the idea of enhancing os.fsencode() and
os.fsdecode() to understand what __fspath__ is (I'm ignoring the str/bytes
return points for now). With os.fsencode() this would mean giving it
anything in the Path -> str -> bytes pyramid would lead to following the
steps to reach bytes at the bottom of the encoding pyramid. That's fine and
easy to explain: whatever you pass into os.fsencode() you know it will get
encoded to bytes using the file system encoding and surrogate escape.
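
A minimal sketch of that walk down the pyramid, assuming a hypothetical
DemoPath class standing in for a real path object and a hypothetical helper
named fsencode_sketch (neither is part of the os module). The final
str -> bytes step is delegated to the existing os.fsencode():

```python
import os


class DemoPath:
    """Hypothetical path object used only for this illustration."""

    def __init__(self, text):
        self._text = text

    def __fspath__(self):
        # Step down one level of the pyramid: Path -> str.
        return self._text


def fsencode_sketch(path):
    """Coerce a path object, str, or bytes to bytes (illustrative only)."""
    if hasattr(path, "__fspath__"):
        path = path.__fspath__()
    if not isinstance(path, (str, bytes)):
        raise TypeError("expected str, bytes, or an object with __fspath__")
    # str is encoded with the file system encoding; bytes pass through.
    return os.fsencode(path)
```

Whatever level of the pyramid goes in, bytes come out, which is what makes
the function easy to explain.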

The trick becomes os.fsdecode() and its str return value. Looking at our
encoding pyramid of Path -> str -> bytes we notice that the return value
for os.fsdecode() is actually now in the *middle* of our encoding pyramid.
What that means is that while passing in bytes and decoding them to str
makes sense, passing in a Path object and getting back str is actually an
*encoding*! My brain wanting semantic purity for the "decode" part of
os.fsdecode() started to hurt.

But that's when I realized that once os.fsdecode() and os.fsencode() gain
__fspath__ support, they become coercion functions more than
encoding/decoding functions. It also means that os.fspath() has a place
when you want to say "I only want to encode a file path to str" and avoid
the decode bit that os.fsdecode() would do (IOW it's like a half step of
os.fsencode() for full control). You probably also want control of getting
just bytes and skipping os.fsencode() and its automatic encoding call so
that you don't accidentally get mojibake or something.
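
To make the "half step" concrete, here is a sketch built on the draft quoted
above. The output_types parameter comes from that draft and is not an
existing os API; fspath_sketch and StrPath are hypothetical names used only
for illustration:

```python
class StrPath:
    """Hypothetical path object whose __fspath__ returns str."""

    def __init__(self, text):
        self._text = text

    def __fspath__(self):
        return self._text


def fspath_sketch(pathlike, *, output_types=(str,)):
    """Unwrap __fspath__ if present, then type-check; never encode or decode."""
    if hasattr(pathlike, "__fspath__"):
        ret = pathlike.__fspath__()
    else:
        ret = pathlike
    if not isinstance(ret, output_types):
        raise TypeError(
            "argument is not and does not provide an acceptable pathname"
        )
    return ret


# Half step: Path -> str, with no decode step involved.
as_text = fspath_sketch(StrPath("spam.txt"))

# Bytes only, skipping any automatic encoding.
as_bytes = fspath_sketch(b"spam.txt", output_types=(bytes,))

# By default, bytes are rejected rather than silently decoded.
try:
    fspath_sketch(b"spam.txt")
    rejected = False
except TypeError:
    rejected = True
```

Because the sketch only unwraps and checks types, no encoding or decoding
can happen behind the caller's back.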

Now going back to what __fspath__ returns, this suggests that it should
return the highest level in the Path -> str -> bytes pyramid that isn't
the top. We then provide whatever support we need, through the os module,
to let people go straight to the encoding they want. Koos outlined all
of this above so I'm not going to rehash it all here, but the point is that
the protocol will be lower-level than what we expect people to work with,
and we will promote the use of the proper helper functions in the os module
to get the results people desire. (I still feel a little bad for
people writing libraries that will be manipulating paths prior to Python
3.6 who don't get this helper code, but my assumption is that they will get
a TypeError from combining whatever __fspath__() returns with a different
type in e.g. os.path.join(); otherwise they are just passing paths down to
the stdlib and so shouldn't inhibit usage of specific path encodings.)
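
The TypeError mentioned in that aside can be sketched as follows, assuming a
hypothetical BytesPath object whose __fspath__() returns bytes and a
hypothetical library helper join_under that unwraps the protocol by hand.
os.path.join() itself refuses to mix str and bytes components:

```python
import os


class BytesPath:
    """Hypothetical path object whose __fspath__ returns bytes."""

    def __fspath__(self):
        return b"data.bin"


def join_under(root, pathlike):
    """Hypothetical helper: unwrap the protocol by hand, then join."""
    if hasattr(pathlike, "__fspath__"):
        raw = pathlike.__fspath__()
    else:
        raw = pathlike
    return os.path.join(root, raw)


try:
    join_under("cache", BytesPath())  # str root, bytes component
    mixed_ok = True
except TypeError:
    mixed_ok = False  # os.path.join refuses to mix str and bytes
```

With matching types (for instance a bytes root) the same helper works, so
the mismatch surfaces as an error instead of producing a corrupt path.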