[Python-ideas] Making pathlib paths inherit from str

Paul Moore p.f.moore at gmail.com
Wed Apr 6 14:31:57 EDT 2016


On 6 April 2016 at 17:04, Koos Zevenhoven <k7hoven at gmail.com> wrote:
> In the "Working with Path objects: p-strings?" thread, I said I was
> working on a proposal. Since it's been several days already, I think I
> should post it here and get some feedback before going any further.
> Maybe I should have done that even earlier. Anyway, there are some
> rough edges, and I will need to add links to references etc.

Thanks for putting this together. I don't agree with much of it, but
it's good to have the proposal stated so clearly.

> So, do not hesitate to give feedback or criticism, which is especially
> appreciated if you take the time to read through the whole thing first
> :).

While I've read the whole proposal, there's a lot to digest, and
honestly I don't have the time to spend on this right now - so my
apologies if I missed anything relevant. Hopefully my comments will
make sense anyway :-)

> Filesystem paths are strings that give instructions for traversing a
> directory tree. In Python, they have traditionally been represented as
> byte strings, and more recently, Unicode strings. However, Python now has
> ``pathlib`` in the standard library, which is an object-oriented library
> for dealing with objects specialized in representing a path and working
> with it. In this proposal, such objects are generally referred to as
> *path objects*, or sometimes, in the specific context of instances of
> the ``pathlib`` path classes, they are explicitly referred to as
> ``pathlib`` objects.

I'm not sure I agree with this. To me, "filesystem paths" are things
which define the location of a file in a filesystem. They are not
strings, even though they can be represented by strings (actually,
they can't, technically - POSIX allows nearly arbitrary bytestrings
for paths, whereas Python strings are Unicode). Saying a path is a
string is no more true than saying that integers are strings that
represent whole numbers.
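
To illustrate the POSIX point, here is a minimal sketch. The filename is
made up for the example; the only assumption is that a filename may contain
a byte (0xff here) that is not valid UTF-8, which POSIX permits.

```python
# Illustration only: a POSIX filename is a byte sequence, not necessarily text.
raw_name = b"report-\xff.txt"   # hypothetical name; 0xff is never valid UTF-8

# A strict decode fails, so this name has no "plain" Unicode string form.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The 'surrogateescape' error handler (PEP 383) carries the undecodable byte
# through as a lone surrogate code point so the name can round-trip.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The resulting str is only a carrier for the bytes; it is not text that can
safely be printed or encoded strictly.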

Traditionally, people haven't thought of paths as objects because not
many languages provide *any* sort of abstraction of paths - doing so
in a cross-platform way is *hard* and most languages duck the issue.
Python is exceptional in providing good path manipulation functions
(even os.path is streets ahead of what many other languages offer).

> Filesystem paths (or comparable things like URIs) are strings of
> characters that represent information needed to access a file or
> directory (or other resource). In other words, they form a subset of
> strings, involving specialized functionality such as joining absolute
> and relative paths together, accessing different parts of the path or
> file name, and even accessing the resources the path points to. In
> Python terms, for a path ``path``, one would have
> ``isinstance(path, str)``. It is also clear that not all strings are
> paths.

As noted above, this makes no sense to me. By this argument "integers
are strings of characters that represent numbers". The string
representation of an object is *not* the object.

> On the one hand, this would make an ideal case for making all
> path-representing objects inherit from ``str``; while Python tries not
> to over-emphasize object-oriented programming and inheritance, it should
> not try to avoid class hierarchies when they are appropriate in terms of
> both purity and practicality. Regarding practicality, making specialized
> *path objects* also instances of ``str`` would make almost any stdlib or
> third-party function accept path objects as path arguments, assuming
> that they accept any instance of ``str``. Furthermore, functions now
> returning instances of ``str`` to represent paths could in future
> versions return path objects, with only minor backwards-incompatibility
> worries.

You mention both practicality and purity here but only offer
"practical" arguments. The practical arguments are fair, and as far as
I can see are the crux of any proposal to make Path objects subclass
str. You should focus on this, and not try to argue that subclassing
str is "right" in any purity sense.

> On the other hand, strings are a very general concept, and the Python
> ``str`` class provides a large variety of methods to manipulate and work
> with them, including ``.split()``, ``.find()``, ``.isnumeric()`` and
> ``.join()``. These operations may be defined just as well for a string
> that represents a path as for any other string. In fact, this is the
> status quo in Python, as the adoption of ``pathlib`` is still quite
> limited and paths are in most cases represented as strings (sometimes
> byte strings). But while the string operations are *defined* on
> path-representing strings, the results of these operations may not be of
> any use in most cases, even if in some cases, they may be.

This seems to me to be a key point - if many of the operations that
are part of the interface of a string don't make sense for a
filesystem path, doesn't that very clearly make the point that
filesystem paths are *not* strings?
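
As a rough sketch of the difference (the path below is invented for the
example), compare what generic str methods report with what the existing
pathlib API reports for the same value:

```python
from pathlib import PurePosixPath

name = "/srv/data/archive.tar.gz"   # hypothetical path, for illustration
p = PurePosixPath(name)

# Generic str operations are defined, but know nothing about path structure.
dot_split = name.split(".")     # splits on every dot, wherever it occurs
numeric = name.isnumeric()      # a meaningless question to ask of a path

# Path-aware operations answer questions about the path itself.
parts = p.parts                 # the individual path components
suffixes = p.suffixes           # the file name's extensions
parent = p.parent               # the containing directory
```

The str results are well-defined, just not about anything a path user cares
about.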

> There is prior art in subclassing the Python ``str`` type to build a
> path object type. Packages on PyPI (TODO: list more?) that do this

pylib's path.local object (used in pytest in particular) is another.

> include ``path.py`` and ``antipathy``. The latter also supports
> ``bytes``-based paths by instantiating a different class, a subclass of
> ``bytes``. Since these libraries have existed for several years,
> experience from them is available for evaluating the potential benefits
> and weaknesses of this proposal (as well as other aspects regarding
> ``pathlib``).

I don't think there's been any attempt made to collect or quantify
that experience, though. All I've ever seen is hearsay "I've not heard
of anyone reporting problems" evidence. While anecdotal evidence is a
lot better than nothing, it's of limited value. Apart from anything
else, there's a self-selection issue - people who *did* have problems
may simply have stopped using the libraries.

> Overriding all ``str``-specific methods
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Since most of the ``str`` methods are not of any use on paths and can be
> confusing, leading to undesired behavior, *most* ``str`` methods
> (including magic methods, but excluding methods listed below) are
> overridden in ``PurePath`` with methods that by default raise
> ``TypeError("str method '<name>' is not available for paths.")``. This
> will help programmers to immediately notice when they are using the
> wrong method. The perhaps unusual practice of disabling most base-class
> methods can be regarded as being conservative in adding ``str``
> functionality to path objects.

This seems to me to be the biggest issue. You're proposing that Path
objects will subclass strings, but code written to expect a string may
fail if passed a Path object. Presumably though that code works if
passed str(the_path_object) - as it works correctly right now. Maybe
it's doing "string-like" things, but equally, it's presumably intended
to. Consider a "make path uppercase" function that simply does
.upper() on its argument.

You are proposing a class that is a subclass of str, but calling str()
on an instance gives an object that behaves differently. That's
bizarre at best, and realistically I'd describe it as fundamentally
broken. I don't want to argue type-theory here, but I'm pretty sure
that violates most people's intuition of what inheritance means.
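
A minimal sketch of the problem, assuming a hypothetical ``StrPath`` class
that subclasses str and disables ``.upper()`` in the way the proposal
describes (neither the class nor the helper function exists anywhere; they
are only for illustration):

```python
class StrPath(str):
    """Hypothetical str-subclassing path with one str method disabled."""

    def upper(self):
        raise TypeError("str method 'upper' is not available for paths.")


def make_path_uppercase(path):
    """Stand-in for existing code written against plain strings."""
    return path.upper()


p = StrPath("/tmp/readme.txt")

# Calling str() on a str subclass instance yields an exact str, so the
# existing code keeps working when given str(p) ...
plain = str(p)
result_plain = make_path_uppercase(plain)

# ... but fails when given p itself, even though isinstance(p, str) is True.
try:
    make_path_uppercase(p)
    subclass_failed = False
except TypeError:
    subclass_failed = True
```

So an object that passes ``isinstance(p, str)`` breaks code that its own
``str()`` conversion does not.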

> Optional enabling of string methods
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Since many APIs currently have functions or methods that return paths as
> strings, existing code may expect to have all string functionality
> available on the returned objects. While most users are unlikely to use
> much of the ``str`` functionality, a library function may want to
> explicitly allow these operations on a path object that it returns.
> Therefore, the overridden ``str`` methods can be enabled by setting a
> ``._enable_str_functionality`` attribute on a path object as follows:
>
> -  ``pathobj._enable_str_functionality = True    #`` -- Enable ``str``
>    methods
> -  ``pathobj._enable_str_functionality = 'warn'  #`` -- Enable ``str``
>    methods, but emit a ``FutureWarning`` with the message
>    ``"str method '<name>' may be disabled on paths in future versions."``

This is a huge chunk of extra complexity, both in terms of
implementation, and even more so in terms of understanding.

If someone wants a "real" string, they can just cast using str() or use
the .path attribute.

This whole section of the proposal says to me that you haven't
actually solved the problem you're trying to solve - you still expect
people to have problems passing Path objects to functions that aren't
expecting them, and you've had to consider how to work round that. The
fact that you came up with (in effect) a "configuration flag" on an
immutable object like a Path rather than just using the existing "give
me a real string" options on Path, implies that your proposal is not
well thought through in this area.

Here are some questions for you (but IMO this section is unfixable - no
matter what answers you give, I still consider this whole mechanism as
a non-starter).

* Are Path objects hashable, given they now have a mutable attribute?
* If you change the _enable_str_functionality flag, does the object's
hash change?
* If it doesn't, what happens when you add 2 identical paths with
different _enable_str_functionality flags to a set?
* If you enable str methods do they return str or Path objects? If the
latter, what is the flag set to on these objects?

Basically, you broke a fundamental property of both Path and string
objects - they are immutable.
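
The set question can be sketched as follows. ``FlaggedPath`` is a
hypothetical stand-in that assumes the flag is an ordinary instance
attribute and that hashing and equality are inherited unchanged from str;
the proposal does not say either way.

```python
class FlaggedPath(str):
    """Hypothetical str subclass carrying the proposed mutable flag."""

    _enable_str_functionality = False   # class-level default


a = FlaggedPath("/tmp/data")
b = FlaggedPath("/tmp/data")
b._enable_str_functionality = True      # mutate one of two "identical" paths

# Hash and equality come from str, so the flag plays no part in either.
same_hash = hash(a) == hash(b)
equal = a == b

# The set therefore keeps only the first object added; which flag value
# survives depends purely on insertion order.
paths = set()
paths.add(a)
paths.add(b)
survivor = next(iter(paths))
```

Under these assumptions the second path is silently dropped, along with
whatever flag setting its creator asked for.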

> Changes needed to other stdlib modules
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> In stdlib modules other than ``pathlib``, mainly ``os``, ``ntpath`` and
> ``posixpath``, the functions that use the methods/functionality listed
> below on path or file names will be modified to explicitly convert the
> name ``name`` to a plain string first, e.g., using
> ``getattr(name, 'path', name)``, which also works for ``DirEntry`` but
> may return ``bytes``:
>
> -  ``split``
> -  ``find``
> -  ``rfind``
> -  ``partition``
> -  ``__iter__``
> -  ``__getitem__``

This can be done with the current Path objects (and should). It is
unrelated to this proposal. And it doesn't need to be restricted to
"if overridden string functions are used". Just do it regardless, and
all existing functions work immediately.

The only issue is functions that *return* paths. And they are no
harder under current pathlib than under your proposal - a decision on
what type to return has to be made either way.

> Guidelines for third-party package maintainers
> ----------------------------------------------
>
> Libraries that take paths as arguments or return them
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Since all of the standard library will accept path objects as path
> arguments, most third-party libraries will automatically do so. However,
> those that directly manipulate or examine the path name using ``str``
> methods may not work. Those libraries will not immediately be
> ``pathlib``-compatible.

Overcomplicated. If you accept paths, just do getattr(patharg, 'path',
patharg) and you're fine. If you return paths, do nothing (or if you
prefer, think about your API and make a more considered decision).
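
Something along these lines is all I mean. ``HasPath`` is a hypothetical
stand-in for any object exposing a ``.path`` attribute (as ``DirEntry``
does), and ``basename_of`` is an invented example of a library function:

```python
import os.path


class HasPath:
    """Hypothetical stand-in for an object that exposes a .path attribute."""

    def __init__(self, path):
        self.path = path


def basename_of(patharg):
    """Example library function accepting a plain string or a path object."""
    # Unwrap once at the boundary; everything after this sees whatever
    # .path holds, or the argument itself if it has no such attribute.
    name = getattr(patharg, "path", patharg)
    return os.path.basename(name)


from_str = basename_of("/tmp/notes.txt")
from_obj = basename_of(HasPath("/tmp/notes.txt"))
```

The rest of the function body needs no changes and no knowledge of which
str methods a path type might have disabled.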

Your proposal means that library authors have to actually consider
whether the new path objects will cause subtle failures, because the
string-like objects will not fail quickly, leading to bugs propagating
into unrelated code.

Overall, I'm a strong -1. If we subclass str, we should just do it and
not over-complicate like this. I'm still not convinced we should do
so, but your proposal *has* convinced me that any attempt to
compromise is going to end up being worse than either option.

Sorry I can't be more positive - but again, thanks for the thorough write-up.
Paul

