[Python-Dev] Alternative path suggestion

Thu May 4 14:57:05 CEST 2006

Mike Orr wrote:
> Intriguing idea, Noam, and excellent thinking.  I'd say it's worth a
> separate PEP.  It's too different to fit into PEP 355, and too big to
> be summarized in the "Open Issues" section.  Of course, one PEP will
> be rejected if the other is approved.

I agree that a competing PEP is probably the best way to track this idea.

> The main difficulty with this approach is it's so radical.  It would
> require a serious champion to convince people it's as good as our
> tried-and-true strings.

Guido has indicated strong dissatisfaction with the idea of subclassing 
str/unicode with respect to PEP 355. And if you're not subclassing those types 
in order to be able to pass the object transparently to existing APIs, then it 
makes much more sense to choose a sensible internal representation like a 
tuple, so that issues like os.sep can be deferred until they matter.

That's generally been my problem with PEP 355 - a string is an excellent 
format for describing and displaying paths, but it's a *hopeless* format for 
manipulating them.

A real data type that acts as a convenience API for the various 
string-accepting filesystem APIs would be a good thing.

>> == a tuple instead of a string ==
>>
>> The biggest conceptual change is that my path object is a subclass of
>> ''tuple'', not a subclass of str.

Why subclass anything? The path should internally represent the filesystem 
path as a list or tuple, but it shouldn't *be* a tuple.
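
As a rough sketch of "has a tuple" rather than "is a tuple" (the class name Path, the parts property and the joined() method are made up for illustration, not taken from any existing module):

```python
import os


class Path(object):
    """Hypothetical path type that *holds* a tuple rather than being one."""

    def __init__(self, *parts):
        # The elements are stored privately; the object is not a tuple.
        self._parts = tuple(parts)

    @property
    def parts(self):
        return self._parts

    def joined(self, *more):
        # Manipulation builds a new object and never looks at os.sep.
        return type(self)(*(self._parts + more))

    def __str__(self):
        # os.sep only matters here, at conversion time.
        return os.sep.join(self._parts)


p = Path('a', 'b').joined('c.txt')
```

Only __str__ has to know about the platform separator; everything else works on the stored elements.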

Also, was it a deliberate design decision to make the path objects immutable, 
or was that simply copied from the fact that strings are immutable?

Given that immutability allows the string representation to be calculated once 
and then cached, I'll go with that interpretation.
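
A minimal sketch of that caching, assuming a hypothetical CachedPath class (the name and the use of __slots__ are my own choices for the example):

```python
import os


class CachedPath(object):
    """Hypothetical immutable path whose string form is computed once."""

    __slots__ = ('_parts', '_text')

    def __init__(self, *parts):
        # object.__setattr__ bypasses the guard below during construction.
        object.__setattr__(self, '_parts', tuple(parts))
        object.__setattr__(self, '_text', None)

    def __setattr__(self, name, value):
        raise AttributeError('CachedPath objects are immutable')

    def __str__(self):
        if self._text is None:
            # Safe to cache: the parts can never change after creation.
            object.__setattr__(self, '_text', os.sep.join(self._parts))
        return self._text
```

If the elements could be changed after creation, the cached text would have to be invalidated on every mutation, which is the cost immutability avoids.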

>> {{{
>> >>> tuple(path('/a/b/c'))
>> (path.ROOT, 'a', 'b', 'c')
>> }}}
> 
> How about  an .isabsolute attribute instead of prepending path.ROOT? 
> I can see arguments both ways.  An attribute is easy to query and easy
> for str() to use, but it wouldn't show up in a tuple-style repr().

You can have your cake and eat it too by storing most of the state in the 
internal tuple, but providing convenience attributes to perform certain queries.

This would actually be the criterion for the property/method distinction. 
Properties can be determined solely by inspecting the internal data store, 
whereas methods would require accessing the filesystem.
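
To illustrate the split (all names here are hypothetical, and the absolute flag is passed in directly just to keep the example short):

```python
import os


class Path(object):
    """Hypothetical class showing the property/method distinction."""

    def __init__(self, is_absolute, *parts):
        self._is_absolute = is_absolute
        self._parts = tuple(parts)

    @property
    def is_absolute(self):
        # Answered purely from the stored state, so it is a property.
        return self._is_absolute

    @property
    def name(self):
        # Also derived from stored state only.
        return self._parts[-1] if self._parts else ''

    def exists(self):
        # Has to ask the filesystem, so it is a method.
        return os.path.exists(str(self))

    def __str__(self):
        prefix = os.sep if self._is_absolute else ''
        return prefix + os.sep.join(self._parts)
```

A reader of calling code can then tell from the syntax alone whether an expression might hit the disk.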

>> This means that path objects aren't the string representation of a
>> path; they are a ''logical'' representation of a path. Remember why a
>> filesystem path is called a path - because it's a way to get from one
>> place on the filesystem to another. Paths can be relative, which means
>> that they don't define from where to start the walk, and can be not
>> relative, which means that they do. In the tuple representation,
>> relative paths are simply tuples of strings, and not relative paths
>> are tuples of strings with a first "root" element.

I suggest storing the first element separately from the rest of the path. The 
reason for suggesting this is that you use 'os.sep' to separate elements in 
the normal path, but *not* to separate the first element from the rest.

Possible values for the path's root element would then be:

   None ==> relative path
   path.ROOT ==> Unix absolute path
   path.DRIVECWD ==> Windows drive relative path
   path.DRIVEROOT ==> Windows drive absolute path
   path.UNCSHARE  ==> UNC path
   path.URL  ==> URL path

The last four would have attributes (the two Windows ones to get at the drive 
letter, the UNC one to get at the share name, and the URL one to get at the 
text of the URL).

Similarly, I would separate out the extension to a distinct attribute, as it 
too uses a different separator from the normal path elements ('.' most places, 
but '/' on RISC OS, for example)

The string representation would then be:

   def __str__(self):
       return (str(self.root)
               + os.sep.join(self.path)
               + os.extsep + self.ext)
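
The method above assumes that both a root and an extension are present. A slightly fuller sketch that also copes with relative paths and missing extensions might look like this (the Root class and the constructor signature are assumptions made for the example):

```python
import os


class Root(object):
    """Hypothetical Unix-style root marker; str() gives its textual prefix."""

    def __str__(self):
        return os.sep


class Path(object):
    """Hypothetical three-piece path: root, path elements, extension."""

    def __init__(self, root, path, ext=None):
        self.root = root          # None means a relative path
        self.path = tuple(path)   # ordinary elements, joined with os.sep
        self.ext = ext            # extension without its separator, or None

    def __str__(self):
        text = os.sep.join(self.path)
        if self.root is not None:
            # The root is prepended as-is, not joined with os.sep.
            text = str(self.root) + text
        if self.ext is not None:
            text = text + os.extsep + self.ext
        return text
```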

>> The advantage of using a logical representation is that you can forget
>> about the textual representation, which can be really complex.

As noted earlier - text is a great format for path related I/O. It's a lousy 
format for path manipulation.

>> {{{
>> p.normpath()  -> Isn't needed - done by the constructor
>> p.basename()  -> p[-1]
>> p.splitpath() -> (p[:-1], p[-1])
>> p.splitunc()  -> (p[0], p[1:]) (if isinstance(p[0], path.UNCRoot))
>> p.splitall()  -> Isn't needed
>> p.parent      -> p[:-1]
>> p.name        -> p[-1]
>> p.drive       -> p[0] (if isinstance(p[0], path.Drive))
>> p.uncshare    -> p[0] (if isinstance(p[0], path.UNCRoot))
>> }}}

These same operations using separate root and path attributes:

p.basename()  -> p[-1]
p.splitpath() -> (p[:-1], p[-1])
p.splitunc()  -> (p.root, p.path)
p.splitall()  -> Isn't needed
p.parent      -> p[:-1]
p.name        -> p[-1]
p.drive       -> p.root.drive  (AttributeError if not drive based)
p.uncshare    -> p.root.share  (AttributeError if not UNC based)
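
A sketch of how the AttributeError behaviour could fall out naturally from distinct root types (DriveRoot, UNCRoot and their attributes are hypothetical names, not part of any existing module):

```python
class DriveRoot(object):
    """Hypothetical root for drive based paths; carries the drive letter."""

    def __init__(self, drive):
        self.drive = drive


class UNCRoot(object):
    """Hypothetical root for UNC paths; carries the host and share names."""

    def __init__(self, host, share):
        self.host = host
        self.share = share


class Path(object):
    def __init__(self, root, path):
        self.root = root
        self.path = tuple(path)

    @property
    def drive(self):
        # Raises AttributeError when the root has no drive attribute.
        return self.root.drive

    @property
    def uncshare(self):
        # Raises AttributeError when the root has no share attribute.
        return self.root.share
```

No isinstance checks are needed: a root that lacks the attribute (including None for a relative path) simply raises.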

> That's a big drawback.  PEP 355 can choose between string and
> non-string, but this way is limited to non-string.  That raises the
> minor issue of changing the open() functions etc in the standard
> library, and the major issue of changing them in third-party
> libraries.

It's not that big a drama, really. All you need to do is call str() on your 
path objects when you're done manipulating them. The third party libraries 
don't need to know how you created your paths, only what you ended up with.

Alternatively, if the path elements are stored in separate attributes, there's 
nothing stopping the main object from inheriting from str or unicode the way 
the PEP 355 path object does.

Either way, this object would still be far more convenient for manipulating 
paths than a string based representation that has to deal with OS-specific 
issues on every operation, rather than only during creation and conversion to 
a string. The path objects would also serve as an OS-independent 
representation of filesystem paths.

In fact, I'd leave most of the low-level APIs working only on strings - the 
only one I'd change to accept path objects directly is open() (which would be 
fairly easy, as that's a factory function now).

>> This means that paths starting with a drive letter alone
>> (!UnrootedDrive instance, in my module) and paths starting with a
>> backslash alone (the CURROOT object, in my module) are not relative
>> and not absolute.
> 
> I guess that's plausable.  We'll need feedback from Windows users.

As suggested above, I think the root element should be stored separately from 
the rest of the path. Then adding a new kind of root element (such as a URL) 
becomes trivial.

> The question is, does forcing people to use .stat() expose an
> implementation detail that should be hidden, and does it smell of
> Unixism?  Most people think a file *is* a regular file or a directory.
>  The fact that this is encoded in the file's permission bits -- which
> stat() examines -- is a quirk of Unix.

I wouldn't expose stat() - as you say, it's a Unixism. Instead, I'd provide a 
subclass of Path that used lstat instead of stat for symbolic links.

So if I want symbolic links followed, I use the normal Path class. This class 
would generally treat symbolic links as if they were the file pointed to 
(except for the whole not recursing into symlinked subdirectories thing).

The SymbolicPath subclass would treat normal files as usual, but *wouldn't* 
follow symbolic links when stat'ting files (instead, it would stat the symlink).
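
Something along these lines, where the private _stat() hook and the method names are assumptions made for the sketch:

```python
import os
import stat


class Path(object):
    """Hypothetical path type that follows symbolic links."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    def _stat(self):
        # os.stat follows symbolic links to the file they point at.
        return os.stat(self._text)

    def is_dir(self):
        return stat.S_ISDIR(self._stat().st_mode)

    def is_link(self):
        # Always uses lstat, since os.stat never reports a link.
        return stat.S_ISLNK(os.lstat(self._text).st_mode)


class SymbolicPath(Path):
    """Same interface, but queries describe the link itself."""

    def _stat(self):
        # os.lstat describes the symlink rather than its target.
        return os.lstat(self._text)
```

The caller picks the behaviour once, by choosing the class, instead of passing a follow-links flag to every query.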

>> == One Method for Finding Files ==
>>
>> (They're actually two, but with exactly the same interface). The
>> original path object has these methods for finding files:
>>
>> {{{
>> def listdir(self, pattern = None): ...
>> def dirs(self, pattern = None): ...
>> def files(self, pattern = None): ...
>> def walk(self, pattern = None): ...
>> def walkdirs(self, pattern = None): ...
>> def walkfiles(self, pattern = None): ...
>> def glob(self, pattern):
>> }}}
>>
>> I suggest one method that replaces all those:
>> {{{
>> def glob(self, pattern='*', topdown=True, onlydirs=False, onlyfiles=False): ...
>> }}}

Swiss army methods are even more evil than wide APIs. And I consider the term 
'glob' itself to be a Unixism - I've found the technique to be far more 
commonly known as wildcard matching in the Windows world.

The path module has those methods for 7 distinct use cases:
   - list the contents of this directory
   - list the subdirectories of this directory
   - list the files in this directory
   - walk the directory tree rooted at this point, yielding both files and dirs
   - walk the directory tree rooted at this point, yielding only the dirs
   - walk the directory tree rooted at this point, yielding only the files
   - walk this pattern

The first 3 operations are far more common than the last 4, so they need to stay.

   def entries(self, pattern=None):
       """Return list of all entries in directory"""
       _path = type(self)
       # Wrap the names first: matches() is a path method, not a str method
       all_entries = [_path(x) for x in os.listdir(str(self))]
       if pattern is not None:
           return [x for x in all_entries if x.matches(pattern)]
       return all_entries

   def subdirs(self, pattern=None):
       """Return list of all subdirectories in directory"""
       return [x for x in self.entries(pattern) if x.is_dir()]

   def files(self, pattern=None):
       """Return list of all files in directory"""
       return [x for x in self.entries(pattern) if x.is_file()]

   # here are sample implementations of the test methods used above
   # (they rely on "import fnmatch, os")
   def matches(self, pattern):
       return fnmatch.fnmatch(str(self), pattern)
   def is_dir(self):
       return os.path.isdir(str(self))
   def is_file(self):
       return os.path.isfile(str(self))
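
Pulled together into something runnable, the three listing methods might look like the following. Two details are my own assumptions rather than part of the outline above: each entry is joined onto the directory so that is_dir()/is_file() work regardless of the current directory, and matches() tests only the final component.

```python
import fnmatch
import os


class Path(object):
    """Hypothetical string-backed path, just enough to show the listing API."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    @property
    def name(self):
        return os.path.basename(self._text)

    def matches(self, pattern):
        # Wildcard match against the final component only.
        return fnmatch.fnmatch(self.name, pattern)

    def is_dir(self):
        return os.path.isdir(self._text)

    def is_file(self):
        return os.path.isfile(self._text)

    def entries(self, pattern=None):
        """Return list of all entries in directory"""
        make = type(self)
        found = [make(os.path.join(self._text, n))
                 for n in sorted(os.listdir(self._text))]
        if pattern is not None:
            found = [p for p in found if p.matches(pattern)]
        return found

    def subdirs(self, pattern=None):
        """Return list of all subdirectories in directory"""
        return [p for p in self.entries(pattern) if p.is_dir()]

    def files(self, pattern=None):
        """Return list of all files in directory"""
        return [p for p in self.entries(pattern) if p.is_file()]
```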

For the tree traversal operations, there are still multiple use cases:

   def walk(self, topdown=True, onerror=None):
       """ Walk directories and files just as os.walk does"""
       # Similar to os.walk, only yielding Path objects instead of strings
       # For each directory, effectively returns:
       #    yield dirpath, dirpath.subdirs(), dirpath.files()

   def walkdirs(self, pattern=None, onerror=None):
       """Only walk directories matching pattern"""
       for dirpath, subdirs, files in self.walk(onerror=onerror):
           yield dirpath
           if pattern is not None:
               # Modify in-place so that walk() responds to the change
               subdirs[:] = [x for x in subdirs if x.matches(pattern)]

   def walkfiles(self, pattern=None, onerror=None):
       """Only walk file names matching pattern"""
       for dirpath, subdirs, files in self.walk(onerror=onerror):
           if pattern is not None:
               for f in files:
                   if f.matches(pattern):
                       yield f
           else:
               for f in files:
                   yield f

   def walkpattern(self, pattern=None):
       """Only walk paths matching glob pattern"""
       _factory = type(self)
       for pathname in glob.glob(pattern):
           yield _factory(pathname)
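
The walk() method is only outlined above. One way to fill it in on top of os.walk, written as a plain function to keep it short (walk_paths and its factory parameter are hypothetical names), is:

```python
import os


def walk_paths(top, factory=str, topdown=True, onerror=None):
    """Like os.walk, but every yielded item is built with `factory`."""
    for dirpath, dirnames, filenames in os.walk(top, topdown=topdown,
                                                onerror=onerror):
        subdirs = [factory(os.path.join(dirpath, d)) for d in dirnames]
        files = [factory(os.path.join(dirpath, f)) for f in filenames]
        yield factory(dirpath), subdirs, files
        if topdown:
            # Copy the caller's in-place pruning of `subdirs` back into the
            # list os.walk is using, so pruned directories are not visited.
            dirnames[:] = [os.path.basename(str(s)) for s in subdirs]
```

The write-back after the yield is what lets walkdirs() prune the traversal by assigning to subdirs[:], since os.walk only looks at its own list of names.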

>> pattern is the good old glob pattern, with one additional extension:
>> "**" matches any number of subdirectories, including 0. This means
>> that '**' means "all the files in a directory", '**/a' means "all the
>> files in a directory called a", and '**/a*/**/b*' means "all the files
>> in a directory whose name starts with 'b' and the name of one of their
>> parent directories starts with 'a'".
> 
> I like the separate methods, but OK.  I hope it doesn't *really* call
> glob if the pattern is the default.

Keep the separate methods. Trying to squeeze too many disparate use cases 
through a single API is a bad idea. Directory listing and tree-traversal are 
not the same thing. Path matching and filename matching are not the same thing 
either.

> Or one could, gasp, pass a constant or the 'find' command's
> abbreviation ("d" directory, "f" file, "s" socket, "b" block
> special...).

Magic letters in an API are just as bad as magic numbers :)

More importantly, these things don't port well between systems.

>> In my proposal:
>>
>> {{{
>> def copy(self, dst, copystat=False): ...
>> }}}
>>
>> It's just that I think that copyfile, copymode and copystat aren't
>> usually useful, and there's no reason not to unite copy and copy2.
> 
> Sounds good.

OK, this is one case where a swiss army method may make sense. Specifically, 
something like:

   def copy_to(self, dest, copyfile=True, copymode=True, copytime=False): ...

Whether or not to copy the file contents, the permission settings and the last 
access and modification time are then all independently selectable.
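
A sketch of how the three flags could map onto existing standard library calls, written here as a function on plain strings (the mapping is my assumption; in particular os.utime is used for the times because shutil.copystat would copy the mode as well):

```python
import os
import shutil


def copy_to(src, dest, copyfile=True, copymode=True, copytime=False):
    """Copy contents, permission bits and timestamps independently."""
    if copyfile:
        # File contents only; no metadata.
        shutil.copyfile(src, dest)
    if copymode:
        # Permission bits only.
        shutil.copymode(src, dest)
    if copytime:
        # Last access and modification times only.
        st = os.stat(src)
        os.utime(dest, (st.st_atime, st.st_mtime))
```

With copyfile=False the destination has to exist already, since the other two steps only modify metadata.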

The different method name also makes the direction of the copying clear (with 
a bare 'copy', it's slightly ambiguous as the 'cp src dest' parallel isn't as 
strong as it is with a function).

> I was wondering what the fallout would be of normalizing "a/../b" and
> "a/./b" and "a//b", but it sounds like you're thinking about it.

The latter two are OK, but normalizing the first one can give you the wrong 
answer if 'a' is a symlink (since 'a/../b' is then not necessarily the same as 
'b').

Better to just leave the '..' in and not treat it as something that can be 
normalised away.
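
With a tuple of elements that policy is a one-liner (normalize_parts is a hypothetical helper name, and it is meant for the elements after the root):

```python
def normalize_parts(parts):
    """Drop '' and '.' elements but keep every '..' where it is."""
    return tuple(p for p in parts if p not in ('', '.'))
```

So 'a/./b' and 'a//b' both reduce to ('a', 'b'), while 'a/../b' stays as ('a', '..', 'b') and is left for the filesystem to resolve.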

>> I removed the methods associated with file extensions. I don't recall
>> using them, and since they're purely textual and not OS-dependent, I
>> think that you can always do p[-1].rsplit('.', 1).

Most modern OS's use '.' as the extension separator, true, but os.extsep still 
exists for a reason :)

> .namebase is an obnoxious name though.  I wish we could come up with
> something better.

p.path[-1] :)

Then p.name can just do (p.path[-1] + os.extsep + p.ext) to rebuild the full 
filename including the extension (if p.ext was None, then p.name would be the 
same as p.path[-1])
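
For example, with two hypothetical helpers (treating a leading separator, as in a Unix dotfile, as "no extension" is my own assumption):

```python
import os


def split_ext(filename):
    """Split at the last os.extsep; ext is None when there is no extension."""
    base, sep, ext = filename.rpartition(os.extsep)
    if not sep or not base:
        # No separator at all, or only a leading one: no extension.
        return filename, None
    return base, ext


def join_ext(base, ext):
    """Rebuild the full file name from its two pieces."""
    return base if ext is None else base + os.extsep + ext
```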

>> I removed expand. There's no need to use normpath, so it's equivalent
>> to .expanduser().expandvars(), and I think that the explicit form is
>> better.
> 
> Expand is useful though, so you don't forget one or the other.

And as you'll usually want to do both, adding about 15 extra characters for no 
good reason seems like a bad idea. . .
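
On plain strings the combined operation is tiny (expand here is a hypothetical free function standing in for the method):

```python
import os


def expand(text):
    """Expand a leading '~' and any environment variable references."""
    return os.path.expandvars(os.path.expanduser(text))
```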

>> copytree - I removed it. In shutil it's documented as being mostly a
>> demonstration, and I'm not sure if it's really useful.
> 
> Er, not sure I've used it, but it seems useful.  Why force people to
> reinvent the wheel with their own recursive loops that they may get
> wrong?

Because the handling of exceptional cases is almost always going to be 
application specific. Note that even os.walk provides a callback hook for when 
the call to os.listdir() fails while attempting to descend into a directory.

For copytree, the issues to be considered are significantly worse:
   - what to do if listdir fails in the source tree?
   - what to do if reading a file fails in the source tree?
   - what to do if a directory doesn't exist in the target tree?
   - what to do if a directory already exists in the target tree?
   - what to do if a file already exists in the target tree?
   - what to do if writing a file fails in the target tree?
   - should the file contents/mode/time be copied to the target tree?
   - what to do with symlinks in the source tree?

Now, what might potentially be genuinely useful is paired walk methods that 
allowed the following:

   # Do path.walk over this directory, and also return the corresponding
   # information for a destination directory (so the dest dir information
   # probably *won't* match that file system)
   for src_info, dest_info in src_path.pairedwalk(dest_path):
       src_dirpath, src_subdirs, src_files = src_info
       dest_dirpath, dest_subdirs, dest_files = dest_info
       # Do something useful

   # Ditto for path.walkdirs
   for src_dirpath, dest_dirpath in src_path.pairedwalkdirs(dest_path):
       # Do something useful

   # Ditto for path.walkfiles
   for src_file, dest_file in src_path.pairedwalkfiles(dest_path):
       src_file.copy_to(dest_file)
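
The file-level variant could be sketched on top of os.walk roughly as follows (paired_walk_files is a hypothetical name, and it works on plain strings rather than path objects):

```python
import os


def paired_walk_files(src_root, dest_root):
    """Yield (source file, corresponding destination file) pairs."""
    for dirpath, dirnames, filenames in os.walk(src_root):
        # Position of this directory relative to the source root...
        rel = os.path.relpath(dirpath, src_root)
        # ...mapped onto the destination root ('.' collapses away).
        dest_dir = os.path.normpath(os.path.join(dest_root, rel))
        for name in filenames:
            yield os.path.join(dirpath, name), os.path.join(dest_dir, name)
```

Nothing is created or checked on the destination side; deciding what to do about missing or existing targets stays with the caller, which is the point of the exercise.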

> You've got two issues here.  One is to go to a tuple base and replace
> several properties with slicing.  The other is all your other proposed
> changes.  Ideally the PEP would be written in a way that these other
> changes can be propagated back and forth between the PEPs as consensus
> builds.

The main thing Jason's path object has going for it is that it brings together 
Python's disparate filesystem manipulation APIs into one place. Using it is a 
definite improvement over using the standard lib directly.

However, because it uses a string as the internal storage instead of a more 
appropriate format (such as the three-piece structure I suggest of root, path, 
extension), it doesn't do as much as it could to abstract away the hassles of 
os.sep and os.extsep.

By focusing on the idea that strings are for path input and output operations, 
rather than for path manipulation, it should be possible to build something 
even more usable than path.py.

If it was done within the next several months and released on PyPI, it might 
even be a contender for 2.6.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org