[Python-ideas] find-like functionality in pathlib

Tue Dec 22 19:23:16 EST 2015

(Wow, what a rambling message. I'm not sure which part you hope to see
addressed.)

On Tue, Dec 22, 2015 at 1:54 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

> On Tuesday, December 22, 2015 12:14 PM, Guido van Rossum <guido at python.org>
> wrote:
>
> >The UNIX find tool has many, many options.
>
>
> I think a Pythonicized, stripped-down version of the basic design of fts (
> http://man7.org/linux/man-pages/man3/fts.3.html) is as simple as you're
> going to get. After all, fts was designed to make it as easy as possible to
> implement find efficiently.

The docs make no attempt at showing the common patterns. The API described
looks horribly complex (I guess that's what you get when all that matters
is efficient implementation).

> In my incomplete Python wrapper around fts, the simplest use looks like:
>
>     with fts(root) as f:
>         for path in f:
>             do_stuff(path)
>
> No two-level iteration, no need to join the root to the paths, no handling
> dirs and files separately.
>

The two-level iteration forced upon you by os.walk() is indeed often
unnecessary -- but handling dirs and files separately usually makes sense,
and remarkably often there *is* something where the two-level iteration
helps (otherwise I'm sure you'd see lots of code that's trying to recover
the directory by parsing the path and remembering the previous path and
comparing the two).

>
>
> Of course for that basic use case, you could just write your own wrapper
> around os.walk:
>
>     def flatwalk(*args, **kwargs):
>         return (os.path.join(root, file)
>                 for root, dirs, files in os.walk(*args, **kwargs)
>                 for file in files)
>
> But more complex uses build on fts pretty readably:
>
>     # find "$@" -H -xdev -type f -mtime 1 -iname '*.pyc' -exec do_stuff '{}' \;
>     yesterday = (datetime.now() - timedelta(days=1)).timestamp()
>     with fts(top, stat=True, crossdev=False) as f:
>         for path in f:
>             if (path.is_file and path.stat.st_mtime < yesterday
>                     and path.lower().endswith('.pyc')):
>                 do_stuff(path)
>

Why does this use a with *and* a for-loop? Is there some terribly important
cleanup that needs to happen when the for-loop is aborted?

It also shows off the arbitrariness of the fts API -- fts() seems to have a
bunch of random keyword args to control a variety of aspects of its
behavior and the returned path objects look like they have a rather bizarre
API: e.g. why is is_file a property on path, mtime a property on path.stat,
and lower() a method on path directly? (And would path also have an
endswith() method directly, in case I don't need to lowercase it?)

Of course that can all be cleaned up easily enough -- it's a simple
matter of API design.

>
>
> When you actually need to go a directory at a time, like the spool
> directory size example in the stdlib, os.walk is arguably nicer, but
> fortunately os.walk already exists.
>

I've never seen that example. But just a few days ago I wrote a little bit
of code where the os.walk() API came in handy:

for root, dirs, files in os.walk(arg):
    print("Scanning %s (%d files):" % (root, len(files)))
    for file in files:
        process(os.path.join(root, file))

(The point is not that we have access to dirs separately, but that we have
the directories filtered out of the count of files.)

> The problem isn't designing a nice walk API; it's integrating it with
> pathlib.* It seems fundamental to the design of pathlib that Path objects
> never cache anything. But the whole point of using something like fts is to
> do as few filesystem calls as possible to get the information you need; if
> it throws away everything it did and forces you to retrieve the same
> information again (possibly even in a less efficient way), that kind of
> defeats the purpose. Even besides efficiency, having those properties all
> nicely organized and ready for you can make the code simpler.
>

Would it make sense to engage in a little duck typing and have an API that
mimicked the API of Path objects but caches the stat() information? This
could be built on top of scandir(), which provides some of the information
without needing extra syscalls (depending on the platform). But even where
a syscall is still needed, this hypothetical Path-like object could cache
the stat() result. If this type of result was only returned by a new
hypothetical integration of os.walk() and pathlib, the caching would not be
objectionable (it would simply be a limitation of the pathwalk API, rather
than of the Path object).
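
A minimal sketch of such an object, assuming a wrapper class rather than a
change to Path itself. The names CachingEntry and pathwalk are hypothetical;
os.scandir() and the DirEntry methods are the real stdlib API, and it is
DirEntry that does the caching here. Error handling (e.g. unreadable
directories) is left out.

```python
import os
import pathlib


class CachingEntry:
    """Hypothetical Path-like wrapper around an os.DirEntry."""

    def __init__(self, entry):
        self._entry = entry
        # A real Path is kept around for pure path manipulation.
        self.path = pathlib.Path(entry.path)

    def __fspath__(self):
        return self._entry.path

    def __str__(self):
        return self._entry.path

    @property
    def name(self):
        return self._entry.name

    def is_dir(self):
        # Often answered from the directory listing, without a stat call.
        return self._entry.is_dir(follow_symlinks=False)

    def is_file(self):
        return self._entry.is_file(follow_symlinks=False)

    def stat(self):
        # DirEntry caches the result after the first call.
        return self._entry.stat(follow_symlinks=False)


def pathwalk(top):
    """Yield a CachingEntry for everything below top, parents first."""
    with os.scandir(top) as it:
        entries = list(it)
    for entry in entries:
        wrapped = CachingEntry(entry)
        yield wrapped
        if wrapped.is_dir():  # symlinked directories are not followed
            yield from pathwalk(entry.path)
```

Because the caching lives in the object returned by the walk, ordinary Path
objects keep their no-caching behavior.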

> Anyway, if you don't want either the efficiency or the simplicity, and
> just want an iterable of filenames or Paths, you might as well just use the
> wrapper around the existing os.walk that I wrote above. To make it work
> with Path objects:
>
>
>     def flatpathwalk(root, *args, **kwargs):
>         return map(pathlib.Path, flatwalk(str(root), *args, **kwargs))
>
> And then to use those Path objects:
>
>     matches = (path for path in flatpathwalk(root)
>                if pattern.match(str(path)))
>
> > For the general case it's probably easier to use os.walk(). But there
> > are probably some common uses that deserve better direct support in
> > e.g. the glob module. Would just a way to recursively search for
> > matches using e.g. "**.txt" be sufficient? If not, can you specify what
> > else you'd like? (Just "find-like" is too vague.)
> >
> > --Guido (mobile)
>
> pathlib already has a glob method, which handles '*/*.py' and even
> recursive '**/*.py' (and a match method to go with it). If that's
> sufficient, it's already there. Adding direct support for Path objects in
> the glob module would just be a second way to do the exact same thing. And
> honestly, if open, os.walk, etc. aren't going to work with Path objects,
> why should glob.glob?
>

Oh, I'd forgotten about pathlib.Path.rglob().
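
For reference, a small sketch of the recursive search that already exists.
The helper name python_sources is made up for illustration; Path.rglob() is
the real method, and rglob("*.py") behaves like glob("**/*.py").

```python
import pathlib


def python_sources(top):
    """Return every *.py file below top, sorted for stable output."""
    return sorted(pathlib.Path(top).rglob("*.py"))
```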

Maybe the OP also didn't know about it? He claimed he just wanted to use
regular expressions so he could exclude .git directories. To tell the
truth, I don't have much sympathy for that: regular expressions are just
too full of traps to make a good API for file matching, and it wouldn't
even strictly be sufficient to filter the entire directory tree under .git
unless you added matching on the entire path -- but then you'd still pay
for the cost of traversing the .git tree even if your regex were to exclude
it entirely, because the library wouldn't be able to introspect the regex
to determine that for sure.
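
By contrast, os.walk() can skip a subtree without ever entering it, because
in top-down mode the caller may edit the dirs list in place. A minimal
sketch, assuming the goal is simply "all files except those under .git";
the function name walk_skipping and its skip parameter are hypothetical.

```python
import os


def walk_skipping(top, skip=(".git",)):
    """Yield file paths below top, never descending into names in skip."""
    for root, dirs, files in os.walk(top):
        # Editing dirs in place tells os.walk() not to descend into them.
        dirs[:] = [d for d in dirs if d not in skip]
        for name in files:
            yield os.path.join(root, name)
```

The pruning happens before the .git directory is listed, so its contents
cost nothing, which a regex applied to finished paths cannot achieve.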

He also insisted on staying within the Path framework, which is an
indication that maybe what we're really looking for here is the hybrid of
walk/scandir/Path that I was trying to allude to above.

> * Honestly, I think the problem here is that the pathlib module is just
> not useful. In a new language that used path objects--or, probably, URL
> objects--everywhere, it would be hard to design something better than
> pathlib, but as it is, while it's great for making really hairy path
> manipulation more readable, path manipulation never _gets_ really hairy,
> and os.path is already very well designed, and the fact that pathlib
> doesn't know how to interact with anything else in the stdlib or
> third-party code means that the wrapper stuff that constructs a Path on one
> end and calls str or bytes on the other end depending on which one you
> originally had adds as much complexity as you saved. But that's obviously
> off-topic here.
>

Seems the OP disagrees with you here -- he really wants to use pathlib (as
was clear from his response to a suggestion to use fnmatch).

Truly pushing for adoption of a new abstraction like this takes many years
-- pathlib was new (and provisional) in 3.4 so it really hasn't been long
enough to give up on it. The OP hasn't!

So, perhaps the pathlib.Path class needs to have some way to take in a
DirEntry produced by os.scandir() and a flag to allow it to cache stat()
results? Then we could easily write a pathlib.walk() function that's like
os.walk() but returning caching Path objects.

-- 
--Guido van Rossum (python.org/~guido)