[Python-ideas] find-like functionality in pathlib

Tue Dec 22 16:54:55 EST 2015

On Tuesday, December 22, 2015 12:14 PM, Guido van Rossum <guido at python.org> wrote:

> The UNIX find tool has many, many options.

I think a Pythonicized, stripped-down version of the basic design of fts (http://man7.org/linux/man-pages/man3/fts.3.html) is as simple as you're going to get. After all, fts was designed to make it as easy as possible to implement find efficiently. In my incomplete Python wrapper around fts, the simplest use looks like:

    with fts(root) as f:
        for path in f:
            do_stuff(path)

No two-level iteration, no need to join the root to the paths, no handling dirs and files separately.

Of course for that basic use case, you could just write your own wrapper around os.walk:

    import os

    def flatwalk(*args, **kwargs):
        return (os.path.join(root, file)
                for root, dirs, files in os.walk(*args, **kwargs)  # outer loop comes first
                for file in files)

But more complex uses build on fts pretty readably:

    # find -H "$@" -xdev -type f -mtime 1 -iname '*.pyc' -exec do_stuff '{}' \;
    # st_mtime is a float timestamp, so compare it against a timestamp, not a datetime
    yesterday = (datetime.now() - timedelta(days=1)).timestamp()
    with fts(top, stat=True, crossdev=False) as f:
        for path in f:
            if path.is_file and path.stat.st_mtime < yesterday and path.lower().endswith('.pyc'):
                do_stuff(path)

When you actually need to go a directory at a time, like the spool directory size example in the stdlib, os.walk is arguably nicer, but fortunately os.walk already exists.
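
For comparison, here is a rough sketch of the directory-at-a-time style, loosely modeled on the size example in the os.walk documentation. The name dir_sizes is made up for illustration, and it only counts files directly inside each directory:

```python
import os

def dir_sizes(top):
    """Map each directory under top to the total size of the files directly in it."""
    sizes = {}
    for root, dirs, files in os.walk(top):
        # os.walk hands back one directory per iteration, so the per-directory
        # total is a single sum over that directory's file names.
        sizes[root] = sum(os.path.getsize(os.path.join(root, name))
                          for name in files)
    return sizes
```

Here the two-level structure is what you want: the outer loop is the directory, and the inner sum is its files.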

The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler.
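
To illustrate the caching point with something that already exists in the stdlib: os.scandir (new in 3.5) yields DirEntry objects that keep the file-type information the directory scan returned, so is_dir() and is_file() usually need no extra system call. This is not the fts wrapper, just a minimal sketch of a flat walk built on scandir; flat_scan and pyc_files are hypothetical names, and the with-statement form of scandir assumes Python 3.6 or later:

```python
import os

def flat_scan(top):
    """Yield an os.DirEntry for every file and directory below top."""
    with os.scandir(top) as entries:
        for entry in entries:
            yield entry
            # is_dir() normally answers from data cached during the scan,
            # so descending does not cost a separate stat call per entry.
            if entry.is_dir(follow_symlinks=False):
                yield from flat_scan(entry.path)

def pyc_files(top):
    """Paths of regular files under top whose names end in .pyc, ignoring case."""
    return [entry.path for entry in flat_scan(top)
            if entry.is_file(follow_symlinks=False)
            and entry.name.lower().endswith('.pyc')]
```

Converting each DirEntry to a Path would discard exactly that cached information, which is the integration problem described above.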

Anyway, if you don't want either the efficiency or the simplicity, and just want an iterable of filenames or Paths, you might as well just use the wrapper around the existing os.walk that I wrote above. To make it work with Path objects:

    def flatpathwalk(root, *args, **kwargs):
        return map(pathlib.Path, flatwalk(str(root), *args, **kwargs))

And then to use those Path objects:

    matches = (path for path in flatpathwalk(root) if pattern.match(str(path)))

> For the general case it's probably easier to use os.walk(). But there are probably some
> common uses that deserve better direct support in e.g. the glob module. Would just a way
> to recursively search for matches using e.g. "**.txt" be sufficient? If not, can you
> specify what else you'd like? (Just "find-like" is too vague.)
>
> --Guido (mobile)

pathlib already has a glob method, which handles '*/*.py' and even recursive '**/*.py' (and a match method to go with it). If that's sufficient, it's already there. Adding direct support for Path objects in the glob module would just be a second way to do the exact same thing. And honestly, if open, os.walk, etc. aren't going to work with Path objects, why should glob.glob?
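
For reference, a small sketch of what glob and match already give you. The helper name find_py and the example paths are made up for illustration:

```python
from pathlib import Path

def find_py(root):
    """Return every *.py file under root, at any depth, as sorted Path objects."""
    # '**' matches zero or more directory levels, so files directly
    # in root are included as well.
    return sorted(Path(root).glob('**/*.py'))

# match() tests one path against a pattern; a relative pattern is
# matched from the right-hand end of the path.
assert Path('pkg/sub/mod.py').match('*.py')
assert Path('pkg/sub/mod.py').match('sub/*.py')
assert not Path('pkg/sub/mod.py').match('pkg/*.py')
```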

* Honestly, I think the problem here is that the pathlib module is just not useful. In a new language that used path objects (or, probably, URL objects) everywhere, it would be hard to design something better than pathlib. But as it is, while it's great for making really hairy path manipulation more readable, path manipulation never _gets_ really hairy, and os.path is already very well designed. And because pathlib doesn't know how to interact with anything else in the stdlib or third-party code, the wrapper code that constructs a Path on one end and calls str or bytes on the other (depending on which one you originally had) adds as much complexity as you saved. But that's obviously off-topic here.