[Python-ideas] find-like functionality in pathlib

Paul Moore p.f.moore at gmail.com
Mon Jan 11 15:00:54 EST 2016


On 11 January 2016 at 18:57, Gregory P. Smith <greg at krypto.org> wrote:
> On Wed, Jan 6, 2016 at 3:05 PM Brendan Moloney <moloney at ohsu.edu> wrote:
>>
>> It's important to keep in mind the main benefit of scandir is you don't
>> have to do ANY stat call in many cases, because the directory listing
>> provides some subset of this info. On Linux you can at least tell if a path
>> is a file or directory.  On Windows there is much more info provided by the
>> directory listing. Avoiding subsequent stat calls is also nice, but not
>> nearly as important due to OS level caching.
>
>
> +1 - this was one of the two primary motivations behind scandir.  Anything
> trying to reimplement a filesystem tree walker without using scandir is
> going to have sub-standard performance.
>
> If we ever offer anything with "find like functionality" related to pathlib,
> it needs to be based on scandir.  Anything else would just be repeating the
> convenient but untrue limiting assumptions of os.listdir: That the contents
> of a directory can be loaded into memory and that we don't mind re-querying
> the OS for stat information that it already gave us but we threw away as
> part of reading the directory.

This is very much why I feel that we need something in pathlib. I
understand the motivation for not caching stat information in path
objects. And I don't have a viable design for how a "find-like
functionality" API should be implemented in pathlib. But as it stands,
I feel as though using pathlib for anything that does bulk filesystem
scans is deliberately choosing something that I know won't scale well.
So (in my mind) pathlib doesn't fulfil the role of "one obvious way to
do things". Which is a shame, because Path.rglob is very often far
closer to what I need in my programs than os.walk (even when it's just
rootpath.rglob('*')).

In practice, by far the most common need I have[1] for filetree
walking is to get back a list of all the names of files
starting at a particular directory with the returned filenames
*relative to the given root*. Path.rglob gives pathnames that include
the root. os.walk gives the directory name (again including the root)
and the base filename.
Neither is what I want, although obviously in both cases it's pretty
trivial to extract the "relative to the root" part from the returned
data. But an API that gave that information directly, with
scandir-level speed and scalability, in the form of pathlib.Path
relative path objects, would be ideal for me[2].
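
To make that concrete, here is a minimal sketch of the sort of utility
function I mean. It assumes Python 3.6+ (where os.scandir accepts Path
objects and works as a context manager). The name iter_relative_files
is made up for illustration; it is not an existing pathlib or os API,
and this is not a proposal for what a pathlib method should look like.

```python
import os
from pathlib import Path


def iter_relative_files(root):
    """Yield every non-directory entry under root as a Path relative to root.

    Hypothetical helper: uses os.scandir so that the file/directory
    check can usually be answered from the directory listing itself,
    without a separate stat call per entry.
    """
    root = Path(root)
    stack = [Path()]  # directories still to visit, relative to root
    while stack:
        rel_dir = stack.pop()
        with os.scandir(root / rel_dir) as entries:
            for entry in entries:
                rel = rel_dir / entry.name
                # Symlinks are not followed, so a symlink to a
                # directory is yielded as an entry, not descended into.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(rel)
                else:
                    yield rel
```

Usage would be something like "for p in iter_relative_files('src'):
print(p)", giving Path objects such as pkg/module.py rather than
src/pkg/module.py. The order of results is whatever the OS returns, so
sort them if that matters.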

Paul

[1] And yes, I know this means I should just write a utility function for it :-)
[2] The feature creep starts when people want to control things like
pruning particular directories such as '.git', or only matching
particular glob patterns, or choosing whether or not to include
directories in the output, or... Adding *those* features without
ending up with a Frankenstein's monster of an API is the challenge :-)

