[Python-ideas] PEP471 - (os.scandir())

Mon Nov 30 01:45:25 EST 2015

On 28 November 2015 at 04:42, Andrew Barnert via Python-ideas
<python-ideas at python.org> wrote:
> On Nov 27, 2015, at 10:32, Paul Moore <p.f.moore at gmail.com> wrote:
>>
>>> On 26 November 2015 at 23:22, Erik <python at lucidity.plus.com> wrote:
>>> I have studied the PEP, followed a lot of the references and looked at the
>>> 3.5.0 implementation. I can't see that I've missed such a thing already
>>> existing, but it's possible. If so, perhaps this is instead a request to
>>> make that thing more obvious somehow!
>>
>> Does pathlib use scandir? If so, then maybe you get the caching
>> benefits by using pathlib? And if pathlib doesn't use scandir, maybe
>> it should? [I just checked, it looks like pathlib doesn't use scandir
>> :-(]
>
> Does pathlib even have a walk equivalent? (I know it has glob('**'), but that's not the same thing.)
>
> Or are you suggesting that people should use path.iterdir with explicit recursion (or an explicit stack), and therefore just changing iterdir to use scandir (and prefill as many cached attribs as possible in each result) is what we want?

The main problem with having pathlib do any caching at all is that
caching the results of stat calls implicitly in any context is a
recipe for significant confusion, since you're at the mercy of race
conditions as the filesystem changes out from underneath you. There
also isn't an obviously "right" answer in the general case for cache
invalidation, as in some cases you're interested in the file as it was
when you originally opened it, and don't care if it got swapped out
from underneath you, while in others you're interested in the file
path, and want the filesystem details for right now, not the details
from a few seconds ago.
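
As a minimal sketch of that hazard (CachedStatPath is a hypothetical
wrapper invented for this illustration, not anything pathlib provides),
consider an object that remembers its first stat result:

```python
import os
import tempfile


class CachedStatPath:
    """Hypothetical wrapper that remembers the first os.stat() result."""

    def __init__(self, path):
        self.path = path
        self._stat = None

    def stat(self):
        # Only the first call touches the filesystem; later calls reuse it.
        if self._stat is None:
            self._stat = os.stat(self.path)
        return self._stat


def demo_stale_cache():
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.txt")
        with open(target, "w") as f:
            f.write("abc")

        cached = CachedStatPath(target)
        first = cached.stat().st_size

        # The file changes underneath the cached object.
        with open(target, "w") as f:
            f.write("abcdefgh")

        stale = cached.stat().st_size    # still the old size
        fresh = os.stat(target).st_size  # what the filesystem says now
        return first, stale, fresh
```

Whether the stale value or the fresh one is the "right" answer depends
entirely on what the caller is trying to do.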

For os.scandir(), we just delegate the behaviour to the underlying
filesystem APIs - how readdir() and FindNextFile() react to the
directory contents changing during iteration is OS-defined, and Python
inherits that variation (so an in-progress iteration may miss newly
added files as a result).
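
One way to sidestep that variation, sketched here under the assumption
that a point-in-time listing is acceptable for the application, is to
materialise the listing before touching the directory. The helper names
are made up for this example:

```python
import os
import tempfile


def snapshot_names(path):
    # Exhaust the scandir() iterator up front so that later changes to
    # the directory cannot affect the result we hand back.
    return sorted(entry.name for entry in os.scandir(path))


def demo_snapshot():
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.txt", "b.txt"):
            open(os.path.join(tmp, name), "w").close()

        before = snapshot_names(tmp)

        # A file added after the snapshot is simply not in it.
        open(os.path.join(tmp, "c.txt"), "w").close()
        after = snapshot_names(tmp)
        return before, after
```

This says nothing about what a live iterator would have reported for
c.txt; that part remains up to the OS.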

The current os.walk() implementation constrains the scope of the
scandir() filesystem state caching, since it doesn't let the DirEntry
objects escape outside the generator - there's no need to ask yourself
"What's the risk of stale filesystem data here?", since you're not
getting access to the cached info in the first place, and hence always
need to query the filesystem directly.
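
For illustration, a caller that wants file sizes from os.walk() only
ever sees strings, so it has to issue its own stat call per file. The
function name below is just for this sketch:

```python
import os


def sizes_via_walk(top):
    """Map each file's path (relative to top) to its current size."""
    sizes = {}
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # os.walk() yields plain str names, not DirEntry objects,
            # so this is a fresh query against the filesystem.
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, top)] = os.stat(full).st_size
    return sizes
```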

This is a fairly universal pattern: for a given *application* you can
likely figure out what to cache and when to invalidate it, even though
those are unanswerable questions in the general case. Another example
of that would be the stat caches in the current implementation of the
import system, together with the corresponding need to call
importlib.invalidate_caches() if you want to make sure the import
system can see a module that was only just written to disk.
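
A rough sketch of that import system case follows. The helper name and
the temporary sys.path entry are assumptions made for the example:

```python
import importlib
import os
import sys
import tempfile


def import_freshly_written(module_name, source):
    """Write a module to a temporary directory and import it at once."""
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            with open(os.path.join(tmp, module_name + ".py"), "w") as f:
                f.write(source)
            # Tell the path finders to drop their cached directory
            # listings so the new file is guaranteed to be noticed.
            importlib.invalidate_caches()
            return importlib.import_module(module_name)
        finally:
            # Undo the global state changes made for the demonstration.
            sys.path.remove(tmp)
            sys.modules.pop(module_name, None)
            sys.path_importer_cache.pop(tmp, None)
```

For example, import_freshly_written("scratch_mod", "ANSWER = 42\n")
returns a module whose ANSWER attribute is 42.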

That's not to say that a general purpose directory walking utility
producing DirEntry objects isn't an interesting prospect. Rather, it's
an attempt to highlight that this is an area where there may be a
significant gulf between "works for my use case" and "is a suitable
addition to the standard library", particularly since this can now be
a pure Python recipe atop os.scandir.
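
Such a recipe might look something like the following. This is only a
sketch (scantree is not a standard library function), and it makes no
attempt to handle errors or symlink loops:

```python
import os


def scantree(top, follow_symlinks=False):
    """Recursively yield DirEntry objects for everything under top."""
    for entry in os.scandir(top):
        yield entry
        # is_dir() can usually be answered from the data scandir()
        # already collected, without a further system call.
        if entry.is_dir(follow_symlinks=follow_symlinks):
            yield from scantree(entry.path, follow_symlinks)
```

Anything the caller reads from the yielded entries (entry.stat(),
entry.is_file(), and so on) may be cached data, so deciding when that
data is too old becomes the caller's problem.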

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia