[Python-Dev] PEP 428: stat caching undesirable?

Pieter Nagel pieter at nagel.co.za
Wed May 1 13:22:20 CEST 2013


Antoine and Nick have convinced me that stat() calls can be a
performance issue.

I am still concerned about the best way to balance that against the need
to keep the API simple, though.

I'm still worried about the current behaviour, where a path can answer
True to is_file() in a long-running process merely because it was a
file last week.

In my experience there are use cases where most stat() calls one makes
(including indirectly via is_file() and friends) want up-to-date data.
There is also the risk of obtaining a Path object that already had its
stat() value cached some time ago without your knowledge (e.g. if the
Path was created for you by a walkdir-type function that in turn
also called is_file() before returning the result).

And needing to precede each is_file() etc. call with a restat() call
whose return value is not even used introduces undesirable temporal
coupling between the restat() and is_file() calls.

I see a few alternative solutions, not mutually exclusive:

1) Change the signature of stat(), and everything that indirectly uses
stat(), to take an optional 'fresh' keyword argument (or some synonym).
Then stat(fresh=True) becomes synonymous with the current restat(), and
the latter can be removed. Queries like is_file(fresh=True) will be
implemented by forwarding fresh to the underlying stat() call they are
implemented on.

What the default for 'fresh' should be can be debated, but I'd argue
that, for the sake of naive code, fresh should default to True; code
that is aware of stat() caching can then use fresh=False as required.
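
A minimal sketch of option 1, to make the proposed semantics concrete.
The class name SketchPath and its internals are hypothetical and are
not the PEP 428 API; the sketch only assumes that 'fresh' defaults to
True as argued above.

```python
import os
import stat


class SketchPath:
    """Hypothetical path object; illustrates the 'fresh' keyword only."""

    def __init__(self, path):
        self._path = path
        self._cached_stat = None

    def stat(self, fresh=True):
        # fresh=True always asks the filesystem and refreshes the cache;
        # fresh=False reuses a cached result if one exists.
        if fresh or self._cached_stat is None:
            # Drop the old value first so a failed stat() leaves no stale cache.
            self._cached_stat = None
            self._cached_stat = os.stat(self._path)
        return self._cached_stat

    def is_file(self, fresh=True):
        # Forward 'fresh' to the underlying stat() call.
        try:
            return stat.S_ISREG(self.stat(fresh=fresh).st_mode)
        except FileNotFoundError:
            return False
```

With this shape, naive code calling p.is_file() always sees current
data, while cache-aware code opts in with p.is_file(fresh=False) and
no separate restat() call is needed.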

2) The root of the issue is keeping the cached stat() value
indefinitely.

Therefore, limit the duration for which the cached value is valid. The
challenge is to find a way to express how long the value should be
cached without needing to call time.monotonic() or the like, which
presumably are also OS calls that will release the GIL.

One way would be to compute the number of virtual machine instructions
executed since the stat() call was cached, and set the limit there. Is
that still possible, now that sys.setcheckinterval() has been gutted?
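
For reference, the expiry semantics of option 2 can be sketched as
below. Note that this sketch does use time.monotonic() by default,
which is exactly the call I would like to avoid; it only illustrates
the behaviour, not a cheap way of measuring age. The class name
ExpiringStat and the max_age and clock parameters are hypothetical.

```python
import os
import time


class ExpiringStat:
    """Hypothetical helper: reuse an os.stat() result for max_age seconds."""

    def __init__(self, path, max_age=1.0, clock=time.monotonic):
        self._path = path
        self._max_age = max_age
        self._clock = clock  # injectable so a cheaper counter could be swapped in
        self._result = None
        self._stamp = None

    def stat(self):
        now = self._clock()
        expired = self._result is None or now - self._stamp > self._max_age
        if expired:
            # Drop the old value first so a failed stat() leaves no stale cache.
            self._result = None
            self._result = os.stat(self._path)
            self._stamp = now
        return self._result
```

Because the clock is just a callable, any cheaper monotonically
increasing counter could stand in for time.monotonic(), if one exists.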

3) Leave it up to performance critical code, such as the import
machinery, or walkdirs that Nick mentioned, to do their own caching, and
simplify the filepath API for the simple case.

But one can still make life easier for code like that, by adding
is_file() and friends on the stat result object as I suggested.

But this almost sounds like a PEP of its own, because although pathlib
will benefit from it, it is actually an orthogonal issue.

It raises all kinds of issues: should the signature be
statresult.isfile() to match os.path, or statresult.is_file() to match
PEP 428?
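
A rough sketch of what queries on the stat result could look like,
using a wrapper rather than changing os.stat_result itself. The names
StatInfo and stat_info are hypothetical, and the is_file() spelling is
chosen arbitrarily here; it does not settle the naming question.

```python
import os
import stat


class StatInfo:
    """Hypothetical wrapper that answers queries from one stat result."""

    def __init__(self, st):
        self.st = st

    def is_file(self):
        return stat.S_ISREG(self.st.st_mode)

    def is_dir(self):
        return stat.S_ISDIR(self.st.st_mode)


def stat_info(path):
    # One stat() call; every query on the returned object reuses it.
    return StatInfo(os.stat(path))
```

Code that does its own caching can then hold on to the StatInfo for as
long as it considers the data valid, and the staleness is explicit in
the caller rather than hidden inside the path object.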

-- 
Pieter Nagel