os.path.walk (was: Re: Optimizing code)

Fri Feb 25 09:54:19 EST 2000

François Pinard <pinard at iro.umontreal.ca> writes:

> This has bothered me several times already, in Python.  Perl has a device
> caching the last `stat' result, quite easy to use, for allowing users to
> precisely optimise such cases.  In many cases, the user has no reason
> to think the file system changed enough, recently, to be worth calling
> `stat' again.  Of course, one might call `stat' in his code and use the
> resulting info block, and this is what I do.  But this does not interface
> well with os.path.walk.
> 
> It would be nice if the Python library was maintaining a little cache for
> `stat', and if there was a way for users to interface with it as wanted.

Question-- what would happen to code that uses os.path.exists() or one
of its friends repeatedly, waiting for a file to appear or disappear?
This code would break unless you put a timeout in your little stat
cache, which would probably reduce its speed.
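
To make the concern concrete, here is a minimal sketch of such a cache
in present-day Python.  StatCache, its ttl parameter and the injectable
clock are hypothetical names chosen for illustration; nothing like this
exists in the standard library.  Within ttl seconds of a lookup,
exists() keeps returning the old answer, which is what would break a
polling loop; shrinking ttl fixes that at the price of more real stat
calls.

```python
import os
import time


class StatCache:
    """Hypothetical helper: remember os.stat() results for `ttl` seconds."""

    def __init__(self, ttl=1.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so the timeout can be tested
        self._entries = {}      # path -> (time of lookup, stat result or None)

    def stat(self, path):
        now = self.clock()
        hit = self._entries.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]       # possibly stale answer, no system call
        try:
            result = os.stat(path)
        except OSError:
            result = None       # a missing file is cached as well
        self._entries[path] = (now, result)
        return result

    def exists(self, path):
        return self.stat(path) is not None
```

A loop such as "while not cache.exists(path): ..." would not notice a
newly created file until the cached entry expires, so the timeout puts
a floor on how quickly such code can react.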

> By the way, the `find' program has some optimisations to avoid calling
> `stat', which yield a very significant speed-up on Unix (I do not know
> that these optimisations can be translated to other OS-es, however).
> Could os.path.walk use them, if not already?  The main trick is to save
> the number of links on the `.' entry, knowing that we have one link
> in the including directory, one link for the `.' entry itself, and one
> `..' link per sub-directory.  Each time we `stat' a sub-directory in the
> current one, we decrease the saved count.  When it reaches 2, we know
> that no directories remain, and so, can spare all remaining `stat' calls.
> It even works for the root directory, because the fact that there is no
> including directory is compensated by the fact that `..' points to `/'.

Yes, that's an old one.  I recall a discussion many many years ago on
c.l.perl where Tom Christiansen (or was it Larry Wall?) concluded
"thus it behooves us to arrange directories so that subdirectories
occur first" (or similar words).
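
For readers who have not seen the trick, a rough sketch in present-day
Python follows.  The function name subdirs and the returned call count
are made up for illustration, and the code assumes the classic Unix
convention that a directory's link count is 2 plus the number of its
subdirectories.  Where the count is below 2 the sketch gives up on the
shortcut and checks every entry.

```python
import os
import stat


def subdirs(directory):
    """Return (subdirectory names, number of lstat calls on entries).

    Hypothetical helper: stops calling lstat once the link count of
    `directory` says that no subdirectories remain.
    """
    names = os.listdir(directory)
    nlink = os.lstat(directory).st_nlink
    # Assumed convention: one link from the parent, one for '.', and one
    # '..' link per subdirectory.  A count below 2 means the file system
    # does not follow the convention, so the shortcut is disabled.
    remaining = nlink - 2 if nlink >= 2 else None
    found = []
    calls = 0
    for name in names:
        if remaining == 0:
            break               # every subdirectory is already accounted for
        mode = os.lstat(os.path.join(directory, name)).st_mode
        calls += 1
        if stat.S_ISDIR(mode):
            found.append(name)
            if remaining is not None:
                remaining -= 1
    return found, calls
```

How many calls are saved depends on where the subdirectories fall in
the listing: if they come first, the loop stops right after the last
one; if one comes last, nothing is saved.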

> I surely often use `os.path.walk' in my own things, so any speed improvement
> in that area would be welcome for me.

Personally, for me it matters more that the code of walk() is only a
few totally portable lines.  Using knowledge like that described above is
nice when you are writing a utility like find that is closely tied to
the OS on which it is running; however I can imagine any number of
reasons why the above rule might fail (mount points, NFS, Samba,
symbolic links, automounter, race conditions).  I'm not saying that
any of those will make it fail -- it's just impossible to be SURE that
they DON'T make it fail.  I'd rather have the right answer tomorrow
than the wrong answer today, thank you.

(Note that I'm not against doing this in your own code.  I'm
reluctant to add hacks like this to the standard library -- a bug
there multiplies by a million.)

--Guido van Rossum (home page: http://www.python.org/~guido/)