os.path.walk (was: Re: Optimizing code)

Gordon McMillan gmcm at hypernet.com
Fri Feb 25 08:45:10 EST 2000


François Pinard writes:

> This has bothered me several times already, in Python.  Perl has a device
> caching the last `stat' result, quite easy to use, for allowing users to
> precisely optimise such cases.  In many cases, the user has no reason
> to think the file system changed enough, recently, to be worth calling
> `stat' again.  Of course, one might call `stat' in his code and use the
> resulting info block, and this is what I do.  But this does not interface
> well with os.path.walk.

Python has the statcache module. os.path.walk does not use it.
 
> It would be nice if the Python library was maintaining a little cache for
> `stat', and if there was a way for users to interface with it as wanted.
> 
> By the way, the `find' program has some optimisations to avoid calling
> `stat', which yield a very significant speed-up on Unix (I do not know
> that these optimisations can be translated to other OS-es, however).
> Could os.path.walk use them, if not already?  The main trick is to save
> the number of links on the `.' entry, knowing that we have one link
> in the including directory, one link for the `.' entry itself, and one
> `..' link per sub-directory.  Each time we `stat' a sub-directory in the
> current one, we decrease the saved count.  When it reaches 2, we know
> that no directories remain, and so, can spare all remaining `stat' calls.
> It even works for the root directory, because the fact that there is no
> including directory is compensated by the fact that `..' points to `/'.
> 
> I surely often use `os.path.walk' in my own things, so any speed improvement
> in that area would be welcome for me.
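
The link-count trick described above can be sketched roughly as follows.
This is an illustration, not what os.path.walk or `find' actually does:
walk_dirs is a made-up name, and the sketch assumes a file system that
follows the Unix convention that a directory has 2 links plus one per
sub-directory. Where the reported count is below 2 the shortcut is
switched off and every entry is checked. It subtracts the 2 fixed links
up front, so it stops at 0 rather than at 2.

```python
import os
import stat


def walk_dirs(top):
    """Yield top and every directory below it, skipping lstat calls once
    the link count says no sub-directories remain."""
    yield top
    try:
        nlink = os.lstat(top).st_nlink
        names = os.listdir(top)
    except OSError:
        return
    # A count below 2 means the convention does not hold here; None
    # disables the shortcut so that every entry gets checked.
    remaining = nlink - 2 if nlink >= 2 else None
    for name in names:
        if remaining == 0:
            break  # all sub-directories found; spare the remaining lstats
        path = os.path.join(top, name)
        try:
            is_dir = stat.S_ISDIR(os.lstat(path).st_mode)
        except OSError:
            continue
        if is_dir:
            if remaining is not None:
                remaining -= 1
            yield from walk_dirs(path)
```

On a file system that reports a count of 2 or more without following the
convention, this would miss directories, which is the reason `find' has a
switch to turn the optimisation off.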

There's also a dircache module, which caches os.listdir output, and
stats the directory to see if the cache is still valid. These tricks
get wildly platform specific though. In my experience on Windows, to
get a valid stat on a directory, you need to stat some file in that
directory. Also, the timestamp resolution on Windows is 2 seconds, so
if the cache is less than 2 seconds old, you can't assume it is valid.
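
A rough sketch of that kind of listing cache, including the guard against
coarse timestamps. The names cached_listdir and _listing_cache and the
granularity parameter are made up for illustration; this is not the
dircache API, and it does nothing about the Windows directory-stat quirk.

```python
import os
import time

_listing_cache = {}  # hypothetical cache: path -> (mtime, listed_at, names)


def cached_listdir(path, granularity=2.0):
    """Return a sorted os.listdir(path), reusing the cached list while the
    directory mtime is unchanged and the list was taken safely after it."""
    mtime = os.stat(path).st_mtime
    entry = _listing_cache.get(path)
    if entry is not None:
        cached_mtime, listed_at, names = entry
        # Trust the cache only if the listing was made more than one
        # timestamp tick after the recorded mtime; otherwise a later
        # change could carry the same mtime and go unnoticed.
        if cached_mtime == mtime and listed_at - cached_mtime > granularity:
            return names
    names = sorted(os.listdir(path))
    _listing_cache[path] = (mtime, time.time(), names)
    return names
```

The cost is that a directory modified within the last granularity seconds
is always listed again, which is the price of not trusting a fresh cache.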

- Gordon



