[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

Robert Collins robertc at robertcollins.net
Wed Nov 14 10:53:44 CET 2012


On Wed, Nov 14, 2012 at 10:33 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Wed, Nov 14, 2012 at 5:14 PM, Ronald Oussoren <ronaldoussoren at mac.com>
> wrote:
>>
>> How did you measure the 5x speedup you saw with your modified os.walk?
>>
>> It would be interesting to see if Unix platforms have a similar speedup,
>> because if they don't, the new API could just return the results of stat
>> (or lstat ...).
>>
>
> One thing to keep in mind with these kinds of metrics is that I/O latency is
> a major factor. Solid state vs spinning disk vs network drive is going to
> make a *big* difference to the relative performance of the different
> mechanisms. With NFS (et al), it's particularly important to minimise the
> number of round trips to the server (that's why the new dir listing caching
> in the 3.3 import system results in such dramatic speed-ups when some of the
> sys.path entries are located on network drives).


Data from bzr: you can get a very significant speedup by doing two things:
 - use readdir to get the inode numbers of the files in the directory,
and stat the files in increasing inode-number order (this gives you
monotonically increasing IO).
 - chdir to the directory before you stat and use a relative path: it
turns out when working with many files that the overhead of absolute
paths is substantial.
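
As a rough illustration of the two steps above (not the bzr code itself,
which is a pyrex module), here is a minimal Python sketch. It assumes
os.scandir, which only exists in Python 3.5 and later, as a stand-in for
a raw readdir call; the function name stat_in_inode_order is made up for
this example.

```python
import os


def stat_in_inode_order(directory):
    """Return {name: os.stat_result} for the entries of `directory`.

    Hypothetical helper: lists the directory once, sorts the entries by
    inode number, then lstat()s each one by relative name from inside
    the directory.
    """
    # Step 1: collect (inode, name) pairs from the directory listing.
    # On POSIX, DirEntry.inode() comes from readdir's d_ino, so no
    # per-file stat call is needed to learn the ordering.
    with os.scandir(directory) as entries:
        by_inode = sorted((entry.inode(), entry.name) for entry in entries)

    results = {}
    saved_cwd = os.getcwd()
    # Step 2: chdir into the directory so every lstat uses a short
    # relative path instead of a long absolute one.
    os.chdir(directory)
    try:
        for _inode, name in by_inode:
            results[name] = os.lstat(name)
    finally:
        # Always restore the caller's working directory.
        os.chdir(saved_cwd)
    return results
```

Note that chdir changes process-wide state, so a sketch like this is only
safe when no other thread depends on the current working directory. How
much it helps will depend on the filesystem and storage involved.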

We got (IIRC) a 90% reduction in 'bzr status' time applying both of
these things, and you can grab the pyrex module needed to do readdir
from bzr, though we tuned what we had to match the needs of a VCS, so
it's likely too convoluted for general-purpose use.

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services


