Optimizing tips for os.listdir

Mon Sep 27 14:24:31 EDT 2004

On 27 Sep 2004 14:30:18 GMT, Nick Craig-Wood <nick at craig-wood.com> wrote:

>Thomas <2002 at weholt.org> wrote:
>>  I'm doing this :
>> 
>>  [os.path.join(path, p) for p in os.listdir(path) if \
>>  os.path.isdir(os.path.join(path, p))]
You ought to be able to gain a little by hoisting the os.path.xxx
attribute lookups for join and isdir out of the loop. E.g., (not tested)

    opj=os.path.join; oisd=os.path.isdir
    [opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]
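
For anyone who wants to try it, here is the same hoisting idea as a
self-contained sketch. The function name list_subdirs is made up for
illustration; only os.listdir, os.path.join and os.path.isdir come from
the snippets above, and how much it gains will depend on the system.

```python
import os

def list_subdirs(path):
    """Return full paths of the immediate subdirectories of path."""
    # Bind the functions to local names once, so the loop does not
    # repeat the os.path attribute lookups for every directory entry.
    opj = os.path.join
    oisd = os.path.isdir
    return [opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]
```

This still does one isdir (one stat) per entry; it only trims the
Python-level overhead around each call.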

But it seems like you are asking the OS to chase through the full path at
every isdir call, rather than just making the directory you are interested
in the current working directory and testing bare names there. E.g., (untested)

    savedir = os.getcwd()
    os.chdir(path)
    dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
    os.chdir(savedir)
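
A slightly more defensive sketch of the chdir variant, in case listdir
or isdir raises partway through: the try/finally puts the working
directory back either way. The name list_subdirs_chdir is hypothetical,
and whether this is actually faster is still a guess.

```python
import os

def list_subdirs_chdir(path):
    """List subdirectories of path by testing bare names from inside it."""
    savedir = os.getcwd()
    os.chdir(path)
    try:
        # isdir on a bare name is resolved relative to the new cwd,
        # so the OS does not have to walk the full path each time.
        names = [p for p in os.listdir('.') if os.path.isdir(p)]
    finally:
        # Restore the original working directory even on error.
        os.chdir(savedir)
    # Join afterwards so the result matches the join/isdir version.
    return [os.path.join(path, p) for p in names]
```

Note that the current working directory is process-wide, so this is
not something to do while other threads depend on relative paths.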

>> 
>>  to get a list of folders in a given directory, skipping all plain
>>  files. When used on folders with lots of files,  it takes rather long
>>  time to finish. Just doing  a listdir, filtering out all plain files
>>  and a couple of joins, I didn't think this would take so long. 
>
I'd be curious to know how much difference the above would make.

>How many files, what OS and what filing system?
>
>Under a unix based OS the above will translate to 1
>opendir()/readdir()/closedir() and 1 stat() for each file.  There
>isn't a quicker way in terms of system calls AFAIK.
Except I would think chdir could help there too?
>
>However some filing systems handle lots of files in a directory better
>than others.  Eg reiserfs is much better than ext2/3 for this purpose.
>(ext3 has a dirhash module to fix this in the works though).
>
>Eg on my linux box, running ext3, with various numbers of files in a
>directory :-
>
>/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
>'[os.path.join(path, p) for p in os.listdir(path) if \
>os.path.isdir(os.path.join(path, p))]'
>
path="." might be a special case in some of that though. I would
try a long absolute path for comparison. (What did the OP have as
actual use case?)
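
One way to make that comparison is to wrap the same expression in a
small helper and time it for different spellings of the path. This is
only a sketch: time_listing is a hypothetical name, it uses the timeit
module from Python rather than the command line, and passing a callable
to timeit.timeit needs a newer Python than the 2.3 used in the timings
quoted here.

```python
import os
import timeit

def time_listing(path, number=100):
    """Return average seconds per loop for the join/isdir listing of path."""
    def run():
        return [os.path.join(path, p) for p in os.listdir(path)
                if os.path.isdir(os.path.join(path, p))]
    # timeit.timeit returns the total time for `number` calls.
    return timeit.timeit(run, number=number) / number
```

Run from inside the directory of interest, time_listing(".") and
time_listing(os.getcwd()) would give the relative and absolute numbers
to compare.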

>Files  Time
>
>10     3.01e+02 usec per loop
>100    2.74e+03 usec per loop
>1000   2.73e+04 usec per loop
>10000  2.76e+05 usec per loop
>100000 2.81e+06 usec per loop
>
>Which is pretty linear... much more so than I expected!
>
>The above timings ignore the effect of caching - will the directory
>you are enumerating be hot in the cache?
Even if so, I doubt the OS finds it via a hash of the full path instead
of checking that every element of the path exists and is a subdirectory.
I would think that could be a dangerous shortcut, whereas chdir and using the
cwd should be fast and safe, and most likely guarantees that the content is cached.

Just guessing, though ;-)

>
>Something similar may apply under Windows but I don't know ;-)
>
>-- 
>Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick

Regards,
Bengt Richter