Optimizing tips for os.listdir

Nick Craig-Wood nick at craig-wood.com
Mon Sep 27 10:30:18 EDT 2004


Thomas <2002 at weholt.org> wrote:
>  I'm doing this :
> 
>  [os.path.join(path, p) for p in os.listdir(path) if \
>  os.path.isdir(os.path.join(path, p))]
> 
>  to get a list of folders in a given directory, skipping all plain
>  files. When used on folders with lots of files,  it takes rather long
>  time to finish. Just doing  a listdir, filtering out all plain files
>  and a couple of joins, I didn't think this would take so long. 

How many files, what OS and what filing system?

Under a Unix-based OS the above translates to one
opendir()/readdir()/closedir() sequence for the directory, plus one
stat() for each entry in it.  There isn't a quicker way in terms of
system calls AFAIK.
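
Spelled out as an explicit loop, the comprehension does roughly the
following.  This is only a sketch: list_subdirs is a name made up for
illustration, and the comments about system calls describe what a
typical Unix build of Python is expected to do, not something verified
with a tracer.

```python
import os

def list_subdirs(path):
    """Return the full paths of the subdirectories of path."""
    subdirs = []
    # One directory scan: opendir(), repeated readdir(), closedir().
    for name in os.listdir(path):
        full = os.path.join(path, name)
        # One stat() per entry.  os.path.isdir() follows symlinks and
        # returns False if the stat() fails.
        if os.path.isdir(full):
            subdirs.append(full)
    return subdirs
```

The join is done once per entry here rather than twice, but that is
string work only; the per-entry stat() is what remains.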

However, some filing systems handle lots of files in a directory better
than others.  E.g. reiserfs is much better than ext2/3 for this purpose
(though a dirhash module for ext3 is in the works to fix this).

E.g. on my Linux box, running ext3, with various numbers of files in a
directory:

/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]'

Files  Time

10     3.01e+02 usec per loop
100    2.74e+03 usec per loop
1000   2.73e+04 usec per loop
10000  2.76e+05 usec per loop
100000 2.81e+06 usec per loop

Which is pretty linear... much more so than I expected!
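
To make the linearity concrete, dividing each time by the number of
files gives the cost per file.  A small sketch using only the figures
from the table above (the names timings and usec_per_file are made up
here):

```python
# (files, usec per loop) pairs copied from the table above.
timings = [
    (10, 3.01e+02),
    (100, 2.74e+03),
    (1000, 2.73e+04),
    (10000, 2.76e+05),
    (100000, 2.81e+06),
]

def usec_per_file(files, usec):
    """Average cost of one directory entry, in microseconds."""
    return usec / files

per_file = [usec_per_file(n, t) for n, t in timings]

for (n, _), cost in zip(timings, per_file):
    print("%6d files: %.1f usec per file" % (n, cost))
```

Every row works out to roughly 27 to 30 usec per file, which is what
a linear relationship looks like.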

The above timings ignore the effect of caching: timeit repeats the
statement many times, so the directory was hot in the cache.  Will the
directory you are enumerating be hot in the cache?
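
One rough way to get a feel for this is to time the first pass over a
directory separately from the repeats.  The sketch below assumes a
modern Python 3 (time.perf_counter does not exist in 2.3), and the
function names are made up for illustration.  Note that the first call
in a process is not guaranteed to be cold: the OS may already have the
directory cached from earlier activity.

```python
import os
import time

def list_subdirs(path):
    """The comprehension from the original question."""
    return [os.path.join(path, p) for p in os.listdir(path)
            if os.path.isdir(os.path.join(path, p))]

def first_and_repeat_times(path, repeats=5):
    """Return (first_call_seconds, best_repeat_seconds) for path."""
    # First pass: may have to go to disk if the directory is not cached.
    start = time.perf_counter()
    list_subdirs(path)
    first = time.perf_counter() - start

    # Later passes: the directory should now be in the OS cache.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        list_subdirs(path)
        best = min(best, time.perf_counter() - start)
    return first, best
```

A large gap between the two numbers would suggest the first pass was
paying for disk access rather than for the system calls themselves.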

Something similar may apply under Windows but I don't know ;-)

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick


