Optimizing tips for os.listdir
Nick Craig-Wood
nick at craig-wood.com
Mon Sep 27 10:30:18 EDT 2004
Thomas <2002 at weholt.org> wrote:
> I'm doing this :
>
> [os.path.join(path, p) for p in os.listdir(path) if \
> os.path.isdir(os.path.join(path, p))]
>
> to get a list of folders in a given directory, skipping all plain
> files. When used on folders with lots of files, it takes rather long
> time to finish. Just doing a listdir, filtering out all plain files
> and a couple of joins, I didn't think this would take so long.
How many files, what OS and what filing system?
Under a Unix-based OS the above will translate to one
opendir()/readdir()/closedir() sequence for the directory, plus one
stat() for each entry in it. There isn't a quicker way in terms of
system calls AFAIK.
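To make that per-entry cost visible, here is a minimal sketch of the same filter written as an explicit loop. The function name list_subdirs is just an illustration (it is not from the original post); it should behave like the list comprehension, while building each joined path only once instead of twice.

```python
import os
import stat

def list_subdirs(path):
    """Return full paths of the immediate subdirectories of path."""
    subdirs = []
    for name in os.listdir(path):         # one read of the directory
        full = os.path.join(path, name)   # join once and reuse it
        try:
            mode = os.stat(full).st_mode  # one stat() per entry
        except OSError:
            continue                      # entry vanished or is a broken symlink
        if stat.S_ISDIR(mode):
            subdirs.append(full)
    return subdirs
```

This saves a join per entry but not a system call: the stat() per entry is still there, so any gain is likely to be small.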
However, some filing systems handle lots of files in a directory better
than others. E.g. reiserfs is much better than ext2/3 for this purpose.
(ext3 has a dirhash module to fix this in the works though).
E.g. on my Linux box, running ext3, with various numbers of files in a
directory :-
/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]'
 Files   Time
    10   3.01e+02 usec per loop
   100   2.74e+03 usec per loop
  1000   2.73e+04 usec per loop
 10000   2.76e+05 usec per loop
100000   2.81e+06 usec per loop
Which is pretty linear... much more so than I expected!
The above timings ignore the effect of caching - will the directory
you are enumerating be hot in the cache?
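If you want to repeat the measurement on your own machine, something like the following sketch should do it. It assumes a Python with tempfile.TemporaryDirectory, and the helper name time_listing and the file counts are made up for illustration. It fills a scratch directory with empty files and times the same expression with the timeit module.

```python
import os
import tempfile
import timeit

def time_listing(n_files, repeat=3):
    """Time the isdir-filtered listing on a scratch directory of n_files empty files."""
    with tempfile.TemporaryDirectory() as path:
        for i in range(n_files):
            open(os.path.join(path, "f%d" % i), "w").close()

        def listing():
            return [os.path.join(path, p) for p in os.listdir(path)
                    if os.path.isdir(os.path.join(path, p))]

        # Best of several single runs, in seconds per call
        return min(timeit.repeat(listing, number=1, repeat=repeat))

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print("%7d  %.2e usec per loop" % (n, time_listing(n) * 1e6))
```

Note that the files have only just been created, so the directory will almost certainly be hot in the cache; a cold directory could look quite different.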
Something similar may apply under Windows but I don't know ;-)
--
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick