[Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info

MRAB python at mrabarnett.plus.com
Fri May 10 16:30:54 CEST 2013


On 10/05/2013 11:55, Ben Hoyt wrote:
> A few of us were having a discussion at
> http://bugs.python.org/issue11406 about adding os.scandir(): a
> generator version of os.listdir() to make iterating over very large
> directories more memory efficient. This also reflects how the OS gives
> things to you -- it doesn't give you a big list, but you call a
> function to iterate and fetch the next entry.
>
> While I think that's a good idea, I'm not sure just that much is
> enough of an improvement to make adding the generator version worth
> it.
>
> But what would make this a killer feature is making os.scandir()
> generate tuples of (name, stat_like_info). The Windows directory
> iteration functions (FindFirstFile/FindNextFile) give you the full
> stat information for free, and the Linux and OS X functions
> (opendir/readdir) give you partial file information (d_type in the
> dirent struct, which is basically the st_mode part of a stat, whether
> it's a file, directory, link, etc).
>
> Having this available at the Python level would mean we can vastly
> speed up functions like os.walk() that otherwise need to make an
> os.stat() call for every file returned. In my benchmarks of such a
> generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X,
> it's more like 1.5-3x. In my opinion, that kind of gain is huge,
> especially on Windows, but also on Linux/OS X.
>
> So the idea is to add this relatively low-level function that exposes
> the extra information the OS gives us for free, but which os.listdir()
> currently throws away. Then higher-level, platform-independent
> functions like os.walk() could use os.scandir() to get much better
> performance. People over at Issue 11406 think this is a good idea.
>
> HOWEVER, there's debate over what kind of object the second element in
> the tuple, "stat_like_info", should be. My strong vote is for it to be
> a stat_result-like object, but where the fields are None if they're
> unknown. There would be basically three scenarios:
>
> 1) stat_result with all fields set: this would happen on Windows,
> where you get as much info from FindFirst/FindNext as from an
> os.stat()
> 2) stat_result with just st_mode set, and all other fields None: this
> would be the usual case on Linux/OS X
> 3) stat_result with all fields None: this would happen on systems
> whose readdir()/dirent doesn't have d_type, or on Linux/OS X when
> d_type was DT_UNKNOWN
>
> Higher-level functions like os.walk() would then check the fields they
> needed are not None, and only call os.stat() if needed, for example:
>
> # Build lists of files and directories in path
> files = []
> dirs = []
> for name, st in os.scandir(path):
>     if st.st_mode is None:
>         st = os.stat(os.path.join(path, name))
>     if stat.S_ISDIR(st.st_mode):
>         dirs.append(name)
>     else:
>         files.append(name)
>
> Not bad for a 2-10x performance boost, right? What do folks think?
>
> Cheers,
> Ben.
>
[snip]
On the python-ideas list there's a thread, "PEP: Extended stat_result",
about adding methods to stat_result.

Using that, you wouldn't necessarily have to look at st.st_mode
yourself. The method could perform the additional os.stat() itself if
the field was None. For example:

# Build lists of files and directories in path
files = []
dirs = []
for name, st in os.scandir(path):
    if st.is_dir():
        dirs.append(name)
    else:
        files.append(name)

That looks much nicer.
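
As a rough sketch of how such a method might behave: the names LazyStat,
fake_scandir and split_entries below are hypothetical, made up purely for
illustration, and none of them is an existing or proposed API.
fake_scandir just wraps os.listdir() and reports every st_mode as
unknown, which corresponds to the worst case described above.

```python
import os
import stat


class LazyStat:
    """Hypothetical stat-like object (an assumption, not a real os/stat API)."""

    def __init__(self, path, st_mode=None):
        self._path = path
        # None means the directory iteration did not provide this field.
        self.st_mode = st_mode

    def _ensure_mode(self):
        # Fall back to a real os.stat() only when the mode is unknown,
        # and remember the answer so it is fetched at most once.
        if self.st_mode is None:
            self.st_mode = os.stat(self._path).st_mode
        return self.st_mode

    def is_dir(self):
        return stat.S_ISDIR(self._ensure_mode())


def fake_scandir(path):
    """Stand-in for the proposed os.scandir(); every st_mode is unknown."""
    for name in os.listdir(path):
        yield name, LazyStat(os.path.join(path, name))


def split_entries(path):
    """Build lists of files and directories in path."""
    files = []
    dirs = []
    for name, st in fake_scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
    return files, dirs
```

On a platform that fills in st_mode during iteration, is_dir() would
answer without touching the filesystem again; only the unknown case
costs an extra os.stat().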

