[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Gregory P. Smith greg at krypto.org
Sat Jun 28 08:17:55 CEST 2014

On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> * -1 on including Windows specific globbing support in the API
> * -0 on including cross platform globbing support in the initial iteration
> of the API (that could be done later as a separate RFE instead)
Agreed.  Globbing or filtering support should not hold this up.  If that
part isn't settled, just don't include it and work out what it should be as
a future enhancement.

> * +1 on a new section in the PEP covering rejected design options (calling
> it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
+1.  IMNSHO, this is one of the most important parts of a PEP: capturing the
entire decision process to document the "why nots".

> * regarding "why not a 2-tuple", we know from experience that operating
> systems evolve and we end up wanting to add additional info to this kind of
> API. A dedicated DirEntry type lets us adjust the information returned over
> time, without breaking backwards compatibility and without resorting to
> ugly hacks like those in some of the time and stat APIs (or even our own
> codec info APIs)
> * it would be nice to see some relative performance numbers for NFS and
> CIFS network shares - the additional network round trips can make excessive
> stat calls absolutely brutal from a speed perspective when using a network
> drive (that's why the stat caching added to the import system in 3.3
> dramatically sped up the case of having network drives on sys.path, and why
> I thought AJ had a point when he was complaining about the fact we didn't
> expose the dirent data from os.listdir)
fwiw, I wouldn't wait for benchmark numbers.

A needless stat call when you've already got the information from an earlier
API call is brutal enough on its own. The cost is easy to estimate from
existing ballpark figures. Remote file server / cloud access: ~100ms. Local
spinning disk seek+read: ~10ms. Fetch of stat info cached in memory on a file
server on the local network: ~500us.  You can go down further to local system
call overhead, which can vary wildly but should likely be assumed to be at
least 10us.
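
To make the arithmetic concrete, here is a rough back-of-envelope sketch in Python. It only multiplies the ballpark latencies quoted above by a directory size; the 10,000-entry directory and the helper name `extra_stat_cost` are illustrative assumptions, not measurements.

```python
# Ballpark per-call latencies taken from the figures above, in seconds.
LATENCY_S = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call": 10e-6,
}


def extra_stat_cost(num_entries, per_call_s):
    """Total blocking time for one avoidable stat call per directory entry."""
    return num_entries * per_call_s


if __name__ == "__main__":
    n = 10_000  # hypothetical directory size, chosen only for illustration
    for label, latency in LATENCY_S.items():
        print(f"{label:32s} {extra_stat_cost(n, latency):10.3f} s")
```

Even the cheapest networked case (~500us per call) adds up to seconds of blocking time for a directory of that size under these assumptions.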

You don't need a benchmark to tell you that adding needless >= 500us-100ms
blocking operations to your program is bad. :)
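
As a minimal sketch of the difference being discussed, the two functions below count subdirectories with and without a per-entry stat call. This assumes the os.scandir() API as it later shipped in the standard library (Python 3.5+, with context-manager support from 3.6), which may differ from the draft under discussion in this thread. The function names are made up for illustration.

```python
import os


def count_dirs_listdir(path):
    """Count subdirectories with os.listdir(); costs one stat call per entry."""
    count = 0
    for name in os.listdir(path):
        # os.path.isdir() performs a stat call for every name.
        if os.path.isdir(os.path.join(path, name)):
            count += 1
    return count


def count_dirs_scandir(path):
    """Count subdirectories with os.scandir(), reusing directory-read data."""
    with os.scandir(path) as entries:
        # is_dir() can usually answer from the directory entry itself;
        # symlinks and some file systems may still need a system call.
        return sum(1 for entry in entries if entry.is_dir())
```

Both functions should return the same count; the second simply avoids most of the extra round trips, which is where the savings on NFS or CIFS shares would come from.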
