[Python-Dev] PEP 471 (scandir): Poll to choose the implementation (full C or C+Python)

Victor Stinner victor.stinner at gmail.com
Fri Feb 13 11:07:03 CET 2015


Hi,

TL;DR: are you OK with adding ~800 lines of C code for os.scandir(),
which is 4x faster than os.listdir() when the file type is checked?

I accepted PEP 471 (os.scandir) a few months ago, but it is not
implemented yet in Python 3.5, because I haven't made a choice on the
implementation.

Ben Hoyt wrote different implementations:
- full C: os.scandir() and DirEntry are written in C (no change on os.py)
- C+Python: os._scandir() (wrapper for opendir/readdir and
FindFirstFileW/FindNextFileW) in C, DirEntry in Python
- ctypes: os.scandir() and DirEntry fully implemented in Python

I'm not interested in the ctypes implementation. It's useful for a
third-party project hosted on PyPI, but for CPython I prefer to wrap C
functions using C code.


In short, the C implementation is faster than the C+Python implementation.

Issue #22524 (*) is full of benchmark numbers. IMO the most
interesting benchmark compares os.listdir() + os.stat() against
os.scandir() + DirEntry.is_dir(). Let me try to summarize the results
of this benchmark:

* C implementation: scandir is at least 3.5x faster than listdir, up
to 44.6x faster on Windows
* C+Python implementation: scandir is not really faster than listdir,
between 1.3x and 1.4x faster

(*) http://bugs.python.org/issue22524
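To make the comparison concrete, here is a minimal sketch of the two
code paths being benchmarked (the function names and the timed
directory are mine, not from the bug tracker; the real benchmark runs
on much larger directory trees):

```python
import os
import stat
import timeit

def count_dirs_listdir(path):
    """os.listdir() plus one os.stat() system call per entry."""
    n = 0
    for name in os.listdir(path):
        if stat.S_ISDIR(os.stat(os.path.join(path, name)).st_mode):
            n += 1
    return n

def count_dirs_scandir(path):
    """os.scandir(); DirEntry.is_dir() can usually answer from the
    file type the OS already returned while listing the directory,
    so no extra stat() call is needed on most platforms."""
    n = 0
    for entry in os.scandir(path):
        if entry.is_dir():
            n += 1
    return n

if __name__ == "__main__":
    # Rough timing on the current directory (an arbitrary choice).
    for func in (count_dirs_listdir, count_dirs_scandir):
        t = timeit.timeit(lambda: func("."), number=100)
        print(func.__name__, round(t, 3))
```

The speedup comes from avoiding one stat() system call per entry, so
it grows with the number of entries and with per-call latency (hence
the dramatic numbers on Windows and on NFS).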


Ben Hoyt reminded me that os.scandir() (PEP 471) doesn't add any new
feature: pathlib already provides a nice API on top of the os and
os.path modules. (You may even notice that DirEntry has far fewer
methods ;-)) The main (only?) purpose of the PEP is performance.
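For illustration, the same directory query is already expressible with
pathlib today (the helper name is mine; at the time of writing,
pathlib's is_dir() performs a stat() call per entry rather than using
scandir()):

```python
from pathlib import Path

def subdirs_via_pathlib(path):
    # Path.iterdir() yields full Path objects, which carry a much
    # richer API than the handful of methods DirEntry offers.
    return sorted(p.name for p in Path(path).iterdir() if p.is_dir())
```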

If os.scandir() is "only" 1.4x faster, I don't think it is worth
using in an application. I guess that all applications/libraries will
want to keep compatibility with Python 3.4 and older, and so will have
to duplicate the code to use os.listdir() + os.stat() anyway. Is it
worth duplicating code for such a small speedup?
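The duplication in question would look roughly like this fallback
pattern (a sketch; the helper name is mine):

```python
import os
import stat

try:
    from os import scandir  # Python 3.5+ (PEP 471)

    def iter_subdirs(path):
        # Fast path: DirEntry.is_dir() usually avoids an extra stat().
        for entry in scandir(path):
            if entry.is_dir():
                yield entry.name
except ImportError:
    def iter_subdirs(path):
        # Fallback for Python 3.4 and older: one stat() per entry.
        for name in os.listdir(path):
            if stat.S_ISDIR(os.stat(os.path.join(path, name)).st_mode):
                yield name
```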

Now I see 3 choices:

- take the full C implementation, because it's much faster (at least
3.5x faster!)
- reject the whole PEP 471 (not nice), because it adds too much code
for a minor speedup (not true on Windows: up to 44x faster!)
- take the C+Python implementation, because maintenance matters more
than performance (only 1.3x faster, sorry)

=> IMO the best option is to take the C implementation. What do you think?


I'm concerned by the length of the C code: the full C implementation
adds ~800 lines of C code to posixmodule.c. This file is already the
longest C file in CPython. I don't want to make it longer, but I'm not
motivated to start splitting it. Last time I proposed splitting a file
(unicodeobject.c), some developers complained that it makes searching
harder. I don't understand this; there are so many tools to navigate
C code. But it was enough for me to give up on the idea.

An alternative is to add a new _scandir.c module to host the new C
code, and share some code with posixmodule.c by removing the "static"
keyword from the required C functions (the functions that convert
Windows attributes to an os.stat_result object). That's a reasonable
choice. What do you think?


FYI I ran the benchmark on different hardware (SSD, HDD, tmpfs), file
systems (ext4, tmpfs, NFS/ext4), operating systems (Linux, Windows).

Victor
