[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Fri Jun 27 09:44:17 CEST 2014

Hi,

You wrote a great PEP Ben, thanks :-) But it's now time  for comments!

> But the underlying system calls -- ``FindFirstFile`` /
> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

You should add a link to FindFirstFile doc:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%29.aspx

It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic stat_result recent addition: the new
stat_result.file_attributes field. Add DirEntry.file_attributes which
would only be available on Windows.

The Windows structure also contains

  FILETIME ftCreationTime;
  FILETIME ftLastAccessTime;
  FILETIME ftLastWriteTime;
  DWORD    nFileSizeHigh;
  DWORD    nFileSizeLow;

It would be nice to expose them as well. I'm  no more surprised that
the exact API is different depending on the OS for functions of the os
module.

> * Instead of bare filename strings, it returns lightweight
>   ``DirEntry`` objects that hold the filename string and provide
>   simple methods that allow access to the stat-like data the operating
>   system returned.

Does your implementation uses a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.

> ``scandir()`` yields a ``DirEntry`` object for each file and directory
> in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
> pseudo-directories are skipped, and the entries are yielded in
> system-dependent order. Each ``DirEntry`` object has the following
> attributes and methods:

Does it support also bytes filenames on UNIX?

Python now supports undecodable filenames thanks to the PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.

> The ``DirEntry`` attribute and method names were chosen to be the same
> as those in the new ``pathlib`` module for consistency.

Great! That's exactly what I expected :-) Consistency with other modules.

> Notes on caching
> ----------------
>
> The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
> is obviously always cached, and the ``is_X`` and ``lstat`` methods
> cache their values (immediately on Windows via ``FindNextFile``, and
> on first use on Linux / OS X via a ``stat`` call) and never refetch
> from the system.
>
> For this reason, ``DirEntry`` objects are intended to be used and
> thrown away after iteration, not stored in long-lived data structured
> and the methods called again and again.
>
> If a user wants to do that (for example, for watching a file's size
> change), they'll need to call the regular ``os.lstat()`` or
> ``os.path.getsize()`` functions which force a new system call each
> time.

Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full  stat_result object.

> Or, for getting the total size of files in a directory tree -- showing
> use of the ``DirEntry.lstat()`` method::
>
>     def get_tree_size(path):
>         """Return total size of files in path and subdirs."""
>         size = 0
>         for entry in scandir(path):
>             if entry.is_dir():
>                 sub_path = os.path.join(path, entry.name)
>                 size += get_tree_size(sub_path)
>             else:
>                 size += entry.lstat().st_size
>         return size
>
> Note that ``get_tree_size()`` will get a huge speed boost on Windows,
> because no extra stat call are needed, but on Linux and OS X the size
> information is not returned by the directory iteration functions, so
> this function won't gain anything there.

I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat(). If you know that it's not a symlink, you
already know the size, but you still have to call stat() to retrieve
all fields required to build a stat_result no?

> Support
> =======
>
> The scandir module on GitHub has been forked and used quite a bit (see
> "Use in the wild" in this PEP),

Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?

> Should scandir be in its own module?
> ------------------------------------
>
> Should the function be included in the standard library in a new
> module, ``scandir.scandir()``, or just as ``os.scandir()`` as
> discussed? The preference of this PEP's author (Ben Hoyt) would be
> ``os.scandir()``, as it's just a single function.

Yes, put it in the os module which is already bloated :-)

> Should there be a way to access the full path?
> ----------------------------------------------
>
> Should ``DirEntry``'s have a way to get the full path without using
> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
> and it may be useful to add pathlib-like ``str(entry)`` functionality.
> This functionality has also been requested in `issue 13`_ on GitHub.
>
> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13

I think that it would be very convinient to store the directory name
in the DirEntry. It should be light, it's just a reference.

And provide a fullname() name which would just return
os.path.join(path, entry.name) without trying to resolve path to get
an absolute path.

> Should it expose Windows wildcard functionality?
> ------------------------------------------------
>
> Should ``scandir()`` have a way of exposing the wildcard functionality
> in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
> scandir module on GitHub exposes this as a ``windows_wildcard``
> keyword argument, allowing Windows power users the option to pass a
> custom wildcard to ``FindFirstFile``, which may avoid the need to use
> ``fnmatch`` or similar on the resulting names. It is named the
> unwieldly ``windows_wildcard`` to remind you you're writing power-
> user, Windows-only code if you use it.
>
> This boils down to whether ``scandir`` should be about exposing all of
> the system's directory iteration features, or simply providing a fast,
> simple, cross-platform directory iteration API.

Would it be hard to implement the wildcard feature on UNIX to compare
performances of scandir('*.jpg') with and without the wildcard built
in os.scandir?

I implemented it in C for the tracemalloc module (Filter object):
http://hg.python.org/features/tracemalloc

Get the revision 69fd2d766005 and search match_filename_joker() in
Modules/_tracemalloc.c. The function matchs the filename backward
because it most cases, the last latter is enough to reject a filename
(ex: "*.jpg" => reject filenames not ending with "g").

The filename is normalized before matching the pattern: converted to
lowercase and / is replaced with \ on Windows.

It was decided to drop the Filter object to keep the tracemalloc
module as simple as possible. Charles-François was not convinced by
the speedup.

But tracemalloc case is different because the OS didn't provide an API for that.

Victor