[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Mon Jun 30 19:05:54 CEST 2014

> So, here's my alternative proposal: add an "ensure_lstat" flag to
> scandir() itself, and don't have *any* methods on DirEntry, only
> attributes.
>
> That would make the DirEntry attributes:
>
>     is_dir: boolean, always populated
>     is_file: boolean, always populated
>     is_symlink: boolean, always populated
>     lstat_result: stat result, may be None on POSIX systems if
> ensure_lstat is False
>
> (I'm not particularly sold on "lstat_result" as the name, but "lstat"
> reads as a verb to me, so doesn't sound right as an attribute name)
>
> What this would allow:
>
> - by default, scanning is efficient everywhere, but lstat_result may
> be None on POSIX systems
> - if you always need the lstat result, setting "ensure_lstat" will
> trigger the extra system call implicitly
> - if you only sometimes need the stat result, you can call os.lstat()
> explicitly when the DirEntry lstat_result attribute is None
>
> Most importantly, *regardless of platform*, the cached stat result (if
> not None) would reflect the state of the entry at the time the
> directory was scanned, rather than at some arbitrary later point in
> time when lstat() was first called on the DirEntry object.
>
> There'd still be a slight window of discrepancy (since the filesystem
> state may change between reading the directory entry and making the
> lstat() call), but this could be effectively eliminated from the
> perspective of the Python code by making the result of the lstat()
> call authoritative for the whole DirEntry object.

Yeah, I quite like this. It does make the caching more explicit and
consistent. It's slightly annoying that it's less like pathlib.Path
now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't
matter. The differences in naming may highlight the difference in
caching, so maybe it's a good thing.
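
To check that I've understood the proposal, here is a rough pure-Python
emulation of it. This is only a sketch: the names scandir_sketch and the
namedtuple-based DirEntry are made up for illustration, and because it is
built on os.listdir() it has to call os.lstat() for every entry, which is
exactly the cost the real function would avoid. It only shows the shape
of the attribute-only API and the ensure_lstat flag.

```python
import os
import stat
from collections import namedtuple

# Hypothetical stand-in for the proposed DirEntry: attributes only, no methods.
DirEntry = namedtuple("DirEntry", "name is_dir is_file is_symlink lstat_result")


def scandir_sketch(path=".", ensure_lstat=False):
    """Emulate the proposed API on top of os.listdir(); illustration only."""
    for name in os.listdir(path):
        # A real implementation would take the entry type from the directory
        # read itself. This emulation cannot, so it always calls lstat().
        st = os.lstat(os.path.join(path, name))
        yield DirEntry(
            name=name,
            is_dir=stat.S_ISDIR(st.st_mode),
            is_file=stat.S_ISREG(st.st_mode),
            is_symlink=stat.S_ISLNK(st.st_mode),
            # Assumption: behave like the POSIX case, where the stat result
            # is only populated when the caller asks for it.
            lstat_result=st if ensure_lstat else None,
        )
```

With ensure_lstat=True a caller could read entry.lstat_result.st_size
directly; with the default, lstat_result is None and the caller falls
back to an explicit os.lstat() when needed.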

Two further questions from me:

1) How does error handling work? Now os.stat() will/may be called
during iteration, so in __next__. But it's hard to catch errors because
you don't call __next__ explicitly. Is this a problem? How do other
iterators that make system calls or raise errors handle this?
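
For what it's worth, the only general workaround I know of is to drive
the iterator by hand instead of using a for loop. The sketch below shows
that pattern. FlakyScan and iter_with_errors are hypothetical names, and
FlakyScan just simulates a scan whose __next__ fails on one entry; it is
not how scandir() would be implemented.

```python
class FlakyScan:
    """Hypothetical iterator whose __next__ raises OSError for one entry."""

    def __init__(self, names):
        self._names = iter(names)

    def __iter__(self):
        return self

    def __next__(self):
        name = next(self._names)  # StopIteration propagates when exhausted
        if name == "bad":
            raise OSError("simulated lstat failure for %r" % name)
        return name


def iter_with_errors(iterator, on_error):
    """Yield items from iterator, passing any OSError from next() to on_error."""
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return
        except OSError as exc:
            # Only useful if the iterator can continue after raising. A
            # generator cannot: once it raises, it is finished.
            on_error(exc)
            continue
        yield item


errors = []
names = list(iter_with_errors(FlakyScan(["a", "bad", "c"]), errors.append))
```

This works, but it is clumsy enough that I wouldn't want it to be the
expected way to use scandir().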

2) There's still the open question in the PEP of whether to include a
way to access the full path. This is cheap to build, it has to be
built anyway on POSIX systems, and it's quite useful for further
operations on the file. I think the best way to handle this is a
.fullname or .full_name attribute as suggested elsewhere. Thoughts?
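
As a small illustration of why the full path is handy, here is roughly
what callers end up writing without it. The full_name() helper is
hypothetical and stands in for the suggested attribute; total_size() is
just an example consumer built on os.listdir().

```python
import os


def full_name(scan_path, name):
    # Hypothetical helper: the value a .full_name attribute might hold.
    return os.path.join(scan_path, name)


def total_size(scan_path):
    """Sum the lstat() sizes of all entries directly inside scan_path."""
    total = 0
    for name in os.listdir(scan_path):
        # Almost any follow-up operation needs the joined path, not the name.
        total += os.lstat(full_name(scan_path, name)).st_size
    return total
```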

-Ben