[Python-Dev] Updates to PEP 471, the os.scandir() proposal

Victor Stinner victor.stinner at gmail.com
Thu Jul 10 02:15:58 CEST 2014


2014-07-09 17:29 GMT+02:00 Ben Hoyt <benhoyt at gmail.com>:
>> Would this not "break" the tree size script being discussed in the
>> other thread, as it would follow links and include linked directories
>> in the "size" of the tree?

The get_tree_size() function in the PEP would use: "if not
entry.is_symlink() and entry.is_dir():".

Note: I first wrote "if entry.is_dir() and not entry.is_symlink():",
but that order is slower on Linux because is_dir() has to call
lstat().

Adding an optional keyword to DirEntry.is_dir() would make it possible
to write "if entry.is_dir(follow_symlink=False)", but it looks like a
micro-optimization and, as I said, I prefer to stick to the
pathlib.Path API (which was already heavily discussed in its PEP).
Anyway, this case is rare (I explain why below), so we should not
worry too much about it.
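
A minimal sketch of the check described above, assuming the
os.scandir() API as it later shipped in the standard library (entry.path
and entry.stat(follow_symlinks=False) come from that released API and
may differ from the draft under discussion here):

```python
import os


def get_tree_size(path):
    """Return the total size in bytes of the entries under path.

    Symlinks are never followed: a symlink to a directory is counted
    as a single entry instead of being traversed.
    """
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # Test is_symlink() first, so that is_dir() is only asked
            # about entries that are not symlinks.
            if not entry.is_symlink() and entry.is_dir():
                total += get_tree_size(entry.path)
            else:
                # Size of the entry itself, not of a symlink target.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

Calling get_tree_size(".") sums the current directory tree; a symlink
contributes only its own size, much like du without -L.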

> Yeah, I agree. Victor -- I don't think the DirEntry is_X() methods (or
> attributes) should mimic the link-following os.path.isdir() at all.
> You want the type of the entry, not the type of the source.

On UNIX, a symlink to a directory is expected to behave like a
directory. For example, in a file browser, you should enter the
linked directory when you click on a symlink to a directory.

There are only a few cases where you want to handle symlinks
differently: archiving (e.g. tar), computing the size of a directory
(e.g. du does not follow symlinks by default, du -L follows them), and
removing a directory.

You should do a short poll of the Python stdlib and of the Internet to
see which check is the most common.

Examples of the Python stdlib:

- zipfile: listdir + os.path.isdir
- pkgutil: listdir + os.path.isdir
- unittest.loader: listdir + os.path.isdir and os.path.isfile
- http.server: listdir + os.path.isdir, it also uses os.path.islink:
  "Append / for directories or @ for symbolic links"
- idlelib.GrepDialog: listdir + os.path.isdir
- compileall: listdir + os.path.isdir and "os.path.isdir(fullname) and
  not os.path.islink(fullname)" <= don't follow symlinks to directories
- shutil (copytree): listdir + os.path.isdir + os.path.islink
- shutil (rmtree): listdir + os.lstat() + stat.S_ISDIR(mode) <= don't
  follow symlinks to directories
- mailbox: listdir + os.path.isdir
- tabnanny: listdir + os.path.isdir
- os.walk: listdir + os.path.isdir + os.path.islink <= don't follow
  symlinks to directories by default, but the behaviour is configurable
  ... but symlinks to directories are added to the "dirs" list (not all
  symlinks, only symlinks to directories)
- setup.py: listdir + os.path.isfile

In this list of 12 examples, only compileall, shutil.rmtree and
os.walk use a symlink check to avoid following symlinks to
directories. compileall starts by checking
"if not os.path.isdir(fullname):", which follows symlinks. os.walk()
starts by checking "if os.path.isdir(name):", which follows symlinks.
So I consider that only one case out of 12 (8.3%), shutil.rmtree,
doesn't follow symlinks.

If entry.is_dir() doesn't follow symlinks, the other 91.7% will need
to be modified to use "if entry.is_dir() or (entry.is_symlink() and
os.path.isdir(entry.full_name)):" to keep the same behaviour :-(
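
To make the cost of that workaround concrete, here is a rough sketch
written against the os.scandir() API as it was eventually released. It
assumes that is_dir(follow_symlinks=False) stands in for the
non-following is_dir() being debated, and that entry.path plays the
role of the draft's full_name; is_dir_workaround() and list_dirs() are
hypothetical helper names used only for illustration:

```python
import os


def is_dir_workaround(entry):
    """Emulate a link-following check on top of a non-following is_dir()."""
    # First part: the entry itself is a directory (symlinks not followed).
    # Second part: the entry is a symlink whose target is a directory.
    return entry.is_dir(follow_symlinks=False) or (
        entry.is_symlink() and os.path.isdir(entry.path)
    )


def list_dirs(path):
    """Return the sorted names of entries that behave like directories."""
    with os.scandir(path) as entries:
        return sorted(e.name for e in entries if is_dir_workaround(e))
```

Every caller that today relies on listdir + os.path.isdir would need
something like is_dir_workaround() to keep treating symlinks to
directories as directories.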

> Otherwise, as Paul says, you are essentially forced to follow links,
> and os.walk(followlinks=False), which is the default, can't do the
> right thing.

os.walk() and get_tree_size() are good users of scandir(), but they
are recursive functions. That means you may want to handle symlinks
differently; os.walk(), for example, lets you choose whether or not to
follow symlinks.

Recursive functions are rare. The most common case is to list the
files of a single directory and then filter them on various criteria
(is it a file? is it a directory? does the file name match? ...). In
such a use case, you don't "care" about symlinks (you want to follow
them).
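
A small sketch of that common, non-recursive case, again assuming the
os.scandir() API as later released, where is_file() follows symlinks
by default; list_matching_files() is a hypothetical helper name:

```python
import fnmatch
import os


def list_matching_files(path, pattern="*"):
    """Return sorted names of files in path whose name matches pattern.

    Symlinks are followed, so a symlink to a regular file is listed
    like any other file.
    """
    with os.scandir(path) as entries:
        return sorted(
            entry.name
            for entry in entries
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern)
        )
```

For example, list_matching_files(".", "*.py") lists the Python files of
the current directory without descending into subdirectories.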

Victor

