[Python-checkins] peps: PEP 471: Ben Hoyt updates
victor.stinner
python-checkins at python.org
Fri Jul 18 18:30:50 CEST 2014
http://hg.python.org/peps/rev/40a6c3b54559
changeset: 5499:40a6c3b54559
user: Victor Stinner <victor.stinner at gmail.com>
date: Fri Jul 18 18:25:41 2014 +0200
summary:
PEP 471: Ben Hoyt updates
files:
pep-0471.txt | 318 +++++++++++++++++++++++++-------------
1 files changed, 205 insertions(+), 113 deletions(-)
diff --git a/pep-0471.txt b/pep-0471.txt
--- a/pep-0471.txt
+++ b/pep-0471.txt
@@ -8,7 +8,7 @@
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
-Post-History: 27-Jun-2014, 8-Jul-2014
+Post-History: 27-Jun-2014, 8-Jul-2014, 14-Jul-2014, 18-Jul-2014
Abstract
@@ -16,9 +16,9 @@
This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
-useful functionality and increases the speed of ``os.walk()`` by 2-10
-times (depending on the platform and file system) by significantly
-reducing the number of times ``stat()`` needs to be called.
+useful functionality and increases the speed of ``os.walk()`` by 2-20
+times (depending on the platform and file system) by avoiding calls to
+``os.stat()`` in most cases.
Rationale
@@ -34,8 +34,8 @@
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. Further, the Windows system calls
-return all the information for a ``stat_result`` object, such as file
-size and last modification time.
+return all the information for a ``stat_result`` object on the directory
+entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
@@ -56,7 +56,7 @@
memory efficiency for iterating very large directories.
So, as well as providing a ``scandir()`` iterator function for calling
-directly, Python's existing ``os.walk()`` function could be sped up a
+directly, Python's existing ``os.walk()`` function can be sped up a
huge amount.
.. _`Issue 11406`: http://bugs.python.org/issue11406
@@ -67,7 +67,8 @@
The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
-module). It lives on GitHub at `benhoyt/scandir`_.
+module). It lives on GitHub at `benhoyt/scandir`_. (The implementation
+may lag behind the updates to this PEP a little.)
.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir
@@ -82,67 +83,83 @@
Specifics of proposal
=====================
+os.scandir()
+------------
+
Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::
-    scandir(path='.') -> generator of DirEntry objects
+    scandir(directory='.') -> generator of DirEntry objects
Like ``listdir``, ``scandir`` calls the operating system's directory
-iteration system calls to get the names of the files in the ``path``
-directory, but it's different from ``listdir`` in two ways:
+iteration system calls to get the names of the files in the given
+``directory``, but it's different from ``listdir`` in two ways:
* Instead of returning bare filename strings, it returns lightweight
``DirEntry`` objects that hold the filename string and provide
simple methods that allow access to the additional data the
- operating system returned.
+ operating system may have returned.
* It returns a generator instead of a list, so that ``scandir`` acts
as a true iterator instead of returning the full list immediately.
-``scandir()`` yields a ``DirEntry`` object for each file and directory
-in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
-pseudo-directories are skipped, and the entries are yielded in
-system-dependent order. Each ``DirEntry`` object has the following
-attributes and methods:
+``scandir()`` yields a ``DirEntry`` object for each file and
+sub-directory in ``directory``. Just like ``listdir``, the ``'.'``
+and ``'..'`` pseudo-directories are skipped, and the entries are
+yielded in system-dependent order. Each ``DirEntry`` object has the
+following attributes and methods:
-* ``name``: the entry's filename, relative to the ``path`` argument
- (corresponds to the return values of ``os.listdir``)
+* ``name``: the entry's filename, relative to the ``directory``
+ argument (corresponds to the return values of ``os.listdir``)
-* ``full_name``: the entry's full path name -- the equivalent of
- ``os.path.join(path, entry.name)``
+* ``path``: the entry's full path name (not necessarily an absolute
+ path) -- the equivalent of ``os.path.join(directory, entry.name)``
-* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
- requires a system call on Windows, and usually doesn't on POSIX
- systems
+* ``is_dir(*, follow_symlinks=True)``: similar to
+ ``pathlib.Path.is_dir()``, but the return value is cached on the
+ ``DirEntry`` object; doesn't require a system call in most cases;
+ doesn't follow symbolic links if ``follow_symlinks`` is False
-* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
- never requires a system call on Windows, and usually doesn't on
- POSIX systems
+* ``is_file(*, follow_symlinks=True)``: similar to
+ ``pathlib.Path.is_file()``, but the return value is cached on the
+ ``DirEntry`` object; doesn't require a system call in most cases;
+ doesn't follow symbolic links if ``follow_symlinks`` is False
-* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
- never requires a system call on Windows, and usually doesn't on
- POSIX systems
+* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
+ return value is cached on the ``DirEntry`` object; doesn't require a
+ system call in most cases
-* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
- -- it only requires a system call on POSIX systems
+* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
+ return value is cached on the ``DirEntry`` object; does not require a
+ system call on Windows (except for symlinks); doesn't follow symbolic links
+ (like ``os.lstat()``) if ``follow_symlinks`` is False
-The ``is_X`` methods may perform a ``stat()`` call under certain
-conditions (for example, on certain file systems on POSIX systems),
-and therefore possibly raise ``OSError``. The ``lstat()`` method will
-call ``stat()`` on POSIX systems and therefore also possibly raise
-``OSError``. See the "Notes on exception handling" section for more
-details.
+All *methods* may perform system calls in some cases and therefore
+possibly raise ``OSError`` -- see the "Notes on exception handling"
+section for more details.
The ``DirEntry`` attribute and method names were chosen to be the same
-as those in the new ``pathlib`` module for consistency.
+as those in the new ``pathlib`` module where possible, for
+consistency. The only difference in functionality is that the
+``DirEntry`` methods cache their values on the entry object after the
+first call.
Like the other functions in the ``os`` module, ``scandir()`` accepts
-either a bytes or str object for the ``path`` parameter, and returns
-the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
-same type as ``path``. However, it is *strongly recommended* to use
-the str type, as this ensures cross-platform support for Unicode
-filenames.
+either a bytes or str object for the ``directory`` parameter, and
+returns the ``DirEntry.name`` and ``DirEntry.path`` attributes with
+the same type as ``directory``. However, it is *strongly recommended*
+to use the str type, as this ensures cross-platform support for
+Unicode filenames. (On Windows, bytes filenames have been deprecated
+since Python 3.3).
+
+os.walk()
+---------
+
+As part of this proposal, ``os.walk()`` will also be modified to use
+``scandir()`` rather than ``listdir()`` and ``os.path.isdir()``. This
+will increase the speed of ``os.walk()`` very significantly (as
+mentioned above, by 2-20 times, depending on the system).
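As a rough illustration of that change (a sketch only, not the actual patch to ``os.walk()``), a top-down walk built on ``scandir()`` might look like the following. The name ``walk_sketch`` is hypothetical, and the ``onerror``, ``followlinks`` and bottom-up options of the real function are left out.

```python
import os


def walk_sketch(top):
    """Yield (dirpath, dirnames, filenames) like a top-down os.walk().

    Hypothetical helper for illustration only: no onerror callback,
    no followlinks option, no bottom-up mode.
    """
    dirs = []
    non_dirs = []
    for entry in os.scandir(top):
        # is_dir() follows symlinks by default, matching os.path.isdir(),
        # and usually needs no extra system call.
        if entry.is_dir():
            dirs.append(entry.name)
        else:
            non_dirs.append(entry.name)
    yield top, dirs, non_dirs
    for name in dirs:
        new_path = os.path.join(top, name)
        # Like os.walk() with followlinks=False: list symlinked
        # directories, but do not descend into them.
        if not os.path.islink(new_path):
            yield from walk_sketch(new_path)
```

The ``os.path.islink()`` check still costs one call per sub-directory; an implementation that kept the ``DirEntry`` objects around could presumably use ``entry.is_symlink()`` there instead.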
Examples
@@ -154,7 +171,7 @@
     dirs = []
     non_dirs = []
-    for entry in os.scandir(path):
+    for entry in os.scandir(directory):
         if entry.is_dir():
             dirs.append(entry)
         else:
@@ -165,19 +182,25 @@
and POSIX systems.
Or, for getting the total size of files in a directory tree, showing
-use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
+use of the ``DirEntry.stat()`` method and ``DirEntry.path``
attribute::
-    def get_tree_size(path):
-        """Return total size of files in path and subdirs."""
+    def get_tree_size(directory):
+        """Return total size of files in directory and subdirs."""
         total = 0
-        for entry in os.scandir(path):
-            if entry.is_dir():
-                total += get_tree_size(entry.full_name)
+        for entry in os.scandir(directory):
+            if entry.is_dir(follow_symlinks=False):
+                total += get_tree_size(entry.path)
             else:
-                total += entry.lstat().st_size
+                total += entry.stat(follow_symlinks=False).st_size
         return total
+This also shows the use of the ``follow_symlinks`` parameter to
+``is_dir()`` -- in a recursive function like this, we probably don't
+want to follow links. (To properly follow links in a recursive
+function like this we'd want special handling for the case where
+following a symlink leads to a recursive loop.)
+
Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
@@ -188,10 +211,10 @@
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` and
-``full_name`` attributes are obviously always cached, and the ``is_X``
-and ``lstat`` methods cache their values (immediately on Windows via
+``path`` attributes are obviously always cached, and the ``is_X``
+and ``stat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
-call) and never refetch from the system.
+system call) and never refetch from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
@@ -199,50 +222,61 @@
If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
-or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
+or call the regular ``os.stat()`` or ``os.path.getsize()`` functions
which get fresh data from the operating system every call.
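The caching behaviour described above can be seen in a small experiment. This is a sketch that assumes a Python version providing ``os.scandir()`` (3.5 or later); the file name ``data.bin`` and the variable names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "data.bin")
    with open(file_path, "wb") as f:
        f.write(b"abc")

    # Exhaust the iterator and keep the single DirEntry it produced.
    (entry,) = list(os.scandir(tmp))
    size_before = entry.stat().st_size  # fetched and cached on the entry

    # Grow the file after the entry has cached its stat result.
    with open(file_path, "ab") as f:
        f.write(b"defgh")

    cached_size = entry.stat().st_size        # served from the cache
    fresh_size = os.stat(entry.path).st_size  # asks the OS again
```

Here ``cached_size`` stays at the original 3 bytes while ``fresh_size`` reflects the 8 bytes now on disk, which is why a ``DirEntry`` is a poor fit for long-lived data structures.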
Notes on exception handling
---------------------------
-``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
+``DirEntry.is_X()`` and ``DirEntry.stat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
-not be cheap operations, and they may do a system call. As a result,
-these methods may raise ``OSError``.
+not be cheap operations (although they often are), and they may do a
+system call. As a result, these methods may raise ``OSError``.
-For example, ``DirEntry.lstat()`` will always make a system call on
+For example, ``DirEntry.stat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
-``stat()`` system call on such systems if ``readdir()`` returns a
-``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
-certain conditions or on certain file systems.
+``stat()`` system call on such systems if ``readdir()`` does not
+support ``d_type`` or returns a ``d_type`` with a value of
+``DT_UNKNOWN``, which can occur under certain conditions or on
+certain file systems.
-For this reason, when a user requires fine-grained error handling,
-it's good to catch ``OSError`` around these method calls and then
-handle as appropriate.
+Often this does not matter -- for example, ``os.walk()`` as defined in
+the standard library only catches errors around the ``listdir()``
+calls.
+
+Also, because the exception-raising behaviour of the ``DirEntry.is_X``
+methods matches that of ``pathlib`` -- which only raises ``OSError``
+in the case of permissions or other fatal errors, but returns False
+if the path doesn't exist or is a broken symlink -- it's often
+not necessary to catch errors around the ``is_X()`` calls.
+
+However, when a user requires fine-grained error handling, it may be
+desirable to catch ``OSError`` around all method calls and handle as
+appropriate.
For example, below is a version of the ``get_tree_size()`` example
-shown above, but with basic error handling added::
+shown above, but with fine-grained error handling added::
-    def get_tree_size(path):
-        """Return total size of files in path and subdirs. If
-        is_dir() or lstat() fails, print an error message to stderr
+    def get_tree_size(directory):
+        """Return total size of files in directory and subdirs. If
+        is_dir() or stat() fails, print an error message to stderr
         and assume zero size (for example, file has been deleted).
         """
         total = 0
-        for entry in os.scandir(path):
+        for entry in os.scandir(directory):
             try:
-                is_dir = entry.is_dir()
+                is_dir = entry.is_dir(follow_symlinks=False)
             except OSError as error:
                 print('Error calling is_dir():', error, file=sys.stderr)
                 continue
             if is_dir:
-                total += get_tree_size(entry.full_name)
+                total += get_tree_size(entry.path)
             else:
                 try:
-                    total += entry.lstat().st_size
+                    total += entry.stat(follow_symlinks=False).st_size
                 except OSError as error:
-                    print('Error calling stat():', error, file=sys.stderr)
+                    print('Error calling stat():', error, file=sys.stderr)
         return total
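The distinction drawn above, where the ``is_X()`` methods return False for a broken symlink while ``stat()`` raises, can be checked with a small helper. This is a sketch under the assumption that the platform allows creating symlinks; ``describe_entries`` is a hypothetical name, not part of the proposal.

```python
import os


def describe_entries(directory):
    """Return {name: (is_dir, is_file, is_symlink, stat_ok)}.

    Hypothetical helper for illustration: stat_ok is False when
    entry.stat() (which follows symlinks by default) raises OSError.
    """
    result = {}
    for entry in os.scandir(directory):
        try:
            entry.stat()
            stat_ok = True
        except OSError:
            stat_ok = False
        # The is_X() calls need no try/except for a dangling link:
        # they report False instead of raising.
        result[entry.name] = (entry.is_dir(), entry.is_file(),
                              entry.is_symlink(), stat_ok)
    return result
```

For a directory containing a dangling symlink, the corresponding tuple should be ``(False, False, True, False)``: only ``is_symlink()`` is true, and only the ``stat()`` call needed the ``try``/``except``.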
@@ -316,6 +350,12 @@
Seems pretty solid, so first thing, just want to say nice work!"
[via personal email]
+* Matt Z: "I used scandir to dump the contents of a network dir in
+ under 15 seconds. 13 root dirs, 60,000 files in the structure. This
+ will replace some old VBA code embedded in a spreadsheet that was
+ taking 15-20 minutes to do the exact same thing." [via personal
+ email]
+
Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.
@@ -331,13 +371,11 @@
* Forks: 20
* Issues: 4 open, 26 closed
-**However, the much larger point is this:**, if this PEP is accepted,
-``os.walk()`` can easily be reimplemented using ``scandir`` rather
-than ``listdir`` and ``stat``, increasing the speed of ``os.walk()``
-very significantly. There are thousands of developers, scripts, and
-production code that would benefit from this large speedup of
-``os.walk()``. For example, on GitHub, there are almost as many uses
-of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).
+Also, because this PEP will increase the speed of ``os.walk()``
+significantly, there are thousands of developers and scripts, and a lot
+of production code, that would benefit from it. For example, on GitHub,
+there are almost as many uses of ``os.walk`` (194,000) as there are of
+``os.mkdir`` (230,000).
Rejected ideas
@@ -392,12 +430,51 @@
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
+Methods not following symlinks by default
+-----------------------------------------
+
+There was much debate on python-dev (see messages in `this thread
+<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_)
+over whether the ``DirEntry`` methods should follow symbolic links or
+not (when the ``is_X()`` methods had no ``follow_symlinks`` parameter).
+
+Initially they did not (see previous versions of this PEP and the
+scandir.py module), but Victor Stinner made a pretty compelling case on
+python-dev that following symlinks by default is a better idea, because:
+
+* following links is usually what you want (in 92% of cases in the
+ standard library, functions using ``os.listdir()`` and
+ ``os.path.isdir()`` do follow symlinks)
+
+* that's the precedent set by the similar functions
+ ``os.path.isdir()`` and ``pathlib.Path.is_dir()``, so to do
+ otherwise would be confusing
+
+* with the non-link-following approach, if you wanted to follow links
+ you'd have to say something like ``if (entry.is_symlink() and
+ os.path.isdir(entry.path)) or entry.is_dir()``, which is clumsy
+
+As a case in point that shows the non-symlink-following version is
+error prone, this PEP's author had a bug caused by getting this
+exact test wrong in his initial implementation of ``scandir.walk()``
+in scandir.py (see `Issue #4 here
+<https://github.com/benhoyt/scandir/issues/4>`_).
+
+In the end there was not total agreement that the methods should
+follow symlinks, but there was basic consensus among the most involved
+participants, and this PEP's author believes that the above case is
+strong enough to warrant following symlinks by default.
+
+In addition, it's straightforward to call the relevant methods with
+``follow_symlinks=False`` if the other behaviour is desired.
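To make the comparison concrete, the sketch below puts the link-following default, the ``follow_symlinks=False`` variant, and the "clumsy" expression quoted above side by side. The helper names ``is_dir_clumsy`` and ``classify`` are hypothetical, and the example assumes a platform where symlinks can be created.

```python
import os


def is_dir_clumsy(entry):
    """Link-following directory test written the 'clumsy' way.

    Mirrors the expression quoted above, using only non-following
    DirEntry calls plus os.path.isdir().
    """
    return ((entry.is_symlink() and os.path.isdir(entry.path))
            or entry.is_dir(follow_symlinks=False))


def classify(directory):
    """Return {name: (follows, no_follow, clumsy)} for each entry."""
    return {
        entry.name: (entry.is_dir(),
                     entry.is_dir(follow_symlinks=False),
                     is_dir_clumsy(entry))
        for entry in os.scandir(directory)
    }
```

For a symlink that points at a directory, the first and third values agree (both True) while the non-following call returns False, so the default gives the link-following answer without the extra expression.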
+
+
DirEntry attributes being properties
------------------------------------
In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
-``lstat()`` to be properties instead of methods, to indicate they're
-very cheap or free. However, this isn't quite the case, as ``lstat()``
+``stat()`` to be properties instead of methods, to indicate they're
+very cheap or free. However, this isn't quite the case, as ``stat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
@@ -422,8 +499,8 @@
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
-``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
-only present on Windows. The idea was to use this simpler, lower-level
+``name``, ``path``, and ``is_X``, with the ``st_X`` attributes only
+present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.
At first there was general agreement that simplifying in this way was
@@ -459,19 +536,24 @@
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::
-    it = os.scandir(path)
+    it = os.scandir(directory)
     while True:
         try:
             entry = next(it)
         except OSError as error:
-            handle_error(path, error)
+            handle_error(directory, error)
         except StopIteration:
             break
Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
-Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.
+Pythonic as ``try``/``except`` around a ``DirEntry.stat()`` call.
+
+Another drawback is that ``os.scandir()`` is written to make code faster.
+Always calling ``os.lstat()`` on POSIX would not bring any speedup. In most
+cases, you don't need the full ``stat_result`` object -- the ``is_X()``
+methods are enough and this information is already known.
See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
@@ -513,7 +595,7 @@
--------------------------------------------------
Another alternative discussed was making the return values to be
-overloaded ``stat_result`` objects with ``name`` and ``full_name``
+overloaded ``stat_result`` objects with ``name`` and ``path``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
@@ -526,15 +608,15 @@
With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
-``lstat()`` functions are explicitly not cached, whereas ``scandir``
+``stat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.
And if the ``pathlib.Path`` instances returned by ``scandir`` cached
-lstat values, but the ordinary ``pathlib.Path`` objects explicitly
+stat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.
-Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
+Guido van Rossum explicitly rejected ``pathlib.Path`` caching stat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
@@ -564,35 +646,45 @@
Previous discussion
===================
-* `Original thread Ben Hoyt started on python-ideas`_ about speeding
- up ``os.walk()``
+* `Original November 2012 thread Ben Hoyt started on python-ideas
+ <https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
+ about speeding up ``os.walk()``
* Python `Issue 11406`_, which includes the original proposal for a
scandir-like function
-* `Further thread Ben Hoyt started on python-dev`_ that refined the
- ``scandir()`` API, including Nick Coghlan's suggestion of scandir
- yielding ``DirEntry``-like objects
+* `Further May 2013 thread Ben Hoyt started on python-dev
+ <https://mail.python.org/pipermail/python-dev/2013-May/126119.html>`_
+ that refined the ``scandir()`` API, including Nick Coghlan's
+ suggestion of scandir yielding ``DirEntry``-like objects
-* `Another thread Ben Hoyt started on python-dev`_ to discuss the
- interaction between scandir and the new ``pathlib`` module
+* `November 2013 thread Ben Hoyt started on python-dev
+ <https://mail.python.org/pipermail/python-dev/2013-November/130572.html>`_
+ to discuss the interaction between scandir and the new ``pathlib``
+ module
-* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
- version of this PEP, with extensive discussion about the API.
+* `June 2014 thread Ben Hoyt started on python-dev
+ <https://mail.python.org/pipermail/python-dev/2014-June/135215.html>`_
+ to discuss the first version of this PEP, with extensive discussion
+ about the API
-* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
- pointers on how to fix it (this inspired the author of this PEP
- early on)
+* `First July 2014 thread Ben Hoyt started on python-dev
+ <https://mail.python.org/pipermail/python-dev/2014-July/135377.html>`_
+ to discuss his updates to PEP 471
-* `BetterWalk`_, this PEP's author's previous attempt at this, on
- which the scandir code is based
+* `Second July 2014 thread Ben Hoyt started on python-dev
+ <https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_
+ to discuss the remaining decisions needed to finalize PEP 471,
+ specifically whether the ``DirEntry`` methods should follow symlinks
+ by default
-.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
-.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
-.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
-.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
-.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
-.. _`BetterWalk`: https://github.com/benhoyt/betterwalk
+* `Question on StackOverflow
+ <http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder>`_
+ about why ``os.walk()`` is slow and pointers on how to fix it (this
+ inspired the author of this PEP early on)
+
+* `BetterWalk <https://github.com/benhoyt/betterwalk>`_, this PEP's
+ author's previous attempt at this, on which the scandir code is based
Copyright
--
Repository URL: http://hg.python.org/peps