[Python-ideas] BetterWalk, a better and faster os.walk() for Python
Andrew Barnert
abarnert at yahoo.com
Sun Nov 25 04:55:51 CET 2012
From: Mike Meyer <mwm at mired.org>
Sent: Sat, November 24, 2012 4:56:16 PM
> On Sat, Nov 24, 2012 at 5:27 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> > (In fact, I think an os.walk replacement based on the fts API, which
> > never uses iterdir_stat, would be the best answer, but let me put
> > more thought into that idea...)
>
> A couple of us have looked into this, and there's an impedance
> mismatch between the os.walk API and the fts API, in that fts doesn't
> provide the d_type information, so you're forced to make the stat
> calls that this rewrite was trying to avoid.
Actually, fts does provide all the information you need in fts_info—but you
don't need it anyway, because fts itself is driving the walk, not os.walk. Using
fts to implement iterdir_stat, and then using that to implement os.walk, may or
may not be doable, but it's silly.
> If you're thinking about providing an API that looks more like fts,
> that might be a better answer - if you design it so it also works on
> Windows.
Yes, that was the idea, "an os.walk replacement based on the fts API", as
opposed to "a reimplementation of os.walk that uses fts". I believe implementing
the fts interface on Windows and other non-POSIX platforms would be much easier
than implementing os.walk efficiently on top of iterdir_stat. Also, it would
ensure that all of the complexities had been thought through (following broken
links, detecting hardlink and symlink cycles, etc.). And I believe the interface
is actually better.
That last claim may be a bit controversial. But simple use cases are trivially
convertible between os.walk and fts, while for anything less simple, I think fts
is much more readable.
For example, if you want to skip over any subdirectories whose names start with
'.':
# make sure you passed topdown=True or this will silently do nothing
for i, dn in reversed(enumerate(dirnames)):
if dn.startswith('.'):
del dirnames[i]
if ent.info == fts.D and ent.name.startswith('.'):
f.skip(ent)
Or, if you only want to traverse 3 levels down:
if len(os.path.split(dirpath)) > 2:
del dirnames[:]
if ent.depth > 3:
f.skip(ent)
Or, if you want to avoid traversing devices:
# make sure you passed topdown=True
fs0 = os.stat(dirname)
for i, dirname in reversed(enumerate(dirnames)):
fs1 = os.stat(os.path.join(dirpath, dirname))
if fs0.st_dev != fs1.st_dev:
del dirnames[i]
# just add fts.NODEV to the open options
In every case, the fts version is simpler, more explicit, more concise, and
easier to get right.
As a 0th draft, the only changes to the interface described
at http://nixdoc.net/man-pages/FreeBSD/man3/fts_open.3.html would be:
* fts.open instead of fts_open.
* open returns an FTS object, which is a context manager and an iterator (which
just calls read).
* Everything else is a method on the FTS object, so you call f.read() instead
of fts_read(f).
* open takes key and reverse params instead of compar, which work just as in
list.sort, except that key=None means no sorting. If you want to pass a
compar-like function, use functools.cmp_to_key.
* key=None doesn't guarantee that files are listed "in the order listed in the
directory"; they're listed in an unspecified order.
* setclient, getclient, getstream are unnecessary, as are the number and
pointer fields in FTSENT; if you want to pass user data down to your key
function, just bind it in.
* read returns an FTSENT, which has all the same fields as the C API struct,
but without the fts_ prefixes, except that number and pointer may not be
present.
* open's options are the same ones defined in fts.h, but without the FTS_
prefix, and if you pass neither PHYSICAL nor LOGICAL, you get PHYSICAL rather
than an error.
The default is PHYSICAL, instead of 0 (which just returns an error if you pass
it).
* Not passing NOCHDIR doesn't guarantee that fts does chdir, just allows it to.
This is actually already true for the BSD interface, but every popular
implementation does the same thing (actually, they're all trivial variations on
the same implementation), so people write code that relies on it, so the
documentation should explicitly say that it's not guaranteed.
* Instead of fts_set, provide specific again, follow, and skip methods.
* children returns a list of FTSENT objects, instead of returning the first one
with ent.link holding the next one.
I think we could add an fts.walk that's a drop-in replacement for simple uses of
os.walk, but doesn't try to provide the entire interface, which could be useful
as a transition. But I don't think it's necessary.
Another potential option would be to make the iterator close itself when
consumed, so you don't need to make it a context manager, again making it easier
to drop in as a replacement for os.walk, but again I don't think it's necessary.
On POSIX-like platforms without native fts, the FreeBSD fts.c is almost
completely portable. It does rely on fchdir and statfs, but these can both be
stubbed out. If you don't have fchdir, you always get the equivalent of NOCHDIR.
If you don't have statfs, you don't get the ufslinks optimization.
The quick&dirty ctypes wrapper around native fts
at https://github.com/abarnert/py-fts approximates this interface. However, I'll
try to write up a complete implementation that wraps native fts if available, or
does it in pure Python with iterdir_stat (possibly with some minor changes) and
existing stuff in os otherwise. I suspect I'll come up with a few changes while
implementing it—and of course it's possible I'll find out that it's not so easy
to do on Windows, but at the moment I'm pretty confident it will be.
More information about the Python-ideas
mailing list