[Python-ideas] BetterWalk, a better and faster os.walk() for Python

Andrew Barnert abarnert at yahoo.com
Sun Nov 25 04:55:51 CET 2012


From: Mike Meyer <mwm at mired.org>
Sent: Sat, November 24, 2012 4:56:16 PM


> On Sat, Nov 24, 2012 at 5:27 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> > (In  fact, I think an os.walk replacement based on the fts API, which
> > never  uses iterdir_stat, would be the best answer, but let me put
> > more thought  into that idea...)
> 
> A couple of us have looked into this, and there's an  impedance
> mismatch between the os.walk API and the fts API, in that fts  doesn't
> provide the d_type information, so you're forced to make the  stat
> calls that this rewrite was trying to avoid.

Actually, fts does provide all the information you need in fts_info—but you 
don't need it anyway, because fts itself is driving the walk, not os.walk. Using 
fts to implement iterdir_stat, and then using that to implement os.walk, may or 
may not be doable, but it's silly.

> If you're thinking  about providing an API that looks more like fts,
> that might be a better  answer - if you design it so it also works  on
> Windows.


Yes, that was the idea, "an os.walk replacement based on the fts API", as 
opposed to "a reimplementation of os.walk that uses fts". I believe implementing 
the fts interface on Windows and other non-POSIX platforms would be much easier 
than implementing os.walk efficiently on top of iterdir_stat. Also, it would 
ensure that all of the complexities had been thought through (following broken 
links, detecting hardlink and symlink cycles, etc.). And I believe the interface 
is actually better.

That last claim may be a bit controversial. But simple use cases are trivially 
convertible between os.walk and fts, while for anything less simple, I think fts 
is much more readable. 

For example, if you want to skip over any subdirectories whose names start with 
'.':

    # make sure you passed topdown=True or this will silently do nothing
    for i, dn in reversed(enumerate(dirnames)):
        if dn.startswith('.'):
            del dirnames[i]

    if ent.info == fts.D and ent.name.startswith('.'):
        f.skip(ent)

Or, if you only want to traverse 3 levels down:

    if len(os.path.split(dirpath)) > 2:
        del dirnames[:]

    if ent.depth > 3:
        f.skip(ent)

Or, if you want to avoid traversing devices:

    # make sure you passed topdown=True
    fs0 = os.stat(dirname)
    for i, dirname in reversed(enumerate(dirnames)):
        fs1 = os.stat(os.path.join(dirpath, dirname))
        if fs0.st_dev != fs1.st_dev:
            del dirnames[i]

    # just add fts.NODEV to the open options

In every case, the fts version is simpler, more explicit, more concise, and 
easier to get right.

As a 0th draft, the only changes to the interface described 
at http://nixdoc.net/man-pages/FreeBSD/man3/fts_open.3.html would be:

 * fts.open instead of fts_open.

 * open returns an FTS object, which is a context manager and an iterator (which 
just calls read).

 * Everything else is a method on the FTS object, so you call f.read() instead 
of fts_read(f).

 * open takes key and reverse params instead of compar, which work just as in 
list.sort, except that key=None means no sorting. If you want to pass a 
compar-like function, use functools.cmp_to_key.

 * key=None doesn't guarantee that files are listed "in the order listed in the 
directory"; they're listed in an unspecified order.

 * setclient, getclient, getstream are unnecessary, as are the number and 
pointer fields in FTSENT; if you want to pass user data down to your key 
function, just bind it in.

 * read returns an FTSENT, which has all the same fields as the C API struct, 
but without the fts_ prefixes, except that number and pointer may not be 
present.

 * open's options are the same ones defined in fts.h, but without the FTS_ 
prefix, and if you pass neither PHYSICAL nor LOGICAL, you get PHYSICAL rather 
than an error.
The default is PHYSICAL, instead of 0 (which just returns an error if you pass 
it).

 * Not passing NOCHDIR doesn't guarantee that fts does chdir, just allows it to. 
This is actually already true for the BSD interface, but every popular 
implementation does the same thing (actually, they're all trivial variations on 
the same implementation), so people write code that relies on it, so the 
documentation should explicitly say that it's not guaranteed.

 * Instead of fts_set, provide specific again, follow, and skip methods.

 * children returns a list of FTSENT objects, instead of returning the first one 
with ent.link holding the next one.

I think we could add an fts.walk that's a drop-in replacement for simple uses of 
os.walk, but doesn't try to provide the entire interface, which could be useful 
as a transition. But I don't think it's necessary.

Another potential option would be to make the iterator close itself when 
consumed, so you don't need to make it a context manager, again making it easier 
to drop in as a replacement for os.walk, but again I don't think it's necessary.

On POSIX-like platforms without native fts, the FreeBSD fts.c is almost 
completely portable. It does rely on fchdir and statfs, but these can both be 
stubbed out. If you don't have fchdir, you always get the equivalent of NOCHDIR. 
If you don't have statfs, you don't get the ufslinks optimization.

The quick&dirty ctypes wrapper around native fts 
at https://github.com/abarnert/py-fts approximates this interface. However, I'll 
try to write up a complete implementation that wraps native fts if available, or 
does it in pure Python with iterdir_stat (possibly with some minor changes) and 
existing stuff in os otherwise. I suspect I'll come up with a few changes while 
implementing it—and of course it's possible I'll find out that it's not so easy 
to do on Windows, but at the moment I'm pretty confident it will be.



More information about the Python-ideas mailing list