[Python-Dev] os.path.walk() lacks 'depth first' option

Tim Peters tim.one@comcast.net
Mon, 21 Apr 2003 15:15:54 -0400


This is a multi-part message in MIME format.

--Boundary_(ID_obWR1ARDlxFPCnCmvHvojQ)
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: 7BIT

[Guido]
>>> But if I had to do it over again, I wouldn't have added walk() in the
>>> current form.

[Neil Schemenauer]
>> I think it's the perfect place for a generator.

[Guido]
> Absolutely!  So let's try to write something new based on generators,
> make it flexible enough so that it can handle pre-order or post-order
> visits, and then phase out os.walk().

I posted one last night, with a bug (it failed to pass the topdown flag
through to recursive calls).

Here's that again, with the bug repaired, sped up some, and with a
docstring.  Double duty:  the example in the docstring shows why we don't
want to make a special case out of sum([]):  empty lists can arise
naturally.

What else would people like in this?  I really like separating the directory
names from the plain-file names, so don't bother griping about that <wink>.

It's at least as fast as the current os.path.walk() (it's generally faster
for me, but times for this are extremely variable on Win98).  Removing the
internal recursion doesn't appear to make a measureable difference when
walking my Python tree, although because recursive generators require time
proportional to the current stack depth to deliver a result to the caller,
and to resume again, removing recursion could be much more efficient on an
extremely deep tree.  The biggest speedup I could find on Windows was via
using os.chdir() liberally, so that os.path.join() calls weren't needed, and
os.path.isdir() calls worked directly on one-component names.  I suspect
this has to do with that Win98 doesn't have an effective way to cache
directory lookups under the covers.  Even so, it only amounted to a 10%
speedup:  directory walking is plain slow on Win98 no matter how you do it.
The attached doesn't play any gross speed tricks.

--Boundary_(ID_obWR1ARDlxFPCnCmvHvojQ)
Content-type: text/plain; name=walk.py
Content-transfer-encoding: 7BIT
Content-disposition: attachment; filename=walk.py

def walk(top, topdown=True):
    """Directory tree generator.

    For each directory in the directory tree rooted at top (including top
    itself, but excluding '.' and '..'), yields a 3-tuple

        dirpath, dirnames, filenames

    dirpath is a string, the path to the directory.  dirnames is a list of
    the names of the subdirectories in dirpath (excluding '.' and '..').
    filenames is a list of the names of the non-directory files in dirpath.

    If optional arg 'topdown' is true or not specified, the triple for a
    directory is generated before the triples for any of its subdirectories
    (directories are generated top down).  If topdown is false, the triple
    for a directory is generated after the triples for all of its
    subdirectories (directories are generated bottom up).

    When topdown is true, the caller can modify the dirnames list in-place
    (e.g. via del or slice assignment), and walk will only recurse into the
    subdirectories whose names remain in dirnames; this can be used to prune
    the search, or to impose a specific order of visiting.  Modifying
    dirnames when topdown is false is ineffective, since the directories in
    dirnames have already been generated by the time dirnames itself is
    generated.

    Caution:  if you pass a relative pathname for top, don't change the
    current working directory between resumptions of walk.

    Example:

    from os.path import join, getsize
    for root, dirs, files in walk('python/Lib/email'):
        print root, "consumes",
        print sum([getsize(join(root, name)) for name in files]),
        print "bytes in", len(files), "non-directory files"
        if 'CVS' in dirs:
            dirs.remove('CVS')  # don't visit CVS directories
    """

    import os
    from os.path import join, isdir

    try:
        names = os.listdir(top)
    except os.error:
        return

    exceptions = ('.', '..')
    dirs, nondirs = [], []
    for name in names:
        if name not in exceptions:
            if isdir(join(top, name)):
                dirs.append(name)
            else:
                nondirs.append(name)
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        for x in walk(join(top, name), topdown):
            yield x
    if not topdown:
        yield top, dirs, nondirs

--Boundary_(ID_obWR1ARDlxFPCnCmvHvojQ)--