super fast walk

gangli at msn.com
Tue Sep 12 15:12:45 EDT 2000


When I was asked to write a program to clean up a hard disk holding a
huge collection of directories and files (100K), I convinced my boss to
let me do it in Python.  I promised to deliver the program 10 times
faster than anybody could in C++, or 5 times faster than in Java.  I
did deliver it on time, but the program ran very slowly: in a simple
case it was 10 times slower than the NT GUI Find.  I used the profile
module to look into the problem.  os.path.isdir alone took 46% of the
CPU time inside os.path.walk!  I rewrote walk using a win32 function
and sped my program up until it was as fast as NT Find.  See the code
below:

import sys, os, string, time, re

from win32api import FindFiles, error as win32_error
FILE_ATTRIBUTE_DIRECTORY = 0x00000010
DIR_EXCLUDES = ('.', '..')
def win_walk(top, func, arg):
    """Directory tree walk with callback function.

    win_walk(top, func, arg) calls func(arg, d, f_objs, subdirs) for each
    directory d in the tree rooted at top (including top itself).
    f_objs holds the file attribute tuples of all the files and subdirs
    in directory d; subdirs is the list of subdirectory names that will
    be walked next.  The callback may remove names from subdirs to stop
    the walk from descending into them.
    """
    try:
        # Find all entries directly under directory top.  Each entry is
        # a tuple of file attributes: item 0 is the File Attributes bit
        # mask, item 8 is the name (see win32api doc).
        f_objs = FindFiles(top+'/*')
    except win32_error:
        # FindFiles reports failures as win32api.error, not os.error
        return
    # sort out subdirs
    subdirs = []
    for f_obj in f_objs:
        if f_obj[0] & FILE_ATTRIBUTE_DIRECTORY and \
            f_obj[8] not in DIR_EXCLUDES:
            subdirs.append(f_obj[8])
    # call callback function
    func(arg, top, f_objs, subdirs)
    # do walking
    for subdir in subdirs:
        name = top+'/'+subdir
        win_walk(name, func, arg)

# remember current time
CUR_TIME = time.time()
# get time module compatible time from PyTime object
def wpy2time(pytime):
    f_time = int(pytime) # file last write time
    #fix win32 PyTime bug
    return f_time - time.altzone

# find all debug directories that are older than one week
debug_m = re.compile(r'abc.+\.debug', re.I).match
HOUR_24 = 24*3600
WEEK_1 = HOUR_24*7
def win_act(verbose, top, f_objs, dirs):
    if verbose > 1: print "checking directory:", top
    for f_obj in f_objs:
        dir = f_obj[8] # file name
        if dir not in dirs: # directory only
            continue
        if dir[-1] in ('g','G') and debug_m(dir):
            dirs.remove(dir) #stop looking into this
            f_time = wpy2time(f_obj[3]) # file last write time
            f_age = CUR_TIME - f_time
            if (f_age > WEEK_1): # directory is older than one week
                path = top+'/'+dir
                print 'delete directory:', path

from os import listdir
from os.path import isdir, walk, getmtime

def act(verbose, top, names):
    if verbose > 1: print "checking directory:", top
    dirs = names[:]
    for dir in dirs:
        path = top+'/'+dir
        if not isdir(path): # directory only
            continue
        if dir[-1] in ('g','G') and debug_m(dir):
            names.remove(dir) #stop looking into this
            f_time = getmtime(path) # file last write time
            f_age = CUR_TIME - f_time
            if (f_age > WEEK_1): # file is older than one week
                print 'delete directory:', path

verbose = 0
top = "d:/projects"

tt = time.time()
os.path.walk(top, act, verbose)
print 'walk time spent:', time.time() - tt

tt = time.time()
win_walk(top, win_act, verbose)
print 'win_walk time spent:', time.time() - tt

# The End *****************************************
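The profiling step mentioned at the top is not shown in the post. A minimal sketch of how such a measurement could be reproduced is given here. It is written for current Python 3 with cProfile and os.walk rather than the profile module and os.path.walk of the post, and the helper names count_dirs and profile_walk are made up for illustration.

```python
import cProfile
import io
import os
import pstats
import tempfile


def count_dirs(top):
    """Walk the tree under top and count the directories visited."""
    total = 0
    for _dirpath, _dirnames, _filenames in os.walk(top):
        total += 1
    return total


def profile_walk(top, limit=10):
    """Profile count_dirs(top); return (directory count, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(count_dirs, top)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buf.getvalue()


if __name__ == "__main__":
    # Hypothetical demo tree; point this at a real directory to measure it.
    with tempfile.TemporaryDirectory() as demo_top:
        os.makedirs(os.path.join(demo_top, "a", "b"))
        count, report = profile_walk(demo_top)
        print("directories visited:", count)
        print(report)
```

The report lists each function with its call count and cumulative time, which is the kind of output that would show a single call such as isdir dominating a walk.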


P.S.

If os.listdir were changed to return a list of UserString-like objects
that can also answer isdir and getmtime, os.path.walk could be
rewritten on top of it and take advantage of NT to speed up the whole
process.
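Later Python versions (3.5 and up) provide os.scandir, which returns entry objects that can answer is_dir() and stat() without a separate os.path call per name, much as the P.S. suggests. The sketch below shows how the same walk could look on top of it. The names scan_walk and old_dirs are made up for illustration, and the name test is simplified to a case-insensitive suffix check instead of the regular expression used in the post.

```python
import os
import time

WEEK_SECONDS = 7 * 24 * 3600


def scan_walk(top, func, arg):
    """Call func(arg, d, entries, subdirs) for top and each directory below it.

    entries is the list of os.DirEntry objects found in d; subdirs is the
    list of subdirectory names, which func may shrink to prune the walk.
    """
    try:
        with os.scandir(top) as it:
            entries = list(it)
    except OSError:
        return
    subdirs = [e.name for e in entries if e.is_dir(follow_symlinks=False)]
    func(arg, top, entries, subdirs)
    for name in subdirs:
        scan_walk(os.path.join(top, name), func, arg)


def old_dirs(top, suffix=".debug", max_age=WEEK_SECONDS, now=None):
    """Return paths of directories whose name ends in suffix and whose
    last modification is more than max_age seconds before now."""
    now = time.time() if now is None else now

    def visit(found, d, entries, subdirs):
        for entry in entries:
            if entry.name in subdirs and entry.name.lower().endswith(suffix):
                subdirs.remove(entry.name)  # do not descend into it
                mtime = entry.stat(follow_symlinks=False).st_mtime
                if now - mtime > max_age:
                    found.append(entry.path)

    found = []
    scan_walk(top, visit, found)
    return found
```

Calling old_dirs("d:/projects") would then list the candidates for deletion. Whether this matches the speed of the FindFiles version on a given machine is something to measure, not assume.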

