super fast walk
gangli at msn.com
Tue Sep 12 15:12:45 EDT 2000
When I was asked to develop a program to clean up a hard disk holding a
huge collection of directories and files (100K), I convinced my boss to
let me use Python to do it. I promised I would deliver the program 10
times faster than anybody doing it in C++, or 5 times faster than in
Java. I did deliver it on time, but the program ran very slowly: in a
simple case it was 10 times slower than the NT GUI Find. I used the
profile module to look into the problem. os.path.isdir alone took 46%
of the CPU time inside os.path.walk! I rewrote walk using a win32
function, which made my program as fast as NT Find. See the code below:
import sys, os, string, time, re
from win32api import FindFiles

FILE_ATTRIBUTE_DIRECTORY = 0x00000010
DIR_EXCLUDES = ('.', '..')

def win_walk(top, func, arg):
    """Directory tree walk with callback function.

    win_walk(top, func, arg) calls func(arg, d, f_objs, dirs) for each
    directory d in the tree rooted at top (including top itself); f_objs
    is a tuple of file attributes of all the files and subdirs in
    directory d. dirs is the list of subdirectories still to be walked;
    func may remove names from it to stop walking into them.
    """
    try:
        # Find all files under directory top. The return value is a
        # tuple of file attribute lists, in which item 0 is the file
        # attributes and item 8 is the name (see the win32api doc).
        f_objs = FindFiles(top+'/*')
        # sort out subdirs
        subdirs = []
        for f_obj in f_objs:
            if f_obj[0] & FILE_ATTRIBUTE_DIRECTORY and \
               f_obj[8] not in DIR_EXCLUDES:
                subdirs.append(f_obj[8])
    except os.error:
        return
    # call callback function
    func(arg, top, f_objs, subdirs)
    # do walking
    for dir in subdirs:
        name = top+'/'+dir
        win_walk(name, func, arg)

# remember current time
CUR_TIME = time.time()

# get time module compatible time from PyTime object
def wpy2time(pytime):
    f_time = int(pytime) # file last write time
    # fix win32 PyTime bug
    return f_time - time.altzone

# find all debug directories that are older than one week
debug_m = re.compile(r'abc.+\.debug', re.I).match
HOUR_24 = 24*3600
WEEK_1 = HOUR_24*7

def win_act(verbose, top, f_objs, dirs):
    if verbose > 1: print "checking directory:", top
    for f_obj in f_objs:
        dir = f_obj[8] # file name
        if dir not in dirs: # directory only
            continue
        if dir[-1] in ('g','G') and debug_m(dir):
            dirs.remove(dir) # stop looking into this
            f_time = wpy2time(f_obj[3]) # file last write time
            f_age = CUR_TIME - f_time
            if (f_age > WEEK_1): # file is older than one week
                path = top+'/'+dir
                print 'delete directory:', path

from os import listdir
from os.path import isdir, walk, getmtime

def act(verbose, top, names):
    if verbose > 1: print "checking directory:", top
    dirs = names[:]
    for dir in dirs:
        path = top+'/'+dir
        if not isdir(path): # directory only
            continue
        if dir[-1] in ('g','G') and debug_m(dir):
            names.remove(dir) # stop looking into this
            f_time = getmtime(path) # file last write time
            f_age = CUR_TIME - f_time
            if (f_age > WEEK_1): # file is older than one week
                print 'delete directory:', path

verbose = 0
top = "d:/projects"
tt = time.time()
os.path.walk(top, act, verbose)
print 'walk time spent:', time.time() - tt
tt = time.time()
win_walk(top, win_act, verbose)
print 'win_walk time spent:', time.time() - tt
# The End *****************************************
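
The profiling step mentioned at the top can be sketched roughly as follows. This is not the code from the original run: it is a minimal sketch in present-day Python 3, assuming cProfile and pstats in place of the old profile module. naive_walk, profile_walk and make_sample_tree are hypothetical helpers made up for illustration. naive_walk imitates the listdir-plus-isdir pattern of os.path.walk so that the per-entry isdir calls show up in the report.

```python
import cProfile
import io
import os
import pstats
import tempfile


def naive_walk(top, visit):
    """Walk a tree with listdir plus isdir, like the old os.path.walk."""
    names = os.listdir(top)
    visit(top, names)
    for name in names:
        path = os.path.join(top, name)
        # One extra stat call per entry: this is the cost a profile exposes.
        if os.path.isdir(path) and not os.path.islink(path):
            naive_walk(path, visit)


def profile_walk(top):
    """Profile naive_walk over top and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    naive_walk(top, lambda d, names: None)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats()
    return out.getvalue()


def make_sample_tree(root, dirs=5, files=20):
    """Create a small throwaway tree (hypothetical sizes, demo only)."""
    for d in range(dirs):
        sub = os.path.join(root, "dir%d" % d)
        os.mkdir(sub)
        for f in range(files):
            open(os.path.join(sub, "file%d.txt" % f), "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_sample_tree(root)
        print(profile_walk(root))
```

The share of time spent in isdir depends on the platform, the file system and the size of the tree, so the report from a small sample tree only shows where to look, not the 46% figure quoted above.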
P.S.
If we change os.listdir to return a list of UserString-like objects
that can also answer isdir and getmtime, we can replace os.path.walk
and take advantage of NT to speed up the whole process.
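
For readers on later Python versions: os.scandir (Python 3.5 and newer) returns entry objects along the lines of this idea, with is_dir() and stat() methods that can often answer from data the directory listing already supplied. The following is a minimal Python 3 sketch under that assumption, not part of the program above. scan_walk and collect_old_debug_dirs are hypothetical names, and the callback signature simply mirrors win_walk.

```python
import os
import re
import time

WEEK_1 = 7 * 24 * 3600
debug_m = re.compile(r'abc.+\.debug', re.I).match


def scan_walk(top, func, arg):
    """Call func(arg, top, entries, subdirs) for each directory under top."""
    try:
        with os.scandir(top) as it:
            entries = list(it)
    except OSError:
        return
    # is_dir() usually needs no extra system call per entry
    subdirs = [e for e in entries if e.is_dir(follow_symlinks=False)]
    func(arg, top, entries, subdirs)
    # func may have removed entries from subdirs to prune the walk
    for entry in subdirs:
        scan_walk(entry.path, func, arg)


def collect_old_debug_dirs(found, top, entries, subdirs):
    """Append paths of matching debug directories older than one week."""
    now = time.time()
    for entry in list(subdirs):      # iterate over a copy, we prune in place
        if debug_m(entry.name):
            subdirs.remove(entry)    # stop looking into this
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > WEEK_1:
                found.append(entry.path)


if __name__ == "__main__":
    found = []
    scan_walk("d:/projects", collect_old_debug_dirs, found)
    for path in found:
        print("delete directory:", path)
```

Like the code above, this only reports the directories it would delete. A missing or unreadable top directory is skipped silently.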