Threads not Improving Performance in Program

Ryan Rosario uclamathguy at gmail.com
Thu Mar 19 12:50:51 EDT 2009


I have a parser that needs to process 7 million files. After running
for 2 days, it had only processed 1.5 million. I want this script to
parse several files at once by using multiple threads: one for each
file currently being analyzed.
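
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes the processing rate stays constant, which is an assumption: only the two totals above are known.

```python
# Rough throughput estimate from the figures quoted above.
# Assumption: the rate observed over the first 2 days stays constant.
files_total = 7000000
files_done = 1500000
days_elapsed = 2.0

files_per_second = files_done / (days_elapsed * 24 * 60 * 60)
days_remaining = (files_total - files_done) / (files_done / days_elapsed)

print("%.1f files/s, about %.1f more days" % (files_per_second, days_remaining))
# prints: 8.7 files/s, about 7.3 more days
```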

My code walks every subdirectory of a top-level directory and, within
each subdirectory, every file. I suspect I am misunderstanding how to
use threads. The code is structured roughly like this:

mythreads = []
for directory in dirList:
    # some processing...
    for file in fileList:
        # Process is a class that extends threading.Thread
        p = Process(currDir, directory, file)
        mythreads.append(p)
        p.start()

# wait for every thread to finish
for thread in mythreads:
    thread.join()

The actual class that extends threading.Thread is below:

import os
import re
import threading
from xml.dom import minidom

class Process(threading.Thread):
    vlock = threading.Lock()    # shared by all threads; guards bad.log

    def __init__(self, currDir, directory, file):    # thread constructor
        threading.Thread.__init__(self)
        self.currDir = currDir
        self.directory = directory
        self.file = file

    def run(self):
        redirect = re.compile(r'#REDIRECT', re.I)
        xmldoc = minidom.parse(os.path.join(self.currDir, self.file))
        try:
            markup = xmldoc.firstChild.childNodes[-2].childNodes[-2].childNodes[-2].childNodes[0].data
        except Exception:
            # An error occurred: log the file name, one writer at a time
            Process.vlock.acquire()
            try:
                BAD = open("bad.log", "a")
                BAD.write(self.file + "\n")
                BAD.close()
            finally:
                Process.vlock.release()
            print("Error.")
            return
        # if successful, do more processing...


I experimented with varying numbers of threads and saw no performance
gain: processing 1000 files takes the same amount of time with threads
as it does without them. Any ideas on what I am doing wrong?
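
For reference, a minimal sketch of how such a timing comparison could be set up. This is not the harness used for the experiment above. The names run_with_threads and cpu_bound_stand_in are hypothetical, and the stand-in workload is plain Python arithmetic, not the real XML parser, so the timings it prints say nothing definitive about the parser itself.

```python
import threading
import time
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


def run_with_threads(items, work, num_threads):
    """Apply work(item) to every item using num_threads worker threads.

    Returns the elapsed wall-clock time in seconds.
    """
    pending = queue.Queue()
    for item in items:
        pending.put(item)

    def worker():
        # Each worker pulls items until the queue is empty.
        while True:
            try:
                item = pending.get_nowait()
            except queue.Empty:
                return
            work(item)

    start = time.time()
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


def cpu_bound_stand_in(n):
    # Hypothetical stand-in for parsing one file: pure-Python CPU work.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    jobs = [20000] * 100
    for count in (1, 2, 4, 8):
        elapsed = run_with_threads(jobs, cpu_bound_stand_in, count)
        print("%d thread(s): %.3f s" % (count, elapsed))
```

Unlike the loop above, which starts one thread per file, this sketch keeps the number of threads fixed and feeds them from a shared queue, so the thread count can be varied independently of the number of files.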



More information about the Python-list mailing list