Threads not Improving Performance in Program

odeits odeits at gmail.com
Thu Mar 19 12:59:03 EDT 2009


On Mar 19, 9:50 am, Ryan Rosario <uclamath... at gmail.com> wrote:
> I have a parser that needs to process 7 million files. After running
> for 2 days, it had only processed 1.5 million. I want this script to
> parse several files at once by using multiple threads: one for each
> file currently being analyzed.
>
> My code iterates through all of the directories within a directory,
> and at each directory, iterates through each file in that directory. I
> structured my code something like this. I think I might be
> misunderstanding how to use threads:
>
> mythreads = []
> for directory in dirList:
>  #some processing...
>  for file in fileList:
>     p = Process(currDir,directory,file)    # class that extends threading.Thread
>     mythreads.append(p)
>     p.start()
>
> for thread in mythreads:
>  thread.join()
>  del thread
>
> The actual class that extends threading.Thread is below:
>
> class Process(threading.Thread):
>         vlock = threading.Lock()
>         def __init__(self,currDir,directory,file):      #thread constructor
>                 threading.Thread.__init__(self)
>                 self.currDir = currDir
>                 self.directory = directory
>                 self.file = file
>         def run(self):
>                 redirect = re.compile(r'#REDIRECT',re.I)
>                 xmldoc = minidom.parse(os.path.join(self.currDir,self.file))
>                 try:
>                         markup = xmldoc.firstChild.childNodes[-2].childNodes[-2].childNodes[-2].childNodes[0].data
>                 except:
>                         #An error occurred
>                         Process.vlock.acquire()
>                         BAD = open("bad.log","a")
>                         BAD.writelines(self.file + "\n")
>                         BAD.close()
>                         Process.vlock.release()
>                         print "Error."
>                         return
>                 #if successful, do more processing...
>
> I experimented with varying numbers of threads and saw no performance
> gain. The code takes the same amount of time to process 1000 files as
> it did without threads. Any ideas on what I am doing wrong?

Perhaps the bottleneck is the I/O. How big are the files you are trying
to parse? Another possible bottleneck is that all your threads share
memory (and, in CPython, the global interpreter lock, so CPU-bound
parsing won't actually run in parallel anyway). Thread construction is
also expensive in Python, so try looking up a thread pool class.
(Thread pools are collections of threads that do work; when they
finish, they go idle until you give them more work, so you aren't
constantly creating new threads, which is expensive.)
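
A rough sketch of the pattern, built on the stdlib Queue module
(untested; process_one() is a stand-in for whatever your run() method
does now, and dirList/fileList/currDir are the names from your
snippet):

import threading
import Queue

NUM_WORKERS = 8   # tune this; past some point more threads only add overhead

def worker(q):
    while True:
        job = q.get()
        if job is None:                 # sentinel: no more work, shut down
            q.task_done()
            return
        currDir, directory, file = job
        try:
            process_one(currDir, directory, file)   # your parsing code here
        finally:
            q.task_done()

q = Queue.Queue()
workers = [threading.Thread(target=worker, args=(q,))
           for _ in xrange(NUM_WORKERS)]
for w in workers:
    w.start()

for directory in dirList:
    # ...some processing...
    for file in fileList:
        q.put((currDir, directory, file))

for _ in workers:                       # one sentinel per worker
    q.put(None)
q.join()

Note that if the parsing itself (rather than disk I/O) turns out to be
the bottleneck, a thread pool won't buy you much either, because of the
GIL; the multiprocessing module gives you the same pattern with real
processes.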


