Most efficient method to search text?

Robin Siebler robin.siebler at corp.palm.com
Tue Oct 15 20:35:29 EDT 2002


I wrote a script to search a slew of files for certain words and
names.  However, I am sure that there has to be a faster/better way to
do it.  Here is what I am doing:

1.  Load words to exclude into a list.
2.  Load names to exclude into a list.
3.  Load words to include into a list.
4.  Remove any duplicates from the name list.
5.  Generate a list of files to search.
6.  Open the 1st file.
7.  Search each line:
    a.  For a word (line.find(word)).  If I get a hit, I then use an RE
        to perform a more exact search (the pattern I am using is
        '\w+word|word\w+|word').
        i.  Compare any matches against the include/exclude list.  If
            it is a match, keep searching.  Otherwise, log the line.
    b.  For a name (line.find(name)).  If I get a hit, I then use an RE
        to perform a more exact search (the pattern that I am using is
        '\bname\b').  If I get a hit, log the line.
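
The two-stage check in step 7 can be sketched as follows. This is a minimal illustration written for Python 3, not the script from the post: the function names unique and find_hits and their parameters are hypothetical, re.escape is an addition so that a search term containing regex metacharacters is treated literally, and the underscore handling of the posted code is left out.

```python
import re


def unique(items):
    """Drop duplicates while keeping first-seen order (step 4)."""
    return list(dict.fromkeys(items))


def find_hits(line, exclude_words, include_words, names):
    """Return (kind, term) pairs for proscribed words and names in one line."""
    hits = []
    lowered = line.lower()
    include = {w.lower() for w in include_words}

    for word in unique(exclude_words):
        w = word.lower()
        # Stage 1: cheap substring test, like line.find(word) != -1.
        if w not in lowered:
            continue
        # Stage 2: same shape as '\w+word|word\w+|word' from the post.
        esc = re.escape(w)
        pattern = re.compile(r'\w+{0}|{0}\w+|{0}'.format(esc), re.IGNORECASE)
        matches = unique(m.lower() for m in pattern.findall(line))
        # Log the word unless every match is on the include list.
        if any(m not in include for m in matches):
            hits.append(('word', word))

    for name in unique(names):
        # Stage 1: substring test; stage 2: whole-word match with \b.
        if name.lower() not in lowered:
            continue
        if re.search(r'\b' + re.escape(name) + r'\b', line, re.IGNORECASE):
            hits.append(('name', name))

    return hits
```

With exclude word 'word', include word 'password' and name 'bob', the line 'A wordy reply to Bob.' is reported for both the word and the name. The line 'The password for Bob_Smith' is reported for neither: 'password' is on the include list, and '\bbob\b' does not match inside 'Bob_Smith' because the underscore counts as a word character.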

The reason that I first search using line.find() is that in the past I
have done some searches for simple strings and found that line.find() 
was much faster than an RE, so I am only using an RE when I need to.  
I am including my code below (unfortunately, Google screws the
formatting up).  Any suggestions to improve it would be appreciated.

import re

# var, listFiles, clean_up, FunkyList and unique are defined elsewhere
# in the script.
LinesFound = []; LinesFound = FunkyList(LinesFound); msg = ""
LineNum = 0; Header = 0
LogFile = var['LogFile']
print '\nGenerating file list...'   # Let user see that script is running
FilesToSearch = listFiles(var['SearchPath'], var['SearchExt'],
                          var['Recurse'])
if len(FilesToSearch) == 0:
    print 'No Files Found!'
    clean_up(var)
else:
    print 'Number of files to search: ' + str(len(FilesToSearch))
print 'Processing files...',  # Let user see that script is running
while FilesToSearch:
    FileBeingSearched = FilesToSearch.pop()  # Get/remove last file name
    open_file = open(FileBeingSearched)
    print "\nProcessing " + FileBeingSearched,
    for line in open_file.xreadlines():
        LineNum += 1
        # Let user see that script is running
        if LineNum > 24 and LineNum % 100 == 0:
            print ".",
        # Search line for proscribed words
        for word in var['ExcludeWords']:
            # Perform a case insensitive search for word *anywhere*
            # in the line
            if line.lower().find(word.lower()) != -1:
                pattern = r'\w+word|word\w+|word'
                pattern = pattern.replace('word', word.lower())
                s_word = re.compile(pattern, re.IGNORECASE)
                # If the phrase was found, get a list containing the
                # matches
                match_found = unique(s_word.findall(line))
                for match in match_found:
                    # If the word contains an underscore
                    if match.find('_') != -1:
                        w_find = re.compile(r'\w+', re.IGNORECASE)
                        words = ''
                        # Split identifiers such as foo_bar into
                        # separate words before searching again
                        for item in w_find.findall(line):
                            if item.find('_') != -1:
                                words = (words + ' ' +
                                         ' '.join(item.split('_')))
                            else:
                                words = words + ' ' + item
                        m_found = unique(s_word.findall(words))
                        for item in m_found:
                            if (item in var['ExcludeWords'] and
                                    item not in var['IncludeWords']):
                                msg = ('\tLine ' + str(LineNum) +
                                       ': The word "' + word +
                                       '" was found in: "' +
                                       line.strip() + '"')
                                LinesFound.append(msg)
                                break
                    # Is the word in IncludeWords?
                    elif match not in var['IncludeWords']:
                        msg = ('\tLine ' + str(LineNum) +
                               ': The word "' + word +
                               '" was found in: "' +
                               line.strip() + '"')
                        LinesFound.append(msg)
                        break
        # Search line for names
        for name in var['Names']:
            # Perform a case insensitive search
            if line.lower().find(name.lower()) != -1:
                # Raw string: in a plain string '\b' is a backspace
                # character, not a regex word boundary
                pattern = r'\bname\b'
                pattern = pattern.replace('name', name)
                s_word = re.compile(pattern, re.IGNORECASE)
                # If the phrase was found, get a list containing the
                # matches
                match_found = unique(s_word.findall(line))
                for match in match_found:
                    if match in var['Names']:
                        msg = ('\tLine ' + str(LineNum) +
                               ': The name "' + name +
                               '" was found in: "' +
                               line.strip() + '"')
                        LinesFound.append(msg)
                        break
        # Write out any hits for this line; the header goes out once
        # per file
        if len(LinesFound) > 0:
            if not Header:
                LogFile.write('Proscribed words were found in ' +
                              FileBeingSearched + '\n')
                LogFile.write('\n')
                Header = 1
            for found in LinesFound:
                LogFile.write(found + '\n')
            LogFile.write('\n')
            LinesFound = []
    open_file.close()
    LineNum = 0; Header = 0; hit = 0
print '\nProcessing Complete.'
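
The observation that line.find() was faster than an RE for simple strings can be checked on one's own data with the standard timeit module. The sketch below is written for Python 3 and only sets up the comparison; the helper names contains_find, contains_re and compare are hypothetical, and no particular result is assumed, since timings depend on the Python version, the input, and whether the pattern is compiled once or on every line.

```python
import re
import timeit


def contains_find(line, word):
    """Case-insensitive substring test using str.find()."""
    return line.lower().find(word.lower()) != -1


def contains_re(line, compiled):
    """The same test using a pattern that was compiled beforehand."""
    return compiled.search(line) is not None


def compare(line, word, number=10000):
    """Return (seconds_for_find, seconds_for_re) over `number` calls each."""
    # Compile once, outside the timed call, so only the search is measured.
    compiled = re.compile(re.escape(word), re.IGNORECASE)
    t_find = timeit.timeit(lambda: contains_find(line, word), number=number)
    t_re = timeit.timeit(lambda: contains_re(line, compiled), number=number)
    return t_find, t_re
```

Calling compare() on a representative line and word from the files being searched gives two timings to set side by side. Both helpers answer the same yes/no question, so they can also be checked against each other for agreement before timing them.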


