[Tutor] about a program

Tue Apr 15 10:43:01 2003

Hi everyone,

I am trying to implement a program that searches for keywords stored in several documents. It first indexes the documents and then computes a weighting for each document from the frequency of each keyword (term frequency, TF) and the inverse document frequency (IDF):

IDF = log(number of elements in the collection / frequency of each element)

weighting = IDF * TF
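
As a minimal sketch of the two formulas, assuming "number of elements in the collection" means the total number of documents and "frequency of each element" means the number of documents that contain the keyword (the usual reading of IDF, but an assumption here). The function names are hypothetical, and the code targets Python 3.

```python
import math

def idf(total_docs, doc_freq):
    # Inverse document frequency: log(N / number of documents with the word).
    return math.log(total_docs / doc_freq)

def weighting(tf, total_docs, doc_freq):
    # weighting = IDF * TF, as in the formula above.
    return tf * idf(total_docs, doc_freq)
```

A keyword that occurs in every document gets a weighting of zero, because log(N / N) is 0, so very common words stop contributing to the ranking.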

I have already set up the indexing using the fileinput module, which records each word together with the file and line number where it occurs, like this:

55 comments [('File-03.txt', 4)]

56 speech [('tmp.txt', 5)]

57 frequencies [('tmp.txt', 5)]

58 new [('tmp.txt', 5)]

59 acknowledgments [('File-03.txt', 3)]
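
One possible way to get the two counts out of entries like these, assuming each stored pair is (filename, line number) as in the attached script. The helper name `term_stats` is hypothetical, and the sketch targets Python 3.

```python
from collections import Counter

def term_stats(postings):
    # postings: list of (filename, line_number) pairs for one word.
    # Counting pairs per file gives the term frequency in each file;
    # the number of distinct files gives the document frequency.
    tf_per_file = Counter(filename for filename, _ in postings)
    return dict(tf_per_file), len(tf_per_file)
```

For a word found twice in one file and once in another, this returns a per-file count for each file and a document frequency of 2.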

Can anyone give me an idea of how to incorporate the weighting? This is the design I have chosen, and I want to make it work.

I am really stuck.

Any suggestions would be appreciated. I have also attached the code.

Thanks in advance.

Attachment: genIndex.py

import fileinput
import glob
import re
import shelve
import sys
#from TextSplitter import TextSplitter

# Match either a whole <...> tag (skipped when indexing) or a word
# made of letters, digits, underscores, and hyphens.
aword = re.compile(r'<[^<>]*>|\b[\w-]+\b')
index={}

# Generate an index in file indexFileName

def genIndex(indexFileName, extension):
   """ Index every file in the current directory whose name ends
       in the given extension. Each word is stored in the shelve
       file indexFileName with a list of (filename, line number)
       pairs showing where the word was found. """

   stop_list = ['from','to','that','by','with','on','the',
                'a','and','these','of','or','for','can',
                'it','is','this','in','an','you', 'your',
                'yours','our','his','will','some','are','et',
                'we','most','be','those','there','such','other',
                'like']

   index.clear()  # start fresh so repeated calls do not accumulate entries
   fname='*.'+extension

   #print for testing purposes   
   print "----------------------------------------------------"
   print fname  #print for testing
   for line in fileinput.input(glob.glob(fname)):
      # record the file name and line number where this line was found
      location = fileinput.filename(), fileinput.filelineno()
      # find every token on the line, ignoring case
      for word in aword.findall(line.lower()):
         # skip markup tags, which the pattern matches as single tokens
         if word[0] != '<':
            # add this location to the word's list in the index
            index.setdefault(word,[]).append(location)

   # open a shelve file and store the result of indexing         
   shelf = shelve.open(indexFileName,'n')
   count_words = 0
   for word in index:
      #eliminate all stoplist words
      if word not in stop_list:

         shelf[word] = index[word]
         count_words += 1 # number of distinct words stored so far

         #print for testing purposes
         print count_words ,word , "\t" , shelf[word]
   print "total words", count_words
   print "---------------------------------------------"
   shelf.close()

#--------------------------------------------------------------
if __name__ == '__main__':
    # index the .txt files in the current directory into each named shelve file
    for arg in sys.argv[1:]:
        genIndex(arg, 'txt')
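
A rough sketch of how the weighting could be computed from an index built this way, assuming each entry maps a word to a list of (filename, line number) pairs and that "frequency of each element" means the number of files containing the word. The name `weight_index` and its `total_docs` parameter are hypothetical. Unlike the script above, which uses Python 2 print statements, this sketch targets Python 3.

```python
import math
from collections import Counter

def weight_index(index, total_docs=None):
    # index maps word -> list of (filename, line_number) pairs. A dict or
    # an open shelve file both work, since each supports .items().
    if total_docs is None:
        # Assumption: every document contributed at least one indexed word,
        # so the distinct filenames in the index are the whole collection.
        total_docs = len({fn for postings in index.values()
                          for fn, _ in postings})
    weights = {}
    for word, postings in index.items():
        tf = Counter(fn for fn, _ in postings)   # occurrences per file
        idf = math.log(total_docs / len(tf))     # len(tf) = files with word
        weights[word] = {fn: count * idf for fn, count in tf.items()}
    return weights
```

The result maps each word to a dictionary of per-file weightings, so ranking the files for a keyword would amount to sorting that inner dictionary by value.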
