[Tutor] about a program
Abdirizak abdi
a_abdi406@yahoo.com
Tue Apr 15 10:43:01 2003
Hi everyone,
I am trying to implement a program that searches for keywords stored in several documents. It first indexes the documents and then computes a weighting for each document from the frequency of each keyword (term frequency, TF) and the inverse document frequency (IDF):
IDF = log(number of elements in the collection / frequency of each element)
weighting = IDF * TF
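The two formulas can be checked with a short worked example. This is only a sketch: it assumes that "number of elements in the collection" means the number of documents and that "frequency of each element" means the number of documents containing the word, which is a common reading of IDF but is not stated in the message. The function names and the numbers are made up for illustration.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: log(N / df).

    Assumption: num_docs is the number of documents in the collection
    and doc_freq is the number of documents that contain the word.
    """
    return math.log(float(num_docs) / doc_freq)

def weight(tf, num_docs, doc_freq):
    """Weighting = IDF * TF, as in the formulas above."""
    return tf * idf(num_docs, doc_freq)

# Hypothetical numbers: a word that occurs 3 times in one file
# and appears in 1 of the 4 files in the collection.
print(weight(3, 4, 1))

# A word that appears in every file gets IDF 0, so its weight is 0.
print(weight(3, 4, 4))
```

With this reading, a word found in every document contributes nothing to the ranking, while a word confined to few documents is weighted up.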
I have already set up the indexing using the fileinput.input() function, which records each word and the files it occurs in, like this:
55 comments [('File-03.txt', 4)]
56 speech [('tmp.txt', 5)]
57 frequencies [('tmp.txt', 5)]
58 new [('tmp.txt', 5)]
59 acknowledgments [('File-03.txt', 3)]
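Each tuple in this output is a (filename, line number) pair, one per occurrence, as recorded by the attached genIndex.py; it is not a count. A term frequency can be derived from it by counting the pairs per file. A minimal sketch, assuming Python 3 and a hypothetical helper named term_frequencies that is not part of the attached code:

```python
from collections import Counter

def term_frequencies(locations):
    """Count how many times a word occurs in each file.

    `locations` is a list of (filename, line_number) tuples,
    one tuple per occurrence of the word.
    """
    return Counter(filename for filename, _lineno in locations)

# Hypothetical index entry: a word seen twice in tmp.txt
# and once in File-03.txt.
locations = [('tmp.txt', 5), ('tmp.txt', 9), ('File-03.txt', 4)]
tf = term_frequencies(locations)

print(dict(tf))   # occurrences per file (TF)
print(len(tf))    # number of files containing the word
```

The length of the resulting counter is the number of files that contain the word, which is the quantity an IDF based on document counts would need.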
Can anyone give me an idea of how I can incorporate the weighting? This is the design that I have chosen and I want to make it work.
I am really stuck.
Any suggestions would be appreciated. I have also attached the code.
Thanks in advance.
Attachment: genIndex.py
import glob
import fileinput, re, shelve

#from TextSplitter import TextSplitter
#aword = re.compile(r'\b[\w-]+\b')
# Match either an XML-style tag or a word (letters, digits, '_' and '-').
# Tags are matched only so that they can be skipped below.
aword = re.compile(r'<[^<>]*>|\b[\w-]+\b')

# word -> list of (filename, line number), one entry per occurrence
index = {}

# Generate an index in file indexFileName
def genIndex(indexFileName, extension):
    """ Index the tokens of every file matching *.<extension> in the
        current directory. Each word is mapped to the list of
        (filename, line number) pairs where it was found, and the
        result, minus the stop words, is stored in the shelve file
        indexFileName. """
    stop_list = ['from','to','that','by','with','on','the',
                 'a','and','these','of','or','for','can',
                 'it','is','this','in','an','you','your',
                 'yours','our','his','will','some','are','et',
                 'we','most','be','those','there','such','other',
                 'like']
    fname = '*.' + extension
    # print for testing purposes
    print("----------------------------------------------------")
    print(fname)
    for line in fileinput.input(glob.glob(fname)):
        # get the filename and location where the word was found (line number)
        location = fileinput.filename(), fileinput.filelineno()
        # find all words that are relevant
        for word in aword.findall(line.lower()):
            if word[0] != '<':  # skip XML tags
                # append the location to the word's entry in the dictionary
                index.setdefault(word, []).append(location)
    # open a shelve file and store the result of indexing
    shelf = shelve.open(indexFileName, 'n')
    count_words = 0
    for word in index:
        # eliminate all stop-list words
        if word not in stop_list:
            shelf[word] = index[word]
            count_words += 1  # number of distinct non-stop words stored
            # print for testing purposes
            print("%d %s\t%s" % (count_words, word, shelf[word]))
    print("total words %d" % count_words)
    print("---------------------------------------------")
    shelf.close()

#--------------------------------------------------------------
if __name__ == '__main__':
    import sys
    for arg in sys.argv[1:]:
        genIndex(arg, 'txt')
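
One possible way to add the weighting on top of this index is sketched below. It is not the poster's code: compute_weights is a hypothetical function, num_docs is assumed to be the number of files that were indexed, and IDF is assumed to be log(number of files / number of files containing the word). The sketch takes a plain dictionary in the same {word: [(filename, line number), ...]} layout that genIndex builds, and it assumes Python 3.

```python
import math
from collections import Counter

def compute_weights(index, num_docs):
    """Return {word: {filename: tf * idf}}.

    `index` maps each word to a list of (filename, line_number)
    tuples, one per occurrence. `num_docs` is assumed to be the
    number of files in the collection.
    """
    weights = {}
    for word, locations in index.items():
        # TF: occurrences of the word in each file
        tf = Counter(filename for filename, _lineno in locations)
        # IDF: len(tf) is the number of files containing the word
        idf = math.log(float(num_docs) / len(tf))
        weights[word] = {name: count * idf for name, count in tf.items()}
    return weights

# Hypothetical two-file collection.
index = {
    'speech': [('tmp.txt', 5), ('tmp.txt', 8)],
    'new': [('tmp.txt', 5), ('File-03.txt', 2)],
}
weights = compute_weights(index, num_docs=2)
print(weights['speech'])
print(weights['new'])
```

Here 'speech' occurs only in tmp.txt and gets a positive weight there, while 'new' occurs in both files, so its IDF and its weights are 0. The number of files could be taken from len(glob.glob(fname)) at indexing time, though where to store it is a design choice left open here.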