Okay, here's the problem: in Index
dont bother
dontbotherworld at yahoo.com
Tue Mar 9 20:11:57 EST 2004
Hi Buddies,
Okay, here's my post again, and I have tried to describe it very
clearly:
I have two pieces of code, dictionary.py and
vector.py.
1. dictionary.py: I take a spam message, strip off the
HTML and split it into words. The words are then stored in a
file, "dictionary_index".
The words stored in the dictionary have the format
"14 : sexy"
that is,
number : word
2. I have another script, "vector.py". When a new
email message arrives, I strip off the HTML again and
compare the words in the payload with the words in the
dictionary (from the file "dictionary_index").
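A minimal sketch of how a file in the "number : word" format described above could be read into a word-to-index lookup. It is written for Python 3 rather than the Python 2 of the attached scripts, and the function name load_dictionary and the sample entries are made up for illustration.

```python
import io


def load_dictionary(lines):
    """Map each word to its dictionary index from 'number : word' lines."""
    index_of = {}
    for line in lines:
        number, sep, word = line.partition(":")
        if not sep:
            continue  # skip lines that are not in 'number : word' form
        number = number.strip()
        word = word.strip()
        if word and number.isdigit():
            index_of[word] = int(number)
    return index_of


# Hypothetical sample standing in for the real dictionary_index file.
sample = io.StringIO("14 : sexy\n500 : syntax\n501 : help\n")
index_of = load_dictionary(sample)
print(index_of["syntax"])  # 500
```

With the real file, the same function could be called as load_dictionary(open("dictionary_index")), assuming every line follows that format.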
When I run vector.py against a new message,
$ python vector.py email_message
I get output like this:
0 code 0.00224466891134
1 help 0.00224466891134
2 mby 0.00224466891134
3 zzzz 0.00448933782267
4 both 0.00224466891134
5 mbqpw 0.00112233445567
6 syntax 0.00224466891134
7 ratree 0.00224466891134
8 edt 0.003367003367
The numerical values are the number of occurrences of
the word divided by the total number of words in
the email message.
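The calculation described above (occurrences of a word divided by the total number of words in the message) can be sketched as follows. This is Python 3, and the name relative_frequencies and the sample words are assumptions for illustration only.

```python
from collections import Counter


def relative_frequencies(words):
    """Return {word: occurrences / total number of words}."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical four-word message.
words = "help code help zzzz".split()
freqs = relative_frequencies(words)
print(freqs["help"])  # 0.5 (2 occurrences out of 4 words)
```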
Now, my problems are:
1. The numbers 0, 1, 2, 3 in the output of
vector.py are the positions of the corresponding
words in the email_message that I ran against the
dictionary_index. I want the index of the
corresponding word in the dictionary instead. For example, if
"syntax" occurs in the dictionary at 500, I want
500 syntax <value> instead of the 6 syntax <value>
that I am getting right now.
2. I want to write this output to a file in the
format:
1 index value index value index value index value index value
Note that:
a) I am writing the 1 myself to indicate that it's spam in the
feature vector.
I may also have to write 0 in some places.
b) I don't write the word in this vector, only <index
value>, where index is the position in the dictionary of
the word that occurs in the
message.
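One possible way to get the format described in points 1 and 2 is to look each word up in a word-to-index mapping, instead of using the counter that enumerate produces, and to join the index/value pairs onto one line behind the label. The sketch below is Python 3; the names feature_line, index_of and freqs are hypothetical, and sorting the pairs by index is an assumption that the post does not state.

```python
def feature_line(label, freqs, index_of):
    """Build 'label index value index value ...' for words in the dictionary."""
    pairs = sorted(
        (index_of[word], value)
        for word, value in freqs.items()
        if word in index_of  # words missing from the dictionary are skipped
    )
    fields = [str(label)]
    for index, value in pairs:
        fields.append(str(index))
        fields.append(repr(value))
    return " ".join(fields)


# Hypothetical dictionary indexes and word frequencies.
index_of = {"syntax": 500, "help": 14}
freqs = {"syntax": 0.25, "help": 0.5, "unknown": 0.25}
line = feature_line(1, freqs, index_of)
print(line)  # 1 14 0.5 500 0.25
```

The resulting string can then be written with out.write(line + "\n") on a file opened for appending, with the label 1 or 0 chosen by the caller.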
Unfortunately, I don't know how to fix problems 1
and 2, and I would really appreciate it if someone could give
me a hint.
I am attaching the pieces of code here:
Thanks
Dont
----------------------------------------------------
# Python code for creating a dictionary of words from a message: dictionary.py
# Input file: the message named on the command line (sys.argv[1])
import string, StringIO
import mailbox, email, re
import os
import sys
import email.Parser
import email.Message
import getopt

fp=open(sys.argv[1], 'r')
msg=email.message_from_file(fp)
msg=msg.get_payload()
dictpos={}
wordcount={}
#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' '+lower+6*' '+lower+133*' ')
#words_in_body = fixed_body.split()
msg = fixed_body.split()
for i, w in enumerate(file('dictionary_index')):
    dictpos[w.strip()]=i
    #print i
    #print w
for w in msg:
    try:
        wordcount[w]+=1
        #print wordcount
    except KeyError:
        wordcount[w]=1
        #print wordcount
for w, c in wordcount.iteritems():
    try:
        print dictpos[w],':',c
    except KeyError:
        pass
#print wordcount
#print dictpos
#print '\n'
#-------------------------------------------------
#vector.py
import string, StringIO
import mailbox, email, re
import os
import sys
import email.Parser
import email.Message
import getopt

#load up external dictionary:
words = open('dictionary_index', 'r').read().split()
dct = {}
for i in xrange(len(words)):
    dct[words[i]] = i
print dct.values()
#make vector:
vector = {}
fp=open(sys.argv[1], 'r')
msg=email.message_from_file(fp)
msg=msg.get_payload()
#a = float(len(fp))
#a = float(len(words_in_body))
#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' '+lower+6*' '+lower+133*' ')
#words_in_body = fixed_body.split()
msg = fixed_body.split()
a = float(len(msg))
print a
for i in msg:
    if i in dct:
        try:
            vector[i] += 1
        except:
            vector[i] = 1
for v,i in enumerate(vector):
    vector[i] /= a
    print v,i, vector[i]
    #; if you want to see the word that was common too
    #print v, ":",vector[i]
    #print "\n"
#1.write(s)
#1.close()
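The 256-character translation table that both scripts build (65 spaces, a-z, 6 spaces, a-z, 133 spaces) lowercases A-Z and turns every other non-letter character into a space before splitting. A rough Python 3 equivalent is sketched here using a regular expression rather than a translation table; the name tokenize and the sample text are illustrative assumptions.

```python
import re


def tokenize(text):
    """Split text into lowercase words made only of the ASCII letters a-z."""
    # Every character that is not an ASCII letter acts as a separator,
    # which mirrors mapping non-letters to spaces and then splitting.
    return [word.lower() for word in re.findall(r"[A-Za-z]+", text)]


print(tokenize("Hello, World! <b>FREE</b>"))
# ['hello', 'world', 'b', 'free', 'b']
```

As the sample shows, HTML tag names survive as words under this scheme, just as they would with the translation table.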
More information about the Python-list
mailing list