Okay, here's the problem: in Index

dont bother dontbotherworld at yahoo.com
Tue Mar 9 20:11:57 EST 2004


Hi Buddies,

Okay, here's my post again, and I have described the
problem very clearly:

I have these two pieces of code: dictionary.py and
vector.py.

1. dictionary.py: I take a spam message, strip off the
HTML and split it into words. The words are then stored
in a file, "dictionary_index".

Each word stored in the dictionary has the format
"14 : sexy"

that is,

number : word
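
To make the format concrete, here is a rough sketch of how lines of this
kind could be read back into a word -> number mapping. The function name
parse_dictionary_lines is made up for illustration, and it assumes every
non-empty line really looks like "number : word":

```python
def parse_dictionary_lines(lines):
    """Map each word to the number stored in front of it.

    Assumption: every non-empty line has the form "14 : sexy".
    """
    index_of = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split only on the first colon: "14 : sexy" -> "14 ", " sexy"
        number, word = line.split(':', 1)
        index_of[word.strip()] = int(number)
    return index_of
```

Called on the lines of "dictionary_index" (for example
parse_dictionary_lines(open('dictionary_index'))), this should give a
dictionary keyed by word, so that index_of['sexy'] is 14.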

2. I have another script, "vector.py". When a new
email message arrives, I strip off the HTML again and
compare the words in the payload with the words in the
dictionary (from the file "dictionary_index").
When I run vector.py against a new message,

$python vector.py email_message

I get output like this:

0 code 0.00224466891134
1 help 0.00224466891134
2 mby 0.00224466891134
3 zzzz 0.00448933782267
4 both 0.00224466891134
5 mbqpw 0.00112233445567
6 syntax 0.00224466891134
7 ratree 0.00224466891134
8 edt 0.003367003367

The numerical values are the number of occurrences of
the word divided by the total number of words in
the email message.
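
As a small illustration of that calculation, here is a sketch that works
on a plain list of words. The name relative_frequencies is only for this
example, and unlike vector.py it counts every word, not just the ones
found in the dictionary:

```python
def relative_frequencies(words):
    """Return {word: occurrences / total number of words}."""
    if not words:
        return {}  # avoid dividing by zero for an empty message
    total = float(len(words))
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # divide each count by the total number of words in the message
    return dict((w, c / total) for w, c in counts.items())
```

For the word list ['code', 'help', 'code', 'zzzz'] this gives 0.5 for
'code' and 0.25 for each of the other two words.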

Now, my problems are:

1. The numbers 0, 1, 2, 3, ... in the output of
vector.py are the positions of the words from the
email_message that I ran against dictionary_index.
I want the index of the corresponding word in the
dictionary instead. For example, if "syntax" occurs in
the dictionary at 500, I want
500 syntax <value> instead of the 6 syntax <value>
that I am getting right now.
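
What I am after could be sketched roughly like this, assuming there is
already a word -> dictionary index mapping and a word -> value mapping
for the message. The names index_vector, frequencies and index_of are
hypothetical, just for this example:

```python
def index_vector(frequencies, index_of):
    """Return sorted (dictionary_index, word, value) triples.

    frequencies: {word: value} for the words of one message.
    index_of:    {word: index of that word in the dictionary}.
    Words that are not in the dictionary are skipped.
    """
    rows = []
    for word, value in frequencies.items():
        if word in index_of:
            # use the dictionary's index, not a running counter
            rows.append((index_of[word], word, value))
    rows.sort()
    return rows
```

With index_of = {'syntax': 500}, a message containing "syntax" would then
produce a row starting with 500 rather than with a running position such
as 6.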

2. I want to write this output to a file in the
format:

1 index value index value index value index value index value

Note that:
a) I am writing the 1 myself to indicate that it is spam
in the feature vector.
I may have to write 0 in some places as well.
b) I don't write the word in this vector, only <index
value> pairs, where index is the position in the
dictionary of a word which occurs in the message.
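
To show the line layout I mean, here is a minimal sketch that builds one
such line from a label and a list of (index, value) pairs. The function
name format_feature_line is invented for this example, and sorting the
pairs by index is my own assumption rather than a requirement:

```python
def format_feature_line(label, pairs):
    """Build 'label index value index value ...' as one string.

    label: 1 for spam, 0 otherwise.
    pairs: iterable of (dictionary_index, value) tuples.
    """
    parts = [str(label)]
    for index, value in sorted(pairs):
        parts.append("%d %s" % (index, value))
    return " ".join(parts)
```

The resulting string could then be written with something like
out.write(line + '\n'), where out is an output file opened for
appending. The words themselves never appear in the line, only the
index and value pairs.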

Unfortunately, I don't know how to fix problems 1
and 2, and would really appreciate it if someone could
give me a hint.
I am attaching the pieces of code here:

Thanks
Dont





----------------------------------------------------
# Python code for creating a dictionary of words from a message: dictionary.py

import string, StringIO
import mailbox, email, re
import os
import sys
import email.Parser
import email.Message
import getopt


# input file
fp=open(sys.argv[1], 'r')

msg=email.message_from_file(fp)

msg=msg.get_payload()

dictpos={}
wordcount={}
# get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' ' + lower + 6*' ' + lower + 133*' ')

#words_in_body = fixed_body.split()

msg = fixed_body.split()


for i, w in enumerate(file('dictionary_index')):
	dictpos[w.strip()]=i
	#print i
	#print w

for w in msg:
	try:
		wordcount[w]+=1
		#print wordcount
	except KeyError:
		wordcount[w]=1
		#print wordcount

for w, c in wordcount.iteritems():
	try:
		print dictpos[w],':',c
	except KeyError:
		pass



#print wordcount
#print dictpos
#print '\n'


#-------------------------------------------------
#vector.py



import string, StringIO
import mailbox, email, re
import os
import sys
import re
import mailbox
import email.Parser
import email.Message
import getopt



#load up external dictionary:
words = open('dictionary_index', 'r').read().split()
dct = {}
for i in xrange(len(words)):
     dct[words[i]] = i

print dct.values()

#make vector:
vector = {}

fp=open(sys.argv[1], 'r')

msg=email.message_from_file(fp)

msg=msg.get_payload()

#a = float(len(fp))

#a = float(len(words_in_body))


# get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' ' + lower + 6*' ' + lower + 133*' ')

#words_in_body = fixed_body.split()

msg = fixed_body.split()

a = float(len(msg))
print a

for i in msg:
     if i in dct:
         try:
             vector[i] += 1
         except KeyError:
             # first time this word is seen in the message
             vector[i] = 1

for v,i in enumerate(vector):
    vector[i] /= a
    print v,i, vector[i]
    # if you want to see the word that was common too
    #print v, ":",vector[i]


    #print "\n"

#1.write(s)
#1.close()




