Almost Done: Need some Help in Generating FEATURE VECTORS
Josiah Carlson
jcarlson at nospam.uci.edu
Fri Mar 5 18:22:56 EST 2004
#First, normalize the line breaks:
email_source = email_source.replace('\r\n', '\n').replace('\r', '\n')
#toss the headers:
pos = email_source.find('\n\n')
if pos != -1:
    email_body = email_source[pos:]
else:
    email_body = email_source
#clean out html:
#(use the method given at http://flangy.com/dev/python/striphtml.html)
#get rid of anything that isn't a letter, and make it all lowercase:
#(256-entry translation table: A-Z and a-z map to a-z, everything else to a space)
lower = ''.join(map(chr, range(97, 123)))
fixed_body = email_body.translate(65*' '+lower+6*' '+lower+133*' ')
words_in_body = fixed_body.split()
#load up external dictionary:
words = open('dictionary', 'r').read().split()
dct = {}
for i in xrange(len(words)):
    dct[words[i]] = i
#make vector:
vector = {}
a = float(len(words_in_body))
for i in words_in_body:
    if i in dct:
        try:
            vector[i] += 1
        except KeyError:
            vector[i] = 1
for i in vector:
    vector[i] /= a
I know the above doesn't fit with what you have, but you should be able
to adapt it.
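
As a starting point for adapting it, here is a rough Python 3 sketch of the same steps collected into functions. It rests on a few assumptions: the names load_dictionary, feature_vector and to_dense are made up for illustration, a regular expression stands in for the translation table, and HTML stripping is left out. to_dense is an extra step that is not in the code above; it uses the positions stored in dct to produce a fixed-length list, in case that is the form of feature vector you need.

```python
import re


def load_dictionary(words):
    """Map each dictionary word to its position in the word list."""
    return {word: index for index, word in enumerate(words)}


def feature_vector(email_source, dct):
    """Return {word: relative frequency} for dictionary words in the body."""
    # Normalize line breaks, then skip the headers (everything before the
    # first blank line), as in the steps above.
    text = email_source.replace('\r\n', '\n').replace('\r', '\n')
    pos = text.find('\n\n')
    body = text[pos:] if pos != -1 else text

    # Assumption: a regex replaces the translate table. Every run of ASCII
    # letters is a word; anything else acts as a separator.
    tokens = [t.lower() for t in re.findall(r'[A-Za-z]+', body)]
    total = float(len(tokens))

    counts = {}
    for token in tokens:
        if token in dct:
            counts[token] = counts.get(token, 0) + 1

    # Divide by the number of words in the body, not just dictionary hits.
    return {word: n / total for word, n in counts.items()}


def to_dense(vector, dct):
    """Hypothetical helper: place each frequency at its dictionary index."""
    dense = [0.0] * len(dct)
    for word, freq in vector.items():
        dense[dct[word]] = freq
    return dense
```

Usage would look something like vec = feature_vector(raw_email, load_dictionary(words)), followed by to_dense(vec, dct) if a list is wanted instead of a dict.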
- Josiah