Almost Done: Need some Help in Generating FEATURE VECTORS
dont bother
dontbotherworld at yahoo.com
Fri Mar 5 22:11:39 EST 2004
Hi Josiah and Others,
Thanks a ton.
With your help I was able to work out most of what I needed.
However, I am now stuck on a couple of small things:
Problem 1 (dictionary): I want to associate each word in the
dictionary with an index. For example, right now I have:
viagra
play
etc.
and I want each word to be numbered like this:
1 viagra
2 play
I tried using enumerate, but it did not work.
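A minimal sketch of the numbering described above, assuming the indices should start at 1 and that a repeated word keeps the first index it was given. The helper name index_words is hypothetical. It is written for current Python 3; in the Python of this thread, enumerate accepts no start argument, so i + 1 would be needed instead.

```python
def index_words(words, start=1):
    """Map each word to its position in the list, numbering from `start`."""
    index = {}
    for i, word in enumerate(words, start):
        # setdefault keeps the first index if a word appears twice
        index.setdefault(word, i)
    return index


words = ["viagra", "play"]
dct = index_words(words)
for word in words:
    print(dct[word], word)
```

Looking a word up is then just dct[word], which gives the number to use on the left side of an index:value pair.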
Problem 2: I want to create feature vectors of the
form
[1 1:<value> 10:<value> ...18:<value>]
I am able to compute the right value, but I still have to
associate it with the word's index in the dictionary.
I would like some help building this feature vector,
specifically adding the '[' and ']', inserting a leading 1
or 0 taken from user input, and writing each index:value
pair.
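One possible way to build that string, as a sketch only. It assumes the values are already held in a dict keyed by dictionary index, that the pairs should appear in ascending index order, and that the label is a 1 or 0 obtained elsewhere (for example from user input). The name format_vector is hypothetical.

```python
def format_vector(label, values):
    """Return '[label idx:value idx:value ...]' with indices ascending.

    label  -- 1 or 0, supplied by the caller
    values -- dict mapping dictionary index -> computed value
    """
    pairs = ["%d:%s" % (idx, values[idx]) for idx in sorted(values)]
    return "[" + " ".join([str(label)] + pairs) + "]"


print(format_vector(1, {10: 0.25, 1: 0.5, 18: 0.125}))
# prints: [1 1:0.5 10:0.25 18:0.125]
```

Joining the pieces with a single space keeps the brackets tight against the first and last fields, and an empty dict simply yields "[1]" or "[0]".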
The loop that checks each word against the dictionary is
part of the program body below:
import os
import sys
import re
import mailbox
import email.Parser
import email.Message
import getopt

#load up external dictionary:
words = open('dictionary', 'r').read().split()
dct = {}
for i in xrange(len(words)):
    dct[words[i]] = i

#make vector:
vector = {}
fp = open(sys.argv[1], 'r')
msg = email.message_from_file(fp)
msg = msg.get_payload()
#a = float(len(fp))
#a = float(len(words_in_body))

#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' '+lower+6*' '+lower+133*' ')
#words_in_body = fixed_body.split()
msg = fixed_body.split()
a = float(len(msg))
print a

#count only the words that appear in the dictionary:
for i in msg:
    if i in dct:
        try:
            vector[i] += 1
        except KeyError:
            vector[i] = 1

#turn the counts into relative frequencies:
for i in vector:
    vector[i] /= a
    print i, vector[i]
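The program above keys vector by the word itself, so the dictionary index never reaches the output. A hedged sketch of one way to key the frequencies by index instead, assuming dct maps each word to its index and tokens is the already cleaned, split message body. The name feature_values is hypothetical, and the code targets current Python 3 rather than the Python 2 used in this thread.

```python
def feature_values(tokens, dct):
    """Relative frequency of each dictionary word, keyed by its index."""
    total = float(len(tokens))
    counts = {}
    for tok in tokens:
        if tok in dct:
            idx = dct[tok]
            counts[idx] = counts.get(idx, 0) + 1
    # counts is empty when tokens is empty, so no division by zero occurs
    return {idx: n / total for idx, n in counts.items()}


dct = {"viagra": 1, "play": 2}
tokens = ["buy", "viagra", "play", "play"]
print(feature_values(tokens, dct))
```

The resulting dict of index to value is the shape needed for writing out index:value pairs.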
--- Josiah Carlson <jcarlson at nospam.uci.edu> wrote:
> #First, normalize the line breaks:
> email_source = email_source.replace('\r\n', '\n').replace('\r', '\n')
>
> #toss the headers:
> pos = email_source.find('\n\n')
> if pos != -1:
>     email_body = email_source[pos:]
> else:
>     email_body = email_source
>
> #clean out html:
> (use the method given at http://flangy.com/dev/python/striphtml.html )
>
> #get rid of anything that isn't a letter, and make it all lowercase:
> lower = ''.join(map(chr, range(97, 123)))
> fixed_body = email_body.translate(65*' '+lower+6*' '+lower+133*' ')
>
> words_in_body = fixed_body.split()
>
> #load up external dictionary:
> words = open('dictionary', 'r').read().split()
> dct = {}
> for i in xrange(len(words)):
>     dct[words[i]] = i
>
> #make vector:
> vector = {}
> a = float(len(words_in_body))
> for i in words_in_body:
>     if i in dct:
>         try:
>             vector[i] += 1
>         except:
>             vector[i] = 1
>
> for i in vector:
>     vector[i] /= a
>
>
>
> I know the above doesn't fit with what you have, but you should be able
> to adapt it.
>
> - Josiah