Almost Done: Need some Help in Generating FEATURE VECTORS

Fri Mar 5 22:11:39 EST 2004

Hi Josiah and others,
Thanks a ton.
With your help I was able to work out most of my program.
However, I am stuck on one small thing now:

Problem 1 (dictionary): I want to associate each word in the
dictionary with an index. For example,
right now I have:

viagra
play

and so on.
I want to number each word in the dictionary
like this:

1 viagra
2 play

I tried using enumerate, but it did not work.
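
A minimal sketch of one way to do the numbering, assuming the dictionary has already been read into a list of words. The two-word list is only a placeholder for the contents of the 'dictionary' file. enumerate() counts from 0, so 1 is added to get the numbering shown above.

```python
# Placeholder word list; in the real program this would come from
# open('dictionary', 'r').read().split()
words = ['viagra', 'play']

# Map each word to a 1-based index. enumerate() starts at 0, so add 1.
dct = {}
for i, word in enumerate(words):
    dct[word] = i + 1

# Build the numbered listing, one "index word" pair per line.
lines = []
for i, word in enumerate(words):
    lines.append('%d %s' % (i + 1, word))
print('\n'.join(lines))
```

Keeping the mapping as word -> index (rather than index -> word) means a word from a message can be looked up directly with dct[word].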

Problem 2: I want to create feature vectors of the
type

[1 1:<value> 10:<value> ...18:<value>]

I am able to compute the right value, but I still have to
associate it with the word's index in the dictionary.

I would like some help with building this feature vector:
specifically, adding the '[' and ']', inserting a leading 1
or 0 that comes from user input, and writing each
index:value pair.
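
As a rough sketch of the formatting step only: the function below takes a label and a dict that maps dictionary index to value, and joins them into the bracketed form above. The name format_vector, the sample numbers, and the idea that the 1 or 0 arrives as a separate argument are all assumptions made for illustration.

```python
def format_vector(label, features):
    """Return '[label index:value ...]' with indices in ascending order."""
    pairs = ['%d:%s' % (index, features[index]) for index in sorted(features)]
    return '[' + ' '.join([str(label)] + pairs) + ']'

# Hypothetical data: dictionary index -> computed value
features = {10: 0.25, 1: 0.5, 18: 0.125}
# The 1 or 0 supplied by the user, for example read from the command line
label = 1

line = format_vector(label, features)
print(line)
```

Sorting the indices keeps the pairs in ascending order, which matches the 1, 10, ... 18 ordering in the example line.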

The loop that checks each word against the dictionary is
part of the program body below:

import os
import sys
import re
import mailbox
import email.Parser
import email.Message
import getopt

#load up external dictionary:
words = open('dictionary', 'r').read().split()
dct = {}
for i in xrange(len(words)):
     dct[words[i]] = i

#make vector:
vector = {}

fp=open(sys.argv[1], 'r')

msg=email.message_from_file(fp)

msg=msg.get_payload()

#a = float(len(fp))

#a = float(len(words_in_body))

#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' '+lower+6*' '+lower+133*' ')

#words_in_body = fixed_body.split()

msg = fixed_body.split()

a = float(len(msg))
print a

for i in msg:
     if i in dct:
         try:
             vector[i] += 1
         except KeyError:
             vector[i] = 1

for i in vector:
    vector[i] /= a
    print i, vector[i]
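
The loop above keys the vector by word. A possible variation, sketched here under the assumption that dct maps each word to its index, is to key the vector by that index instead, so that each entry is already an index/value pair. The function name index_frequencies and the sample data are made up for the example.

```python
def index_frequencies(tokens, dct):
    """Relative frequency of each dictionary word, keyed by its index."""
    vector = {}
    if not tokens:
        return vector  # avoid dividing by zero for an empty message
    total = float(len(tokens))
    for token in tokens:
        if token in dct:
            index = dct[token]
            # dict.get supplies 0 the first time an index is seen
            vector[index] = vector.get(index, 0) + 1
    for index in vector:
        vector[index] /= total
    return vector

# Hypothetical dictionary (word -> index) and message words
dct = {'viagra': 1, 'play': 2}
tokens = ['buy', 'viagra', 'now', 'viagra']
vector = index_frequencies(tokens, dct)
print(vector)
```

Using vector.get(index, 0) replaces the try/except around the increment, and the early return covers a message with no words.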

--- Josiah Carlson <jcarlson at nospam.uci.edu> wrote:
> #First, normalize the line breaks:
> email_source = email_source.replace('\r\n', '\n').replace('\r', '\n')
> 
> #toss the headers:
> pos = email_source.find('\n\n')
> if pos != -1:
>      email_body = email_source[pos:]
> else:
>      email_body = email_source
> 
> #clean out html:
> #(use the method given at http://flangy.com/dev/python/striphtml.html )
> 
> #get rid of anything that isn't a letter, and make it all lowercase:
> lower = ''.join(map(chr, range(97, 123)))
> fixed_body = email_body.translate(65*' '+lower+6*' '+lower+133*' ')
> 
> words_in_body = fixed_body.split()
> 
> #load up external dictionary:
> words = open('dictionary', 'r').read().split()
> dct = {}
> for i in xrange(len(words)):
>      dct[words[i]] = i
> 
> #make vector:
> vector = {}
> a = float(len(words_in_body))
> for i in words_in_body:
>      if i in dct:
>          try:
>              vector[i] += 1
>          except:
>              vector[i] = 1
> 
> for i in vector:
>      vector[i] /= a
> 
> 
> 
> I know the above doesn't fit with what you have, but
> you should be able 
> to adapt it.
> 
>   - Josiah
