Help With EOF character and regular expression matching: URGENT

dont bother dontbotherworld at yahoo.com
Sun Feb 22 19:54:12 EST 2004


Hi Eric,
Thanks for your help.
But I have already fixed the problem myself by using a
combination of fseek and ftell. I am attaching the
code. I am a newbie, so I am still figuring things out
in Python.
Currently I have one problem, and I don't know if there
is a good way to solve it in Python:

I want to create a dictionary of words out of the spam
datasets and legitimate email datasets. While I can
extract each and every word from the spam and
legitimate emails, it is not advisable to do so. I want
to strip off the headers,
like:
To
From
Return-Path
etc...
and also the characters that are not ASCII, and also
the characters that are between <>, so as to avoid HTML
tags.
I have zero experience with regular expressions,
but if you or someone else can give me an idea/snippet I
think I can make it work.
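
For the clean-up described above, one possible starting point is sketched below. It is only a sketch under assumptions: the sample message, the names strip_headers and clean_body, and the two patterns are all made up for illustration, and it uses Python 3 syntax rather than the print statements seen elsewhere in this thread. It treats everything before the first blank line as headers, replaces anything between < and > with a space, and drops every non-ASCII character.

```python
import re

# Hypothetical sample message; the real datasets are not shown in this thread.
RAW_MESSAGE = (
    "Return-Path: <sender@example.com>\n"
    "From: sender@example.com\n"
    "To: someone@example.com\n"
    "Subject: hello\n"
    "\n"
    "<html><b>Buy</b> now caf\u00e9 offer!</html>\n"
)

TAG_RE = re.compile(r"<[^>]*>")             # anything between < and >
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # any character outside ASCII


def strip_headers(raw):
    """Drop everything up to the first blank line (the end of the headers)."""
    head, sep, body = raw.partition("\n\n")
    # If there is no blank line, assume there were no headers at all.
    return body if sep else raw


def clean_body(raw):
    """Return the message body without headers, tags, or non-ASCII characters."""
    body = strip_headers(raw)
    body = TAG_RE.sub(" ", body)       # a space keeps neighbouring words apart
    body = NON_ASCII_RE.sub("", body)
    return body


words = clean_body(RAW_MESSAGE).split()
print(words)
```

The tag pattern is crude: it also removes ordinary text that happens to sit between a < and a >, and the header split assumes plain "\n" line endings. The standard library's email package may be a more robust way to separate headers from the body.
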
Also, while I can write the extracted words to a file,
what is the advisable way to associate them with an
index? I also want to avoid writing the same word into
the dictionary twice with different indexes.
Any help is highly appreciated...
Thanks,
Sincerely
Dont


--- "Eric @ Zomething" <eric at zomething.com> wrote:
> dont bother wrote:
> 
> > 
> > Hi Buddies,
> > I am facing this problem and I don't know what to use
> > as EOF in Python:
> > I want to read a file, and put all the individual
> > words in a dictionary with their index:
> > For example if the file is:
> > 
> > Hello there I am doing fine
> > How are you?
> > 
> > So I want to make an index like this:
> > 
> > 1 Hello
> > 2 there
> > 3 I
> > 4 am
> > 5 doing
> > 6 fine
> > 7 How
> > 8 are
> > 9 you
> > 10 ?
> > 
> > In order to do this: I have written a small code which
> > is here:
> >
> > -------------------------------------------------------
> > # python code for creating dictionary of words from an
> > #input file
> > ------------------------------------------------------
> > 
> > import os
> > import sys
> > try:
> >         fread = open('training_data', 'r')
> > except IOError:
> >         print 'Cant open file for reading'
> >         sys.exit(0)
> > print 'Okay reading the file'
> > s=""
> > a=fread.read(1)
> > while (a!="\003"):
> > #while 1:
> >                 s=s+a
> > 
> 
> <snip>
> 
> Dont, I think you want to iterate over your file,
> rather than look for an EOF marker, unless you have
> a technical reason not to.
> 
> I'll leave it to someone else to provide a better
> solution, but here is a Q&D newbie approach to a
> text file break-out of words:
> 
> >>> fileOne=open('C:\\testfile.txt')
> >>> fileString=fileOne.read()
> >>> print fileString
> Hello there I am doing fine
> How are you?
> >>> wordString=fileString.replace('\n',' ')
> >>> print wordString
> Hello there I am doing fine How are you? 
> >>> wordList=wordString.split(' ')
> >>> print wordList
> ['Hello', 'there', 'I', 'am', 'doing', 'fine',
> 'How', 'are', 'you?', '']
> 
> HTH,
> 
> Eric
> 
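
As a follow-up to the suggestion to iterate over the file: Python has no EOF character to compare against. read() returns an empty string at the end of the file, and a for loop over a file object simply stops there. The sketch below assumes that looping approach and adds a regular expression so that "?" becomes its own entry, as in the numbered list requested in the original question. TOKEN_RE, tokens_from_lines, and sample_lines are hypothetical names, and the syntax is Python 3.

```python
import re

# A word is a run of letters, digits, or underscores; any other
# non-space character (such as "?") counts as its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokens_from_lines(lines):
    """Yield tokens from any iterable of lines, e.g. an open file."""
    for line in lines:
        for token in TOKEN_RE.findall(line):
            yield token


# Stand-in for the two-line example file quoted above. With a real file,
# an open file object can be passed in place of this list.
sample_lines = ["Hello there I am doing fine\n", "How are you?\n"]

# Number the tokens from 1, in the order they appear.
numbered = list(enumerate(tokens_from_lines(sample_lines), 1))
for number, token in numbered:
    print(number, token)
```

If punctuation does not need to be kept, calling split() with no arguments on each line is enough: it splits on any run of whitespace and does not produce the trailing empty string seen in the wordList output above.
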

-------------- next part --------------
A non-text attachment was scrubbed...
Name: features.py
Type: application/octet-stream
Size: 1065 bytes
Desc: features.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20040222/3caa724a/attachment.obj>

