Help With EOF character and regular expression matching: URGENT

Sun Feb 22 20:17:50 EST 2004

> Currently I have one problem and I dont know if there
> are any good ways to solve it in python:
> 
> I want to create a dictionary of words out of the spam
> datasets and legitimate email datasets. While I can
> extract each and every word from the spam and
> legitimate emails it is not advisable to do so. I want
> to strip off the headers,
> like:
> To
> From
> Returned Path
> etc...
> and also the characters that are not ASCII and also
> the characters that are between <> so as to avoid HTML
> Tags.
> I have zero experience with regular expressions
> but if you or some one can give me an idea/snippet I
> think I can make it work.
> Also while I can write the words extracted to a file
> what are the advisable ways to associate them with the
> index? Also I want to avoid writing in the dictionary
> the same 2 words with different indexes?
> Any help is highly appreciated...

I think Python is actually great for this kind of task. One tool to help you with regular expressions is the "redemo.py" script, which I think came with the Python 2.3 distribution.

Run redemo, paste in your text sample, and you can test your regular expressions interactively.
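
For the header, HTML tag and non-ASCII parts of your question, here is a rough sketch of the sort of patterns you could try out in redemo first. It is only one possible approach, and it makes some assumptions: the headers are taken to end at the first blank line, lines are assumed to end in "\n", and a "word" is assumed to be a run of the letters a-z. The sample message, the pattern names and extract_words are all made up for illustration.

```python
import re

# Hypothetical sample message, made up for illustration only.
SAMPLE = (
    "From: someone@example.com\n"
    "To: you@example.com\n"
    "Return-Path: <someone@example.com>\n"
    "Subject: hello\n"
    "\n"
    "<html><b>Bigger</b> deals \u00bb tasty tea!</html>\n"
)

TAG_RE = re.compile(r"<[^>]*>")              # anything between < and >
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")   # any character outside 7-bit ASCII
WORD_RE = re.compile(r"[a-z]+")              # runs of lowercase letters

def extract_words(message):
    # Assumption: the headers end at the first blank line.
    # If there is no blank line, treat the whole message as the body.
    parts = message.split("\n\n", 1)
    body = parts[1] if len(parts) == 2 else parts[0]
    body = TAG_RE.sub(" ", body)         # drop HTML-style tags
    body = NON_ASCII_RE.sub(" ", body)   # drop non-ASCII characters
    return WORD_RE.findall(body.lower())

words = extract_words(SAMPLE)
```

Splitting at the blank line drops every header at once, so you do not need a separate pattern for To, From and so on. Real mail can be messier than this (encoded bodies, attachments), so treat it as a starting point rather than a finished parser.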

Also, maybe this will help with duplicates:

>>> text=[]
# ... words added to the list
# ... list sorted just to make the duplicates visually obvious here
>>> for each in text:
	print(each)

bigger
bigger
bigger
dam
damn
damn
girls
girls
girls
organ
organ
petite
sex
sex
sex
tasty
tea
tea
teen
teen
ten

>>> badList=[]
>>> for each in text:
	# keep only the first occurrence of each word
	if each not in badList:
		badList.append(each)

>>> for each in badList:
	print(each)

bigger
dam
damn
girls
organ
petite
sex
tasty
tea
teen
ten
>>>
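
On associating each word with an index: one option, sketched below, is to use a dictionary instead of a list, so that the word is the key and the index is the value. A word can then only appear once, and checking "word not in index" does not have to scan the whole collection the way the list version does. The name build_index and the sample words are just for illustration, and numbering from 0 in order of first appearance is an assumption about what you want.

```python
def build_index(words):
    # Map each distinct word to the order in which it was first seen.
    index = {}
    for word in words:
        if word not in index:
            index[word] = len(index)   # next unused index: 0, 1, 2, ...
    return index

# Hypothetical word list with duplicates.
text = ["bigger", "bigger", "dam", "damn", "damn", "tea", "tea", "ten"]
index = build_index(text)

# One "index word" line per entry, in index order, ready to write to a file.
lines = ["%d %s" % (i, w) for w, i in sorted(index.items(), key=lambda item: item[1])]
```

After this, index["damn"] gives you the number for that word, and the strings in lines can be written out one per line.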