Help With EOF character and regular expression matching: URGENT
Eric @ Zomething
eric at zomething.com
Sun Feb 22 20:17:50 EST 2004
> Currently I have one problem and I dont know if there
> are any good ways to solve it in python:
>
> I want to create a dictionary of words out of the spam
> datasets and legitimate email datasets. While I can
> extract each and every word from the spam and
> legitimate emails it is not advisable to do so. I want
> to strip off the headers,
> like:
> To
> From
> Returned Path
> etc...
> and also the characters that are not ASCII and also
> the characters that are between <> so as to avoid HTML
> Tags.
> I have zero experience with regular expressions
> but if you or some one can give me an idea/snippet I
> think I can make it work.
> Also while I can write the words extracted to a file
> what are the advisable ways to associate them with the
> index? Also I want to avoid writing in the dictionary
> the same 2 words with different indexes?
> Any help is highly appreciated...
I think Python is actually great for this kind of task. One tool to help you with regular expressions is the "redemo.py" script. I think that came with the Python 2.3 distribution.
Run redemo, paste in your text sample, and you can test your regular expressions interactively.
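As a starting point for the patterns themselves, here is a rough sketch of the three clean-up steps you describe (drop the headers, drop anything between <>, drop non-ASCII characters). The names extract_words, TAG_RE, NON_ASCII_RE and WORD_RE are made up for illustration, the code is written for a current Python 3, and it assumes the headers end at the first blank line, which is the usual layout for a mail message. It is not a real HTML or MIME parser, so treat it as something to experiment with rather than a finished solution:

```python
import re

# Hypothetical helper names, chosen for this sketch only.
TAG_RE = re.compile(r"<[^>]*>")             # anything between < and >
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # any character outside ASCII
WORD_RE = re.compile(r"[A-Za-z]+")          # runs of plain letters


def extract_words(raw_message):
    """Return the lowercased words found in the body of one message."""
    # Assumption: headers (To, From, Return-Path, ...) end at the
    # first blank line, so everything after it is the body.
    head, sep, body = raw_message.partition("\n\n")
    if not sep:
        body = head  # no blank line found: treat everything as body
    body = TAG_RE.sub(" ", body)        # replace HTML-style tags with a space
    body = NON_ASCII_RE.sub(" ", body)  # replace non-ASCII characters too
    return [word.lower() for word in WORD_RE.findall(body)]
```

Note that removing a non-ASCII character from the middle of a word leaves fragments behind, so you may want to discard very short words afterwards.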
Also, maybe this will help with the duplicates: build a second list and append a word only if it is not already in it.
>>> text=[]
# ... words added to the list
# ... list sorted just to make the duplicates visually obvious here
>>> for each in text:
    print each

bigger
bigger
bigger
dam
damn
damn
girls
girls
girls
organ
organ
petite
sex
sex
sex
tasty
tea
tea
teen
teen
ten
>>> badList=[]
>>> for each in text:
    if each not in badList:
        badList.append(each)

>>> for each in badList:
    print each

bigger
dam
damn
girls
organ
petite
sex
tasty
tea
teen
ten
>>>
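On the question of associating each word with an index: one possible approach, sketched below, is to use a Python dict keyed by the word, so the same word can never end up with two different indexes. The name build_index is hypothetical, and numbering the words in order of first appearance is just one choice among several. The sketch is written for a current Python 3:

```python
def build_index(words):
    """Map each distinct word to an integer index.

    Indexes are handed out in order of first appearance; a word that
    has already been seen keeps the index it was given the first time.
    """
    index = {}
    for word in words:
        if word not in index:
            index[word] = len(index)  # next unused index
    return index
```

Looking a word up in a dict is also much faster than the "not in badList" test on a long list, which matters once the word list gets large.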