Help With EOF character and regular expression matching: URGENT

Robert Brewer fumanchu at amor.org
Sun Feb 22 19:58:22 EST 2004


> I want to strip off the headers, like: To From
> Returned Path etc...

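Have a look at the 'email' module in the standard library. It parses a
raw message into headers and body, so you can keep the body and ignore
the headers.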
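
A minimal sketch of that idea, assuming the message is already in memory as a string. The sample message and the helper name body_text are made up for illustration; email.message_from_string, is_multipart, walk, get_content_type and get_payload are part of the email module.

```python
import email

# Hypothetical raw message, invented for illustration.
raw = (
    "Return-Path: <sender@example.com>\n"
    "From: sender@example.com\n"
    "To: list@example.com\n"
    "Subject: test\n"
    "\n"
    "Only this body text should be indexed.\n"
)

# Parse the raw text into a Message object (headers + payload).
msg = email.message_from_string(raw)


def body_text(message):
    """Return the message text without any headers (hypothetical helper)."""
    if message.is_multipart():
        # Keep only the plain-text parts of a multipart message.
        parts = [
            part.get_payload()
            for part in message.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts)
    return message.get_payload()


body = body_text(msg)
print(body)
```

Headers stay available by name if you need them later, for example msg["To"].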

> and also the characters that are not ASCII and also
> the characters that are between <> so as to avoid HTML
> Tags.
> I have zero experience with regular expressions
> but if you or some one can give me an idea/snippet I
> think I can make it work.

import re

# Remove anything between '<' and '>' (a rough way to drop HTML tags).
text = "look ma, <b>no</b> html!"
cleaned = re.sub(r'<[^>]*>', '', text)
print(cleaned)  # look ma, no html!
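
The non-ASCII part of the question can probably be handled the same way. This is only a sketch, assuming "not ASCII" means any character outside the range 0-127 and that such characters should simply be dropped; the sample string is invented.

```python
import re

# Hypothetical sample containing HTML tags and two non-ASCII letters.
text = "caf\u00e9 <i>na\u00efve</i> text"

# First drop the tags, as in the snippet above.
no_tags = re.sub(r'<[^>]*>', '', text)

# Then drop every character outside the 7-bit ASCII range.
ascii_only = re.sub(r'[^\x00-\x7f]', '', no_tags)

print(ascii_only)  # caf nave text
```

Note that this deletes accented letters rather than replacing them, so words may come out shortened.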

> Also while I can write the words extracted to a file
> what are the advisable ways to associate them with the
> index? Also I want to avoid writing in the dictionary
> the same 2 words with different indexes?

Look at the 'sets' module in the standard library. A set stores each
item only once, so duplicate words are dropped automatically.
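
A rough sketch of one way to do it, assuming "index" means a number assigned to each distinct word in order of first appearance. The function name build_index and the word list are made up for illustration. It uses the built-in set type, which took over from the 'sets' module in later Python versions.

```python
def build_index(words):
    """Map each distinct word to one index (hypothetical helper)."""
    index = {}
    for word in words:
        # A word gets an index only the first time it is seen,
        # so it can never end up with two different indexes.
        if word not in index:
            index[word] = len(index)
    return index


# Hypothetical extracted words, with repeats.
words = ["spam", "eggs", "spam", "ham", "eggs"]

index = build_index(words)
unique = set(words)  # the distinct words, without any ordering

print(index)
```

If you only need the distinct words and not their positions, the set alone is enough; the dictionary is there for the word-to-index association.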




