html 2 plain text

garabik-news-2005-05 at kassiopeia.juls.savba.sk
Mon May 29 02:44:16 EDT 2006


robin <robin.meier at gmail.com> wrote:
> hi,
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!

import re
text = re.sub(r'(?s)<.+?>', '', html_text)
(this will keep html entities such as &amp; undecoded, though)
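
If the entities matter for the later processing, the one-liner can be extended a little. What follows is only a rough sketch, assuming Python 3 (where html.unescape is available); the function name html_to_text is made up for illustration. It also drops <script> and <style> blocks, which a plain tag-stripping regex would leave in the output as text. Being regex-based, it will still trip over malformed markup or a ">" inside an attribute value.

```python
import re
from html import unescape


def html_to_text(html_text):
    """Crude HTML-to-text conversion; a sketch, not a real HTML parser."""
    # Drop <script> and <style> elements together with their contents,
    # since their bodies are code, not page text.
    html_text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', '', html_text)
    # Remove the remaining tags. A space is substituted (instead of '')
    # so that words separated only by tags do not run together.
    text = re.sub(r'(?s)<.+?>', ' ', html_text)
    # Decode entities such as &amp; or &eacute; into characters.
    text = unescape(text)
    # Collapse the leftover runs of whitespace into single spaces.
    return ' '.join(text.split())


if __name__ == '__main__':
    sample = '<html><body><p>Fish &amp; chips</p></body></html>'
    print(html_to_text(sample))
```

Substituting a space for each tag keeps "<p>one</p><p>two</p>" from turning into "onetwo", at the price of splitting a word that has inline markup in the middle of it; use '' instead if that matters more.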

-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


