how to get rid of html tags

Ian Bicking ianb at colorstudy.com
Thu Oct 3 00:33:27 EDT 2002


The easy answer:

import re

# (?s) lets . match newlines, so a tag split across lines is removed too
page = re.sub(r'(?s)<.*?>', '', page)
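
For illustration, here is the same substitution wrapped in a small function, as a rough sketch. The names strip_tags and TAG_RE and the sample string are made up for this example, and the re.DOTALL flag assumes you want tags that span several lines removed as well.

```python
import re

# Non-greedy match from '<' to the next '>'; DOTALL lets '.' cross newlines.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags(page):
    """Remove anything that looks like a tag (hypothetical helper)."""
    return TAG_RE.sub('', page)

sample = '<p>Hello, <a\nhref="x.html">world</a>!</p>'
print(strip_tags(sample))  # prints: Hello, world!
```

Note that this only drops the tags themselves: the contents of script or style elements stay in the output as ordinary text.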

There may be more Correct answers, though.  Some HTML has unescaped < and >
characters in its text, which browsers accept even though it's super annoying
to parse, and the regex above will treat them as tag delimiters.  I don't know
whether htmllib parses improper HTML either.
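
A parser-based version might look like the sketch below. It uses html.parser from the Python 3 standard library rather than htmllib (which no longer exists in Python 3), so treat it as a swapped-in technique. The class name TextOnly and the function strip_tags_parsed are hypothetical, and I have not checked how it behaves on every kind of broken markup.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text nodes and ignore the tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; into plain text
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.chunks.append(data)

def strip_tags_parsed(page):
    parser = TextOnly()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

print(strip_tags_parsed('<p>1 &lt; 2 and <b>bold</b></p>'))  # prints: 1 < 2 and bold
```

Unlike the regex, this also decodes character entities, which is usually what you want when you only care about the page content.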

On Wed, 2002-10-02 at 20:04, koko wrote:
> I am trying to retrieve a web page.
> But I only want to keep the content of the webpage without the html tags.
> How can I parse the webpage to get rid of the tags?
> 
> 
> -- 
> http://mail.python.org/mailman/listinfo/python-list





