Strip HTML tags from downloaded files

Brett Cannon bac at OCF.Berkeley.EDU
Wed Dec 5 16:33:43 EST 2001


You can use a regular expression to get rid of them:

import re

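# \s* also swallows the whitespace (such as \n) after each tag; otherwise it is left in the string.
HTML_re = re.compile(r'<[^>]+>\s*')
# HTML_text is the string holding the downloaded file's contents.
non_HTML_text = HTML_re.sub('', HTML_text)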
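
As a minimal runnable sketch of the same idea, assuming the downloaded text is already in a string (the helper name strip_tags and the sample content are made up for illustration, loosely shaped like the file in the question):

```python
import re

# One tag plus any whitespace that follows it.
HTML_re = re.compile(r'<[^>]+>\s*')

def strip_tags(html_text):
    """Return html_text with every <...> tag removed (hypothetical helper)."""
    return HTML_re.sub('', html_text)

# Made-up sample: tags at the beginning and end, plain text in between.
sample = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
    '<HTML>\n'
    ' <HEAD>\n'
    '</HEAD><BODY><PRE>\n'
    'line one\n'
    'line two\n'
    '</PRE>\n'
    '</BODY></HTML>\n'
)

print(repr(strip_tags(sample)))  # 'line one\nline two\n'
```

Note that the trailing \s* also removes any indentation that directly follows a tag, which may or may not be what you want.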


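You can also use r'<.+?>\s*' for the regex, but note that . does not match a newline unless re.DOTALL is passed, so a tag split across two lines would be left in.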
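
A quick sketch comparing the two patterns on a tag that spans two lines. The input string is made up for illustration, and the third pattern (with re.DOTALL) is an addition that is not in the suggestion above:

```python
import re

negated = re.compile(r'<[^>]+>\s*')              # [^>] matches newlines too
lazy = re.compile(r'<.+?>\s*')                   # . stops at a newline by default
lazy_dotall = re.compile(r'<.+?>\s*', re.DOTALL) # . now matches newlines as well

# Made-up input: the opening tag is split across two lines.
text = '<A\nHREF="x.html">link</A>'

print(repr(negated.sub('', text)))      # 'link'
print(repr(lazy.sub('', text)))         # '<A\nHREF="x.html">link'
print(repr(lazy_dotall.sub('', text)))  # 'link'
```

If the files may contain multi-line tags, the [^>] form is probably the simpler choice, since it needs no flag.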


-Brett Cannon



On Wed, 5 Dec 2001, Thomas Pham wrote:

>
> When I use urlretrieve to download a file from the web, the raw text file has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
>  <HEAD>
>
>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?
>
> Thanks,



