Strip HTML tags from downloaded files
Brett Cannon
bac at OCF.Berkeley.EDU
Wed Dec 5 16:33:43 EST 2001
You can use a regular expression to get rid of them:
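import re
# HTML_text is assumed to hold the downloaded file's contents as a string.
# \s* also removes the whitespace (such as \n) after each tag; otherwise it is left in the string.
HTML_re = re.compile(r'<[^>]+>\s*')
non_HTML_text = HTML_re.sub('', HTML_text)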
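You can also use r'<.+?>\s*' for the regex, but note that . does not match a newline unless re.DOTALL is passed, so only the [^>] form catches a tag that is split across lines.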
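To see how the two patterns behave, here is a minimal sketch. The sample strings and the strip_tags helper are made up for illustration and are not part of the original question; they stand in for whatever urlretrieve saved. Treat it as a rough demonstration rather than a full HTML parser: a regex like this will also trip over a literal > inside an attribute value or a comment.

```python
import re

# Hypothetical sample standing in for the downloaded file's contents.
HTML_text = '<HTML>\n<BODY><PRE>\nhello world\n</PRE>\n</BODY></HTML>\n'

# [^>] matches any character except '>', newlines included.
negated_re = re.compile(r'<[^>]+>\s*')
# '.' skips newlines by default, so this form misses tags split across lines.
lazy_re = re.compile(r'<.+?>\s*')
# With re.DOTALL, '.' matches newlines as well.
lazy_dotall_re = re.compile(r'<.+?>\s*', re.DOTALL)


def strip_tags(text, pattern=negated_re):
    """Return text with every match of pattern removed (illustrative helper)."""
    return pattern.sub('', text)


# Hypothetical tag that is broken across two lines.
split_tag = '<A\nHREF="x">link</A>'

print(repr(strip_tags(HTML_text)))                  # 'hello world\n'
print(repr(strip_tags(split_tag, negated_re)))      # 'link'
print(repr(strip_tags(split_tag, lazy_re)))         # '<A\nHREF="x">link'
print(repr(strip_tags(split_tag, lazy_dotall_re)))  # 'link'
```

The text's own trailing newline survives because \s* only consumes whitespace that directly follows a tag.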
-Brett Cannon
On Wed, 5 Dec 2001, Thomas Pham wrote:
>
> When I use urlretrieve to download a file from the web, the raw text file has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
> <HEAD>
>
>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?
>
> Thanks,