Strip HTML tags from downloaded files
Steve Holden
sholden at holdenweb.com
Wed Dec 5 22:29:45 EST 2001
"Thomas Pham" <tdpham at email.com> wrote ...
>
> When I use urlretrieve to download a file from the web, the raw text file
> has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
> <HEAD>
>
>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?
>
The easiest way is to use the DumbWriter class from the standard library's
formatter module, driven by htmllib.HTMLParser. Here's an example:
"""Uses HTMLParser and DumbWriter to remove HTML tags."""
import htmllib
import sys, formatter, urllib

# The URL to fetch is taken from the command line.
url = sys.argv[1]
try:
    f = urllib.urlopen(url)
    data = f.read()
    f.close()
    # DumbWriter writes the plain text to standard output by default.
    p = htmllib.HTMLParser(
        formatter.AbstractFormatter(formatter.DumbWriter()))
    p.feed(data)
    p.close()
except IOError, msg:
    print url, ":", msg
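
The script above is Python 2 only, since htmllib does not exist in Python 3.
As a rough sketch of the same idea for Python 3, assuming the standard
html.parser module is an acceptable substitute, something like the following
should work. The names TagStripper and strip_tags are made up for this
illustration, and unlike DumbWriter it does no line wrapping or layout. It
keeps every run of text it sees, including the contents of any script or
style elements.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text found between tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text that is not part of a tag.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_tags(html):
    """Return the given HTML string with its tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = "<HTML><BODY><P>Hello, <B>world</B>!</P></BODY></HTML>"
    print(strip_tags(sample))  # Hello, world!
```

To use it on a downloaded page, read the response into a string (for
example with urllib.request.urlopen, decoding the bytes first) and pass that
string to strip_tags.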
regards
Steve
--
http://www.holdenweb.com/