Strip HTML tags from downloaded files
Walter Dörwald
walter at livinglogic.de
Thu Dec 6 08:02:52 EST 2001
Thomas Pham wrote:
> When I use urlretrieve to download a file from the web, the raw text file has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
> <HEAD>
>
>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?
You could do this with XIST (http://www.livinglogic.de/Python/xist/).
The code might look like this:
from xist import parsers
from xist.ns import html
doc = parsers.parseTidyURL("http://www.freshmeat.net/",
                           defaultEncoding="latin-1")
string = doc.find(type=html.pre, searchchildren=1)[0].asPlainString()
print string
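
If installing XIST is not an option, roughly the same idea can be sketched with the standard library alone. Note that this is not XIST's API: it is a minimal sketch using html.parser from Python 3, and the names PreTextExtractor, pre_text and sample are made up for illustration. It assumes, as in the quoted example, that the text you want sits inside a <PRE> element.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements, dropping all tags.

    Hypothetical helper for illustration; not part of XIST.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # nesting level of currently open <pre> elements
        self.chunks = []    # text pieces seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <PRE> matches too.
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep character data only while we are inside a <pre> element.
        if self.depth > 0:
            self.chunks.append(data)


def pre_text(html_source):
    """Return the concatenated text of all <pre> elements in html_source."""
    parser = PreTextExtractor()
    parser.feed(html_source)
    parser.close()
    return "".join(parser.chunks)


# Made-up sample shaped like the file described in the question.
sample = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
    "<HTML>\n<HEAD>\n</HEAD>\n<BODY><PRE>\n"
    "first line\nsecond &amp; last line\n"
    "</PRE>\n</BODY></HTML>\n"
)
print(pre_text(sample))
```

For a file fetched with urlretrieve, you would read the saved file into a string (with whatever encoding the server used) and pass that string to pre_text instead of sample.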
HTH,
Walter Dörwald