Strip HTML tags from downloaded files

Walter Dörwald walter at livinglogic.de
Thu Dec 6 08:02:52 EST 2001


Thomas Pham wrote:

> When I use urlretrieve to download a file from the web, the raw text file has HTML tags embedded at the beginning and the end of the file.
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
>  <HEAD>
> 
> 
> 
> </PRE>
> </BODY></HTML>
> 
> Is there any way to strip all the HTML tags from the file?


You could do this with XIST (http://www.livinglogic.de/Python/xist/).

The code might look like this. It parses the page, picks the first <pre> element, and returns its content as plain text:

from xist import parsers
from xist.ns import html

# Parse the page into an XIST tree
doc = parsers.parseTidyURL("http://www.freshmeat.net/",
                           defaultEncoding="latin-1")
# Take the first <pre> element found and convert it to plain text
string = doc.find(type=html.pre, searchchildren=1)[0].asPlainString()

print string
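
If XIST is not available, roughly the same thing can be sketched with only the standard library. The following is a minimal sketch, assuming Python 3 and its html.parser module; it is not part of XIST. The names PreTextExtractor, extract_pre_text, and sample are made up for illustration, and the sample markup is invented to resemble the page in the question:

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text inside <pre> elements and drop every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._depth = 0    # number of <pre> elements currently open
        self._chunks = []  # text fragments seen inside <pre>

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <PRE> arrives as "pre"
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Keep text only while inside a <pre> element
        if self._depth > 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def extract_pre_text(markup):
    """Return the text content of all <pre> elements in markup, without tags."""
    parser = PreTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Invented sample input, shaped like the downloaded file described in the question
sample = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
    "<HTML>\n <HEAD>\n </HEAD>\n<BODY><PRE>\n"
    "line one\nline &amp; two\n"
    "</PRE>\n</BODY></HTML>\n"
)
print(extract_pre_text(sample))
```

For a file already saved with urlretrieve, the same function could be applied to the file's contents after reading them into a string with a suitable encoding.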

HTH,
   Walter Dörwald






