Strip HTML tags from downloaded files

Steve Holden sholden at holdenweb.com
Wed Dec 5 22:29:45 EST 2001


"Thomas Pham" <tdpham at email.com> wrote ...
>
> When I use urlretrieve to download a file from the web, the raw text file
> has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
>  <HEAD>
>
>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?
>
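The easiest way is to use the DumbWriter class from the standard library's
formatter module, driven by htmllib.HTMLParser. DumbWriter writes the plain
text to standard output by default. Here's an example: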

"""Uses HTMLParser and DumbWriter to remove HTML tags."""

import htmllib

import sys, formatter, urllib

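url = sys.argv[1]  # URL to fetch, given on the command line

try:
    f = urllib.urlopen(url)
    data = f.read()
    f.close()

    # DumbWriter emits the text content to stdout; the tags are dropped.
    writer = formatter.DumbWriter()
    p = htmllib.HTMLParser(formatter.AbstractFormatter(writer))
    p.feed(data)
    p.close()
except IOError, msg:
    print url, ":", msg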
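
The script above relies on htmllib and formatter, which exist only in
Python 2. For readers on Python 3, the same idea can be sketched with
html.parser.HTMLParser. This is a rough equivalent rather than a drop-in
replacement: the names TagStripper and strip_tags are made up for
illustration, and skipping the contents of <script> and <style> is an
assumption about what "strip the tags" should mean.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content and discard tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip:
            self.chunks.append(data)


def strip_tags(html_text):
    """Return the text of html_text with all tags removed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    sample = "<HTML><BODY><PRE>\ndata line\n</PRE>\n</BODY></HTML>"
    print(strip_tags(sample).strip())
```

Unlike DumbWriter, this sketch does no line wrapping or layout; it simply
concatenates the text between tags, which is usually what you want when the
HTML is only a wrapper around a <PRE> block.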

regards
 Steve
--
http://www.holdenweb.com/








