Stripping HTML tags from a string

Wed May 2 15:12:29 EDT 2001

"Colin Meeks" <colinmeeks at home.com> wrote in message
news:lrYH6.2444$2_.528918 at news3.rdc1.on.home.com...
> I know I've seen this somewhere before, but can't find it now I want it.
> Does anybody know how to strip all HTML tags from a string. I imagine I
> would use a regular expression, but am not fully up to speed on these yet.

You _can_ do it with regular expressions, but it's hard to get full
generality.  Standard module sgmllib is SO much easier to use...
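
To make the "hard to get full generality" point concrete, here is a minimal sketch of the obvious regular expression and one input that defeats it. The pattern, the helper name strip_tags_naive and the IMG sample are illustrative assumptions, not something from the question:

```python
import re

# Naive pattern: "<", then anything that is not ">", then ">".
TAG = re.compile(r'<[^>]*>')

def strip_tags_naive(text):
    # Replace every match of the naive pattern with a single space.
    return TAG.sub(' ', text)

# Works on simple markup ...
simple = strip_tags_naive('<P>Hello<BR>The End')

# ... but a ">" inside a quoted attribute value ends the match too early,
# so the rest of the tag leaks into the result as if it were text.
tricky = strip_tags_naive('<IMG ALT="a > b" SRC="x.gif">text')

print(repr(simple))
print(repr(tricky))
```

The second call leaves attribute debris in the output, which is the kind of corner case that keeps piling up once you go down the regular-expression road.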

> i.e "<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
> it<BR>The End"
> would give me "Hello This is really cool isn't it The End"
> I would like to replace all <P> and <BR> with a space as this would result
> in something that is more readable.

import sgmllib

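class Cleaner(sgmllib.SGMLParser):
    # Collects the document's text and drops every tag.
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []
    # sgmllib calls do_<tag> for each opening <tag>; here <P> and <BR>
    # each contribute a single space.
    def do_p(self, *junk):
        self.result.append(' ')
    def do_br(self, *junk):
        self.result.append(' ')
    # Text between tags arrives here, possibly in several pieces.
    def handle_data(self, data):
        self.result.append(data)
    def cleaned_text(self):
        return ''.join(self.result)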

if __name__ == '__main__':
    data = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
it<BR>The End"""
    parser = Cleaner()
    parser.feed(data)
    parser.close()
    print parser.cleaned_text()

Running this produces:

D:\ian>python kk.py
 Hello This is really cool isn't
it The End

D:\ian>

which isn't QUITE what you asked for, but then there are contradictions
between some aspects of your specs -- e.g. you specifically asked for
all <P> tags to be "replaced with a space", yet your example string
starts with a <P> but the desired result does NOT start with a space.
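
One possible way to reconcile the two, sketched under the assumption that what you really want is single spaces between words and nothing at either end, is to collapse whitespace after the tags are gone. The variable names raw and tidy are just for illustration; raw is the output shown above:

```python
# The text as produced by the parser: leading space, embedded newline.
raw = " Hello This is really cool isn't\nit The End"

# split() with no argument breaks on any run of whitespace and drops
# leading/trailing whitespace; joining with ' ' puts single spaces back.
tidy = ' '.join(raw.split())

print(tidy)
```

This turns the newline into a space as well, which matches the single-line result in your example.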

Anyway, I hope this is clear enough to let you resolve such contradictions
and get exactly the kind of processing that you DO really require!
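
A note for anyone on Python 3: sgmllib was removed from the standard library there, so the class above will not import. Below is a minimal sketch of the same idea on top of html.parser.HTMLParser, assuming Python 3. The names Cleaner and cleaned_text are carried over from the code above; the rest is an adaptation rather than the original code:

```python
from html.parser import HTMLParser

class Cleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.result = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in ('p', 'br'):
            self.result.append(' ')

    def handle_data(self, data):
        # Text between tags; may arrive in several pieces.
        self.result.append(data)

    def cleaned_text(self):
        return ''.join(self.result)

data = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
it<BR>The End"""
parser = Cleaner()
parser.feed(data)
parser.close()  # flushes any text still buffered
text = parser.cleaned_text()
print(repr(text))
```

The only structural difference is that html.parser has no per-tag do_p / do_br hooks, so the tag name is checked inside handle_starttag instead.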

Alex