Stripping HTML tags from a string
Alex Martelli
aleaxit at yahoo.com
Wed May 2 15:12:29 EDT 2001
"Colin Meeks" <colinmeeks at home.com> wrote in message
news:lrYH6.2444$2_.528918 at news3.rdc1.on.home.com...
> I know I've seen this somewhere before, but can't find it now I want it.
> Does anybody know how to strip all HTML tags from a string. I imagine I
> would use a regular expression, but am not fully up to speed on these yet.
You _can_ do it with regular expressions, but it's hard to get full
generality. Standard module sgmllib is SO much easier to use...
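
To illustrate why full generality is hard with regular expressions, here is a minimal sketch. It assumes Python 3, and the names TAG_RE and strip_tags_naive are hypothetical, chosen only for this example. The pattern copes with simple markup but goes wrong as soon as a '>' appears inside a quoted attribute value.

```python
import re

# Naive pattern: anything between '<' and the next '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags_naive(text):
    """Replace every apparent tag with a space (hypothetical helper)."""
    return TAG_RE.sub(" ", text)

# Simple markup is handled as expected.
print(repr(strip_tags_naive('<P>Hello<BR>The End')))

# A '>' inside a quoted attribute value ends the match too early,
# so part of the tag leaks into the output.
print(repr(strip_tags_naive('<IMG ALT="a > b" SRC="x.gif">text')))
```

The second call leaves the tail of the IMG tag in the result. A real parser avoids this kind of failure.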
> i.e "<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
> it<BR>The End"
> would give me "Hello This is really cool isn't it The End"
> I would like to replace all <P> and <BR> with a space as this would result
> in something that is more readable.
import sgmllib

class Cleaner(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []    # text fragments collected so far
    # sgmllib calls do_<tag> for each opening <tag> it meets
    def do_p(self, *junk):
        self.result.append(' ')
    def do_br(self, *junk):
        self.result.append(' ')
    # called for every run of text between tags
    def handle_data(self, data):
        self.result.append(data)
    def cleaned_text(self):
        return ''.join(self.result)

if __name__ == '__main__':
    data = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
it<BR>The End"""
    parser = Cleaner()
    parser.feed(data)
    parser.close()    # flush any text still buffered in the parser
    print parser.cleaned_text()
Running this produces:
D:\ian>python kk.py
Hello This is really cool isn't
it The End
D:\ian>
which isn't QUITE what you asked for, but then there are contradictions
between some aspects of your specs -- e.g. you specifically asked for
all <P> tags to be "replaced with a space", yet your example string
starts with a <P> but the desired result does NOT start with a space.
Anyway, I hope this is clear enough to let you solve such contradictions
and get exactly the kind of processing that you DO really require!
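
One possible way to resolve that contradiction is sketched below, with some assumptions stated up front. It targets Python 3, where sgmllib is no longer available, so it uses html.parser.HTMLParser in its place. The helper strip_tags is a hypothetical name, and the whitespace rule (collapse every run of spaces and newlines to one space, drop leading and trailing whitespace) is a guess at what the requested output implies, not part of the original specs.

```python
from html.parser import HTMLParser

class Cleaner(HTMLParser):
    """Collect text content, turning <P> and <BR> into spaces."""

    def __init__(self):
        super().__init__()
        self.result = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in ("p", "br"):
            self.result.append(" ")

    def handle_data(self, data):
        self.result.append(data)

    def cleaned_text(self):
        # split() with no arguments drops leading/trailing whitespace and
        # treats any run of spaces or newlines as a single separator.
        return " ".join("".join(self.result).split())

def strip_tags(text):
    """Hypothetical convenience wrapper around Cleaner."""
    parser = Cleaner()
    parser.feed(text)
    parser.close()  # flush any text still buffered in the parser
    return parser.cleaned_text()

data = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
it<BR>The End"""
print(strip_tags(data))
```

With this whitespace rule the leading <P> no longer produces a leading space, and the line break inside the text becomes an ordinary space.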
Alex