[Email-SIG] A suggestion: HTML stripping
Matthew Dixon Cowles
matt at mondoinfo.com
Fri Nov 21 15:18:55 EST 2003
> I had a suggestion from a happy email package user that I thought
> might be interesting to consider. He was using email as a
> replacement for the Perl demime thingie. He was generally happy
> about what email allowed him to do, except for one thing. He was
> using a DecodedGenerator but wanted to strip text/html parts of its
> tags, leaving just plain text.
>
> In Mailman, I actually call out to something like lynx to render
> text/html into plain text, but I think he wanted something simpler.
> He just wanted to rip out all the tags, and ended up using an
> HTMLParser class to do this.

It's a sad state we've come to when we have to turn mail back into
text <wink>.

I have code that's just like what you describe. In fact it's a
slightly-twiddled version of some code that Alex Martelli posted to
comp.lang.python a while back. It's sufficiently trivial that I may
as well just paste it here in case it's of use to anyone:
# Very slightly modified from Alex Martelli's news post
# <9cpm4202cv1 at news1.newsguy.com> of May 2, 2001,
# Subject: Stripping HTML tags from a string
# Thanks, Alex

import sgmllib

class Cleaner(sgmllib.SGMLParser):
    entitydefs={"nbsp": " "} # I'll break if I want to

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []

    # <p> and <br> each become a newline in the output
    def do_p(self, *junk):
        self.result.append('\n')

    def do_br(self, *junk):
        self.result.append('\n')

    # Text between tags is kept as-is
    def handle_data(self, data):
        self.result.append(data)

    def cleaned_text(self):
        return ''.join(self.result)

def stripHTML(text):
    c=Cleaner()
    try:
        c.feed(text)
    except sgmllib.SGMLParseError:
        raise ValueError("Unable to parse HTML")
    else:
        t=c.cleaned_text()
        return t
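
For anyone reading this on Python 3, where sgmllib is no longer in the
standard library, roughly the same thing can be sketched with
html.parser.HTMLParser. This is only a sketch under a few assumptions:
TagStripper, strip_html and html_parts_as_text are names made up for
illustration, and falling back to UTF-8 when a text/html part declares
no charset is a guess rather than anything the email package mandates.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, turning <p> and <br> into newlines."""

    def __init__(self):
        # convert_charrefs=True means entities such as &amp; and &nbsp;
        # arrive in handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.result = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "br"):
            self.result.append("\n")

    def handle_data(self, data):
        self.result.append(data)

    def cleaned_text(self):
        # Map the decoded non-breaking space to a plain space, as the
        # entitydefs override does in the sgmllib version.
        return "".join(self.result).replace("\xa0", " ")


def strip_html(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.cleaned_text()


def html_parts_as_text(msg):
    """Yield the stripped text of each text/html part of an email message."""
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # bytes
            # Assumption: fall back to UTF-8 if no charset is declared.
            charset = part.get_content_charset() or "utf-8"
            yield strip_html(payload.decode(charset, errors="replace"))
```

Given a message parsed with email.message_from_string(), something like
list(html_parts_as_text(msg)) would give one plain-text string per
text/html part. Unlike calling out to lynx, this makes no attempt at
layout; it only drops the tags.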
Regards,
Matt