[Email-SIG] A suggestion: HTML stripping

Fri Nov 21 15:18:55 EST 2003

> I had a suggestion from a happy email package user that I thought
> might be interesting to consider.  He was using email as a
> replacement for the Perl demime thingie.  He was generally happy
> about what email allowed him to do, except for one thing.  He was
> using a DecodedGenerator but wanted to strip text/html parts of its
> tags, leaving just plain text.

> In Mailman, I actually call out to something like lynx to render
> text/html into plain text, but I think he wanted something simpler.
> He just wanted to rip out all the tags, and ended up using an
> HTMLParser class to do this.

It's a sad state we've come to when we have to turn mail back into
text <wink>.

I have code that's just like what you describe. In fact it's a
slightly-twiddled version of some code that Alex Martelli posted to
comp.lang.python a while back. It's sufficiently trivial that I may
as well just paste it here in case it's of use to anyone:

# Very slightly modified from Alex Martelli's news post 
# <9cpm4202cv1 at news1.newsguy.com> of May 2, 2001,
# Subject: Stripping HTML tags from a string
# Thanks, Alex

import sgmllib

class Cleaner(sgmllib.SGMLParser):
  # Only &nbsp; is mapped, and to an ordinary space
  entitydefs={"nbsp": " "} # I'll break if I want to

  def __init__(self):
    sgmllib.SGMLParser.__init__(self)
    self.result = []
  # <p> and <br> become newlines; all other tags are dropped
  def do_p(self, *junk):
    self.result.append('\n')
  def do_br(self, *junk):
    self.result.append('\n')
  def handle_data(self, data):
    self.result.append(data)
  def cleaned_text(self):
    return ''.join(self.result)

def stripHTML(text):
  c=Cleaner()
  try:
    c.feed(text)
    c.close() # force processing of any data still buffered
  except sgmllib.SGMLParseError:
    raise ValueError("Unable to parse HTML")
  else:
    t=c.cleaned_text()
    return t
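
For anyone reading this on Python 3, where sgmllib is no longer in the
standard library, roughly the same idea can be sketched on top of
html.parser.HTMLParser. This is only a sketch under that assumption, not
a drop-in replacement: the names TagStripper and strip_html are made up
for the example, and unlike the Cleaner above it decodes all character
references (not just nbsp) and does not raise on malformed markup.

```python
# Sketch only: a Python 3 take on the Cleaner idea using html.parser.
# TagStripper and strip_html are hypothetical names for this example.
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, turning <p> and <br> into newlines."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.result = []

    def handle_starttag(self, tag, attrs):
        # All tags are dropped; <p> and <br> leave a newline behind.
        if tag in ("p", "br"):
            self.result.append("\n")

    def handle_data(self, data):
        self.result.append(data)

    def cleaned_text(self):
        # &nbsp; decodes to U+00A0; map it to a breakable space,
        # as the sgmllib version does.
        return "".join(self.result).replace("\xa0", " ")


def strip_html(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered
    return stripper.cleaned_text()
```

Calling strip_html() on the decoded payload of each text/html part
should then give plain text suitable for the kind of output
DecodedGenerator produces.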

Regards,
Matt