Converting HTML to ASCII

Fri Feb 25 14:58:17 EST 2005

gf gf wrote:
> [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is too complex]

You haven't specified what you mean by "extracting" ASCII, but I'll assume that 
you want to start by eliminating html tags and comments, which is easy enough 
with a couple of regular expressions:

  >>> import re
  >>> comments = re.compile('<!--.*?-->', re.DOTALL)
  >>> tags = re.compile('<.*?>', re.DOTALL)
  >>>
  >>> def striptags(text):
  ...     # comments go first: they may contain ">" and confuse the tag pattern
  ...     text = comments.sub('', text)
  ...     text = tags.sub('', text)
  ...     return text
  ...
  >>> def collapsenewlines(text):
  ...     # drop empty and whitespace-only lines left behind by the removed tags
  ...     return "\n".join(line for line in text.splitlines() if line.strip())
  ...
  >>> import urllib2
  >>> f = urllib2.urlopen('http://www.python.org/')
  >>> source = f.read()
  >>> text = collapsenewlines(striptags(source))
  >>>

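This will of course fail if there is a "<" without a matching ">": either the 
text up to the next ">" is swallowed, or the stray "<" is left in place.  It 
also leaves the contents of <script> and <style> elements in the output.  But 
it does not care whether the HTML is otherwise well-formed.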
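
To see where the regex approach breaks down, here is a small self-contained 
sketch using the same two patterns.  The sample strings are made up for 
illustration and are not taken from any real page:

```python
import re

comments = re.compile('<!--.*?-->', re.DOTALL)
tags = re.compile('<.*?>', re.DOTALL)

def striptags(text):
    # Remove comments first, then anything that looks like a tag.
    text = comments.sub('', text)
    return tags.sub('', text)

# Made-up samples, for illustration only.
print(striptags('<p>Hello <b>world</b></p>'))      # Hello world
print(striptags('a < b and <b>bold</b>'))          # a bold  ("b and " is lost)
print(striptags('1 < 2'))                          # 1 < 2   (stray "<" is kept)
print(striptags('<script>var x = 1;</script>hi'))  # var x = 1;hi
```

If losing text after a stray "<" matters for your input, a real parser is 
probably the safer route.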

This leaves you with the additional task of substituting the HTML escaped 
characters (entities), e.g., "&nbsp;", not all of which will have ASCII 
representations.
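
One possible way to handle that last step is sketched below.  It assumes a 
current Python 3, where html.unescape is available in the standard library 
(it is not in the Python used in the session above).  The name to_ascii is 
just a hypothetical helper, and silently dropping characters with no ASCII 
equivalent is a choice made for this sketch; you may prefer to substitute 
something visible instead.

```python
import html
import unicodedata

def to_ascii(text):
    # Decode entities such as &amp; and &nbsp; into Unicode characters.
    text = html.unescape(text)
    # Decompose accented letters into base letter + combining mark;
    # NFKD also maps the no-break space to a plain space.
    text = unicodedata.normalize('NFKD', text)
    # Drop whatever still has no ASCII representation (e.g. the (c) sign).
    return text.encode('ascii', 'ignore').decode('ascii')

# Made-up sample, for illustration only.
print(repr(to_ascii('Tom &amp; Jerry&nbsp;caf&eacute; &copy;')))
# 'Tom & Jerry cafe '
```

Run this after the tags have been stripped, not before; otherwise an escaped 
"&lt;" turns back into "<" and gets treated as the start of a tag.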

HTH

Michael