Converting HTML to ASCII
Michael Spencer
mahs at telcopartners.com
Fri Feb 25 14:58:17 EST 2005
gf gf wrote:
> [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is too complex]
You haven't specified what you mean by "extracting" ASCII, but I'll assume that
you want to start by eliminating html tags and comments, which is easy enough
with a couple of regular expressions:
>>> import re
>>> comments = re.compile('<!--.*?-->', re.DOTALL)
>>> tags = re.compile('<.*?>', re.DOTALL)
...
>>> def striptags(text):
...     text = re.sub(comments, '', text)
...     text = re.sub(tags, '', text)
...     return text
...
>>> def collapsenewlines(text):
...     return "\n".join(line for line in text.splitlines() if line)
...
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> source = f.read()
>>> text = collapsenewlines(striptags(source))
>>>
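This will of course misbehave if there is a "<" without a matching ">": the
tag pattern then swallows everything up to the next ">", or leaves the "<" in
place if there is none. It also keeps the text inside <script> and <style>
elements, and there are probably other cases too. But it is indifferent to
whether the HTML is well-formed.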
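To make the failure mode concrete, here is a small sketch of the same two
patterns run on literal strings instead of a fetched page. It assumes Python 3
(the session above is Python 2), and the sample strings are made up for
illustration:

```python
import re

# Same two patterns as in the session above.
comments = re.compile(r'<!--.*?-->', re.DOTALL)
tags = re.compile(r'<.*?>', re.DOTALL)

def striptags(text):
    """Remove comments first, then anything that looks like a tag."""
    text = comments.sub('', text)
    return tags.sub('', text)

# Badly nested, unclosed markup is handled without complaint.
malformed = '<p>unclosed <b>bold\n<!-- note -->\n<i>text</p>'
print(repr(striptags(malformed)))

# A stray "<" is treated as the start of a tag, so the text up to the
# next ">" is swallowed along with it.
stray = 'if a < b then <b>stop</b> now'
print(repr(striptags(stray)))

# With no ">" anywhere after it, the stray "<" is simply left in place.
print(repr(striptags('x < y')))
```

The second case is the one to watch for: "b then" disappears silently rather
than raising an error.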
This leaves you with the additional task of substituting the HTML character
entities, e.g., "&nbsp;", not all of which will have ASCII representations.
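One possible way to handle that last step, sketched under the assumption of
Python 3.4 or later, where the standard library provides html.unescape. The
function name to_ascii and the choice of "?" as a stand-in for characters with
no ASCII equivalent are mine, not part of the original suggestion:

```python
import html

def to_ascii(text):
    """Decode HTML character entities, then force the result into ASCII.

    Hypothetical helper: characters with no ASCII representation are
    replaced by '?' rather than dropped, so the loss stays visible.
    """
    # Turn entities such as &amp; or &eacute; into the characters they name.
    text = html.unescape(text)
    # A non-breaking space has an obvious ASCII stand-in: a plain space.
    text = text.replace('\u00a0', ' ')
    # Anything still outside ASCII becomes '?'.
    return text.encode('ascii', 'replace').decode('ascii')

sample = 'caf&eacute;&nbsp;&amp;&nbsp;bar &copy; 2005'
print(to_ascii(sample))
```

Run this after stripping tags, not before: unescaping first would turn "&lt;"
into a literal "<", which the tag pattern would then try to match.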
HTH
Michael