Converting HTML to ASCII
Mike Meyer
mwm at mired.org
Fri Feb 25 16:13:41 EST 2005
Michael Spencer <mahs at telcopartners.com> writes:
> gf gf wrote:
>> [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is too complex]
>
> You haven't specified what you mean by "extracting" ASCII, but I'll
> assume that you want to start by eliminating html tags and comments,
> which is easy enough with a couple of regular expressions:
>
> >>> import re
> >>> comments = re.compile('<!--.*?-->', re.DOTALL)
> >>> tags = re.compile('<.*?>', re.DOTALL)
> ...
> >>> def striptags(text):
> ...     text = re.sub(comments,'', text)
> ...     text = re.sub(tags,'', text)
> ...     return text
> ...
> >>> def collapsenewlines(text):
> ...     return "\n".join(line for line in text.splitlines() if line)
> ...
> >>> import urllib2
> >>> f = urllib2.urlopen('http://www.python.org/')
> >>> source = f.read()
> >>> text = collapsenewlines(striptags(source))
> >>>
>
> This will of course fail if there is a "<" without a ">", probably in
> other cases too. But it is indifferent to whether the html is
> well-formed.
It also fails on tags with a ">" inside a quoted attribute value.
That's well-formed but ill-used HTML.
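
To make that failure concrete, here is a minimal sketch. It is written for
Python 3 and uses the standard library's html.parser module rather than the
Python 2 modules in the quoted session. The sample markup and the names
TextOnly and strip_with_parser are made up for illustration. It is one
possible approach, not a complete HTML-to-text converter: for instance, it
keeps the contents of script and style elements as ordinary text.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the quoted session.
tags = re.compile('<.*?>', re.DOTALL)

# Hypothetical sample: the alt attribute contains a ">" inside quotes.
sample = '<p>before <img alt="a > b" src="x.png"> after</p>'

# The non-greedy match stops at the first ">", which is inside the
# attribute value, so the rest of the tag leaks into the output.
regex_result = tags.sub('', sample)


class TextOnly(HTMLParser):
    """Collect character data and ignore tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comments go to handle_comment,
        # which is left at its default (do nothing).
        self.chunks.append(data)


def strip_with_parser(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


parser_result = strip_with_parser(sample)

print(repr(regex_result))
print(repr(parser_result))
```

The parser tracks quoted attribute values, so it treats the whole img tag as
one tag, while the regular expression has no notion of quoting.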
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.