Converting HTML to ASCII
Michael Spencer
mahs at telcopartners.com
Fri Feb 25 18:32:04 EST 2005
Mike Meyer wrote:
>
> It also fails on tags with a ">" in a string in the tag. That's
> well-formed but ill-used HTML.
>
> <mike
True enough...however, it doesn't fail too horribly:
>>> striptags("""<sometag attribute = '>'>the text</sometag>""")
"'>the text"
>>>
and I think that case could be rectified rather easily, by stripping any content
up to '>' in the result without breaking anything else.
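A minimal sketch of one way to handle this case, assuming a regex-based stripper. The source of striptags is not shown in this message, so `TAG_RE` and `striptags_quoted` below are hypothetical names. Note that this is a different technique from the clean-up of the result suggested above: instead of post-processing, it makes the tag pattern itself skip over quoted attribute values.

```python
import re

# Hypothetical pattern: a tag is '<', followed by any run of double-quoted
# strings, single-quoted strings, or characters that are neither a quote
# nor '>', followed by the closing '>'.
TAG_RE = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")

def striptags_quoted(text):
    """Remove tags, treating '>' inside a quoted attribute value as part of the tag."""
    return TAG_RE.sub("", text)

print(striptags_quoted("""<sometag attribute = '>'>the text</sometag>"""))
# prints: the text
```

A tag containing an unterminated quote does not match this pattern and is left in the output, so badly broken markup would still need a real parser.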
BTW, I took a first look at BeautifulSoup. As far as I could tell, there is no
built-in way to extract text from its parse tree; however, adding one is trivial:
>>> import re
>>> from bsoup import BeautifulSoup, Tag
>>> def extracttext(obj):
...     # Recurse into tags; anything else is treated as a text node
...     if isinstance(obj, Tag):
...         return "".join(extracttext(c) for c in obj.contents)
...     else:
...         return str(obj)
...
>>> def bsouptext(text):
...     souptree = BeautifulSoup(text)
...     bodytext = extracttext(souptree.first())
...     text = re.sub(comments, '', bodytext)
...     text = collapsenewlines(text)
...     return text
...
>>> bsouptext("""<sometag attribute = '>'>the text</sometag>""")
"'>the text"
On one 'real world test' (nytimes.com), I find the regexp approach to be more
accurate, but I won't load up this message with the output to prove it ;-)
Michael