Converting HTML to ASCII

Michael Spencer mahs at telcopartners.com
Fri Feb 25 18:32:04 EST 2005


Mike Meyer wrote:

> 
> It also fails on tags with a ">" in a string in the tag. That's
> well-formed but ill-used HTML.
> 
>             <mike
True enough...however, it doesn't fail too horribly:
  >>> striptags("""<sometag attribute = '>'>the text</sometag>""")
  "'>the text"
  >>>
and I think that case could be rectified rather easily by stripping any content
up to '>' in the result, without breaking anything else.
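
A minimal sketch of a related fix, which is a different technique from the post-processing idea above: make the tag pattern itself quote-aware, so that a quoted attribute value is consumed as a unit. The names TAG and striptags_quoted are hypothetical and are not the striptags from earlier in the thread.

```python
import re

# Hypothetical quote-aware tag pattern (an assumption, not the original regexp).
# Inside a tag, a single- or double-quoted attribute value is consumed as one
# unit, so a '>' inside quotes does not terminate the tag.
TAG = re.compile(r"""<(?:[^>'"]|'[^']*'|"[^"]*")*>""")


def striptags_quoted(text):
    """Remove tags, tolerating '>' inside quoted attribute values."""
    return TAG.sub("", text)


print(striptags_quoted("""<sometag attribute = '>'>the text</sometag>"""))
```

A tag containing an unbalanced quote will not match this pattern and is left in the output, so this only covers the well-formed case discussed here.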

BTW, I took a first look at BeautifulSoup.  As far as I could tell, there is no
built-in way to extract text from its parse tree; however, adding one is trivial:

  >>> import re
  >>> from bsoup import BeautifulSoup, Tag
  >>> def extracttext(obj):
  ...     # Recursively concatenate the text of a node and its children
  ...     if isinstance(obj, Tag):
  ...         return "".join(extracttext(c) for c in obj.contents)
  ...     else:
  ...         return str(obj)
  ...
  >>> def bsouptext(text):
  ...     # comments and collapsenewlines are not defined in this message
  ...     souptree = BeautifulSoup(text)
  ...     bodytext = extracttext(souptree.first())
  ...     text = re.sub(comments, '', bodytext)
  ...     text = collapsenewlines(text)
  ...     return text
  ...
  >>>

  >>> bsouptext("""<sometag attribute = '>'>the text</sometag>""")
  "'>the text"
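
For comparison, a parser that tokenizes attribute values should not trip over the quoted '>'. Below is a minimal sketch using the standard library's html.parser module from Python 3, which is not part of the code above. TextExtractor and parsertext are illustrative names, and comment removal and newline collapsing are left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data between tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text outside of tags; attribute values never arrive here.
        self.chunks.append(data)


def parsertext(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


print(parsertext("""<sometag attribute = '>'>the text</sometag>"""))
```

Because the parser reads the quoted attribute value as part of the start tag, only the element's text reaches handle_data.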

On one 'real world test' (nytimes.com), I find the regexp approach to be more 
accurate, but I won't load up this message with the output to prove it ;-)

Michael




