How do I convert escaped HTML into a string?

Stefan Behnel stefan.behnel-n05pAM at web.de
Sat Nov 24 00:58:42 EST 2007


Just Another Victim of the Ambient Morality wrote:
>     I've done a google search on this but, amazingly, I'm the first guy to 
> ever need this!

You cannot infer that from a Google search.


>     So, how do I convert HTML to plaintext?  Something like this:
> 
> <div>This is a string.</div>
> 
>     ...into:
> 
> This is a string.
> 
>     Actually, the ideal would be a function that takes an HTML string and 
> converts it into the string that the HTML would correspond to.  For instance, 
> converting:
> 
> <div>This &    that
> or the other thing.</div>
> 
>     ...into:
> 
> This & that or the other thing.
> 
>     ...since HTML seems to convert any amount and type of whitespace into a 
> single space (a bizarre design choice if I've ever seen one).

So what you want to do is parse HTML and extract the text content. There are
quite a few ways to do that, including lxml.html:

http://codespeak.net/lxml/dev/lxmlhtml.html

    >>> htmldata = """<div>This &    that
    ... or the other thing.</div>"""
    >>> from lxml import html
    >>> print(html.fragment_fromstring(htmldata).text_content())
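
Note that extracting the text content does not by itself collapse whitespace the way a browser does when rendering. As a rough sketch of that extra step, here is a variant that uses only the standard library's html.parser instead of lxml (a swapped-in technique, chosen so the example has no dependencies). The names TextExtractor and html_to_text are made up for illustration, and the sketch assumes a simple fragment without script or style elements, whose contents it would otherwise include:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(htmldata):
    # Hypothetical helper, not part of any library.
    parser = TextExtractor()
    parser.feed(htmldata)
    parser.close()  # flush any text still buffered by the parser
    # Collapse every run of whitespace (spaces, tabs, newlines) into a
    # single space, roughly as a browser does for ordinary inline text.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


print(html_to_text("<div>This &amp;    that\nor the other thing.</div>"))
```

The same whitespace step could presumably be applied to the string returned by text_content() in the lxml version above.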

Stefan


