text representation of HTML

Tim Williams listserver at tdw.net
Thu Jul 20 12:36:49 EDT 2006


On 20 Jul 2006 15:12:27 GMT, Duncan Booth <duncan.booth at invalid.invalid> wrote:
> Ksenia Marasanova wrote:
> > i want to send plain text alternative of html email, and would prefer
> > to do it automatically from HTML source.
> > Any hints?
>
> Use htmllib:
>
> >>> import htmllib, formatter, StringIO
> >>> def cleanup(s):
>     out = StringIO.StringIO()
>     p = htmllib.HTMLParser(
>         formatter.AbstractFormatter(formatter.DumbWriter(out)))
>     p.feed(s)
>     p.close()
>     if p.anchorlist:
>         print >>out
>         for idx,anchor in enumerate(p.anchorlist):
>             print >>out, "\n[%d]: %s" % (idx+1,anchor)
>     return out.getvalue()
>
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br
> />test</p></div>''')
>
> Title
>
> This is a
> test
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
> href="http://python.org">a link</a> to the Python homepage</p></div>''')
>
> Title
>
> This is a
> test with a link[1] to the Python homepage
>
> [1]: http://python.org
>

cleanup() doesn't handle script and style elements very well. html2text
does a much better job with these and gives more structured output
(compatible with Markdown):

http://www.aaronsw.com/2002/html2text/

>>> import html2text
>>> print html2text.html2text('''<div><h1>Title</h1><p>This is a <br
/>test with <a href="http://python.org">a link</a> to the Python
homepage</p></div>''')

# Title

This is a
test with [a link][1] to the Python homepage

    [1]: http://python.org
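
If you end up on Python 3, where htmllib no longer exists, something along the same lines as cleanup() can be put together with the standard library's html.parser. What follows is only a rough sketch, not a replacement for html2text: the names TextExtractor and html_to_text are made up for this example, the list of block-level tags is an assumption, and the layout rules are deliberately simple. It does show one way of dropping script and style contents while keeping the numbered link list.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skip <script>/<style>."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new paragraph.
    BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments in document order
        self.links = []       # hrefs, numbered from 1 in the output
        self.href_stack = []  # hrefs of currently open <a> tags
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href"))
        elif tag in self.BLOCK:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag == "a":
            if self.href_stack:
                href = self.href_stack.pop()
                if href is not None:
                    self.links.append(href)
                    self.parts.append("[%d]" % len(self.links))
        elif tag in self.BLOCK:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return a plain-text rendering of html with a trailing link list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    text = re.sub(r" {2,}", " ", text).strip()
    if parser.links:
        refs = ["[%d]: %s" % (i, url)
                for i, url in enumerate(parser.links, 1)]
        text += "\n\n" + "\n".join(refs)
    return text


print(html_to_text('<div><h1>Title</h1><p>This is a <br />test with '
                   '<a href="http://python.org">a link</a> to the Python '
                   'homepage</p><script>var x = 1;</script></div>'))
```

Nested markup, tables and lists will come out fairly crude with this, which is the sort of thing html2text already takes care of.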


HTH :)


