text representation of HTML
Tim Williams
listserver at tdw.net
Thu Jul 20 12:36:49 EDT 2006
On 20 Jul 2006 15:12:27 GMT, Duncan Booth <duncan.booth at invalid.invalid> wrote:
> Ksenia Marasanova wrote:
> > i want to send plain text alternative of html email, and would prefer
> > to do it automatically from HTML source.
> > Any hints?
>
> Use htmllib:
>
> >>> import htmllib, formatter, StringIO
> >>> def cleanup(s):
>     out = StringIO.StringIO()
>     p = htmllib.HTMLParser(
>         formatter.AbstractFormatter(formatter.DumbWriter(out)))
>     p.feed(s)
>     p.close()
>     if p.anchorlist:
>         print >>out
>         for idx,anchor in enumerate(p.anchorlist):
>             print >>out, "\n[%d]: %s" % (idx+1,anchor)
>     return out.getvalue()
>
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br
> />test</p></div>''')
>
> Title
>
> This is a
> test
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
> href="http://python.org">a link</a> to the Python homepage</p></div>''')
>
> Title
>
> This is a
> test with a link[1] to the Python homepage
>
> [1]: http://python.org
>
cleanup() doesn't handle scripts and styles too well. html2text does a
much better job with these and gives more structured output that is
compatible with Markdown:
http://www.aaronsw.com/2002/html2text/

>>> import html2text
>>> print html2text.html2text('''<div><h1>Title</h1><p>This is a <br
/>test with <a href="http://python.org">a link</a> to the Python
homepage</p></div>''')
# Title

This is a
test with [a link][1] to the Python homepage

[1]: http://python.org

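
If pulling in html2text is not an option, the script/style problem can also be worked around with the standard library alone. The sketch below is only an illustration, not anything from htmllib or html2text: it assumes Python 3 (where htmllib and formatter are no longer available) and uses html.parser instead. The names TextExtractor and html_to_text are made up for this example, and the list of block-level tags is a rough guess rather than a complete one.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    # Hypothetical tag sets chosen for this sketch; extend as needed.
    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Only keep text that is not inside <script> or <style>.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop the empty lines left by block tags.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = ('<div><style>p {color: red}</style><script>alert("x")</script>'
          '<h1>Title</h1><p>This is a <br />test</p></div>')
print(html_to_text(sample))
```

This drops links and all other structure, so it is closer to cleanup() without the anchor list than to html2text; for Markdown-style output html2text is still the better choice.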
HTH :)