text representation of HTML

Duncan Booth duncan.booth at invalid.invalid
Thu Jul 20 11:12:27 EDT 2006


Ksenia Marasanova wrote:

> I am looking for a library that will give me a very simple text
> representation of HTML.
> For example,
> <div><h1>Title</h1><p>This is a <br />test</p></div>
> 
> will be transformed to:
> 
> Title
> 
> This is a
> test
> 
> 
> I want to send a plain-text alternative to an HTML email, and would
> prefer to generate it automatically from the HTML source.
> Any hints?

Use htmllib together with the formatter module, which can write the
parsed document out as plain text:

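>>> import htmllib, formatter, StringIO
>>> def cleanup(s):
    # Render the HTML as plain text into an in-memory buffer.
    out = StringIO.StringIO()
    p = htmllib.HTMLParser(
        formatter.AbstractFormatter(formatter.DumbWriter(out)))
    p.feed(s)
    p.close()
    # htmllib marks each link as [n] in the text and collects the
    # hrefs in p.anchorlist; append them as numbered footnotes.
    if p.anchorlist:
        print >>out
        for idx,anchor in enumerate(p.anchorlist):
            print >>out, "\n[%d]: %s" % (idx+1,anchor)
    return out.getvalue()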

>>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test</p></div>''')

Title

This is a
test
>>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a href="http://python.org">a link</a> to the Python homepage</p></div>''')

Title

This is a
test with a link[1] to the Python homepage

[1]: http://python.org
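
The session above is Python 2. htmllib no longer exists in Python 3,
and the formatter module was removed in Python 3.10, so the following
is only a rough sketch of the same idea on top of html.parser. It is
not a port of the code above: the TextRenderer class and the BLOCK_TAGS
set are names made up for this illustration, and the list of tags
treated as paragraph breaks is an assumption you would probably want
to extend.

```python
import re
from html.parser import HTMLParser

# Assumption: tags that should start and end a paragraph in the text
# output. This set is illustrative, not exhaustive.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "ul", "ol", "li", "table", "tr", "blockquote"}


class TextRenderer(HTMLParser):
    """Hypothetical helper: collects text fragments and link targets."""

    def __init__(self):
        super().__init__()
        self.parts = []         # text fragments and line breaks, in order
        self.anchors = []       # hrefs, in order of appearance
        self._open_href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a":
            self._open_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href:
            # Mark the link as [n] in the text, as htmllib did.
            self.anchors.append(self._open_href)
            self.parts.append("[%d]" % len(self.anchors))
            self._open_href = None
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser would.
        self.parts.append(re.sub(r"\s+", " ", data))


def cleanup(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    text = "".join(renderer.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)    # trim around breaks
    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # at most one blank line
    if renderer.anchors:
        notes = ["[%d]: %s" % (i, url)
                 for i, url in enumerate(renderer.anchors, 1)]
        text += "\n\n" + "\n".join(notes)
    return text


if __name__ == "__main__":
    print(cleanup('<div><h1>Title</h1><p>This is a <br />test</p></div>'))
```

For the two inputs used above, this sketch should give the same text
apart from the leading blank line. It does no line wrapping and has no
special handling for lists, tables or <pre> blocks.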




