how to extract only text from a html ?

Tue Oct 31 18:13:31 EST 2000

Hwanjo wrote:
> Could someone please tell me how to get rid of all the tags in a html ?
> It seems that the htmllib.HTMLParser is not helpful to do it.

here's one way to do it, using htmllib together with the formatter module to render the document as plain text:

html = """
<html>
<body>
<h1>header</h1>
<p>this is some <i>html</i>
<p>some more text
<p>and here's a <a href="link">link</a>
</body>
</html>
"""

import htmllib, formatter
import StringIO

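# create a memory file to collect the plain-text output
file = StringIO.StringIO()

# convert html to text: the parser sends formatting events to the
# formatter, and the DumbWriter writes the result to the file as plain text
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = htmllib.HTMLParser(f)
p.feed(html)
p.close()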
# list the link targets collected by the parser (the [1] markers
# in the text refer to these)
if p.anchorlist:
    file.write("\n\nlinks:\n")
    i = 1
    for anchor in p.anchorlist:
        file.write("%d: %s\n" % (i, anchor))
        i = i + 1

text = file.getvalue()

print text

## header
##
## this is some html
##
## some more text
##
## and here's a link[1]
##
## links:
## 1: link
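
htmllib is not available in Python 3, so for readers on a current interpreter here is a rough sketch of the same idea built on html.parser from the standard library. this is a substitute technique, not the htmllib/formatter approach shown above: the TextExtractor class, the html_to_text function, and the BLOCK_TAGS set are hypothetical names made up for this illustration, and the list of block-level tags is an assumption that you would likely need to extend. it does not number the links inline the way htmllib does, and it does not skip the contents of script or style elements.

```python
from html.parser import HTMLParser

# assumption: tags that should start a new line in the text output
BLOCK_TAGS = {"p", "br", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class TextExtractor(HTMLParser):
    """hypothetical helper: collects text chunks and link targets."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # pieces of text, with "\n" at block boundaries
        self.links = []   # href values of the <a> tags, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """return (text, links) for an html string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # normalize whitespace within each line and drop empty lines
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    text = "\n".join(line for line in lines if line)
    return text, parser.links
```

calling html_to_text(html) on a string like the one at the top of this post returns the text with one line per block element, plus the list of link targets, which you can print under a "links:" heading as in the original script.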

</F>

<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->