How to extract only the text from an HTML document?
Fredrik Lundh
fredrik at effbot.org
Tue Oct 31 18:13:31 EST 2000
Hwanjo wrote:
> Could someone please tell me how to get rid of all the tags in a html ?
> It seems that the htmllib.HTMLParser is not helpful to do it.
here's one way to do it:
html = """
<html>
<body>
<h1>header</h1>
<p>this is some <i>html</i>
<p>some more text
<p>and here's a <a href="link">link</a>
</body>
</html>
"""
import htmllib, formatter
import StringIO
# create memory file
file = StringIO.StringIO()
# convert html to text
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = htmllib.HTMLParser(f)
p.feed(html)
p.close()
if p.anchorlist:
    file.write("\n\nlinks:\n")
    i = 1
    for anchor in p.anchorlist:
        file.write("%d: %s\n" % (i, anchor))
        i = i + 1
text = file.getvalue()
print text
## header
##
## this is some html
##
## some more text
##
## and here's a link[1]
##
## links:
## 1: link
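
note: htmllib and formatter only exist in Python 2 (htmllib was dropped in Python 3.0, formatter in 3.10). On Python 3, a rough equivalent can be sketched with html.parser. The names TextExtractor, BLOCK_TAGS and html_to_text below are made up for illustration and are not part of the standard library. This sketch collapses all whitespace into single spaces rather than reproducing the paragraph layout shown above, and it does not skip <script> or <style> contents.

```python
from html.parser import HTMLParser

# Tags treated as block-level, so adjacent blocks do not run together.
# This set is an assumption for the sketch, not an exhaustive list.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text chunks and <a href> targets."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return (text, links): text with whitespace collapsed, plus href values."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return text, parser.links
```

usage: text, links = html_to_text(html) returns the tag-free text and the list of link targets, which plays the role of p.anchorlist in the Python 2 version.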
</F>
<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->