Looking for code which allows easy extraction of text from HTML
Joe Francia
usenet at soraia.com
Wed Mar 5 12:55:59 EST 2003
Use SGMLParser from sgmllib; it's slightly easier to use than HTMLParser.
Define start_<tagname> and end_<tagname> methods for each <tagname> you
want to handle; handle_data(self, data) is called for all text between
tags. The following example prints the text of each anchor on the Google
start page:

    from sgmllib import SGMLParser
    import urllib

    class ParseMe(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            self.in_a = 0

        def start_a(self, attr):
            self.in_a = 1
            print '<',

        def end_a(self):
            self.in_a = 0
            print '>'

        def handle_data(self, data):
            if self.in_a:
                print data,

    if __name__ == '__main__':
        ht = ParseMe()
        ht.feed(urllib.urlopen('http://www.google.com/').read())
        ht.close()

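
For anyone reading this on Python 3, where sgmllib is no longer in the standard library, the same flag-based idea can be sketched with html.parser.HTMLParser. This is only a rough equivalent, not a drop-in replacement: the class name AnchorText, the helper anchor_texts, and the sample HTML string are made up for illustration, and the text is collected into a list rather than printed as it arrives.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_a = False   # True while the parser is inside an <a> element
        self.anchors = []   # one string per anchor seen so far

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
            self.anchors.append('')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        # Called for all text between tags; keep it only inside anchors.
        if self.in_a:
            self.anchors[-1] += data


def anchor_texts(html):
    """Hypothetical helper: return the text of every anchor in html."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return parser.anchors


if __name__ == '__main__':
    # Made-up sample input; no network access is needed.
    sample = '<p>See <a href="/a">first <b>link</b></a> and <a href="/b">second</a>.</p>'
    print(anchor_texts(sample))
```

HTMLParser routes every tag through handle_starttag and handle_endtag instead of per-tag start_a/end_a methods, so the tag name is checked explicitly. Feeding it the body of a fetched page works the same way as feeding it the sample string.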
Grzegorz Adam Hankiewicz wrote:
> Hello.
>
> I need to parse a few HTML pages which contain information. These
> pages were generated from a database and thus have a common HTML code
> structure. Is there a package which extracts text given a condition?
> I would need a re-like module for HTML code. I have thought of
> transforming the HTML to XML with HTMLParser and use minidom
> to extract the text with a few recursive text node extraction
> functions. Is there a better way?
>