HTML data extraction?
djw
dwelch91.nospam at comcast.net
Mon Dec 22 15:00:08 EST 2003
I don't know if there is anything at a higher level (I guess a Google
session would tell you that), but doing what you describe with the
HTMLParser module is very straightforward. All you have to do is keep
some state flags in the derived HTMLParser class that record whether
you are currently inside the tags you are looking for, and let those
flags control whether the data in between is collected.
Starting with the example in the docs, with some (untested) additions:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__( self ):
        HTMLParser.__init__( self )
        self.in_bold_tag = False
        self.in_list_tag = False
        self.data_in_bold_list = ''

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag
        if tag == 'b': self.in_bold_tag = True
        if tag == 'li' : self.in_list_tag = True

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag
        if tag == 'b': self.in_bold_tag = False
        if tag == 'li' : self.in_list_tag = False

    def handle_data( self, data ):
        if self.in_bold_tag and self.in_list_tag:
            self.data_in_bold_list = ''.join( [ self.data_in_bold_list,
                                                data ] )
This is just an outline, but you get the idea...
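
For anyone wanting to try the idea on a current interpreter, here is a
minimal sketch of the same flag technique, assuming Python 3, where the
module is named html.parser. The names BoldInListParser and bold_in_list
are made up for illustration. It uses counters instead of booleans so
that nested tags do not clear the state early. Like the outline above, it
only checks that both a <b> and an <li> are open, not which one is the
outer tag.

```python
from html.parser import HTMLParser


class BoldInListParser(HTMLParser):
    """Collect text seen while both a <b> and an <li> element are open."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0     # number of <li> elements currently open
        self.bold_depth = 0   # number of <b> elements currently open
        self.chunks = []      # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "b":
            self.bold_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray end tags so the counters never go negative.
        if tag == "li" and self.li_depth:
            self.li_depth -= 1
        elif tag == "b" and self.bold_depth:
            self.bold_depth -= 1

    def handle_data(self, data):
        if self.li_depth and self.bold_depth:
            self.chunks.append(data)


def bold_in_list(html):
    """Hypothetical helper: return the collected text fragments as a list."""
    parser = BoldInListParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling bold_in_list() on a string of HTML returns the collected
fragments as a list. Pages that leave out closing </li> tags would keep
the counter above zero, so real-world input may need extra handling.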
-Don
Dave Kuhlman wrote:
> I recently read an article by Jon Udell about extracting data from
> Web pages as a poor person's Web services. So, I have a question:
>
> Is there any Python support for finding and extracting information
> from HTML documents?
>
> I'd like something that would do things like the following:
>
> - return the data which is inside a <b> tag which is inside a
> <li> tag.
>
> - return the data which is inside an <a> tag that has attribute
> href="http://www.python.org".
>
> - Etc.
>
> It would be a sort of structured grep for HTML.
>
> I've found the HTMLParser and htmllib modules in the Python
> standard library, but I'm wondering if there is anything at a
> higher level.
>
> Web searches did not turn up anything interesting.
>
> Thanks for help.
>
> Dave
>
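
The second request quoted above (the data inside an <a> tag with a given
href) looks like it can be handled with the same flag approach, since
handle_starttag receives the attributes as a list of (name, value) pairs.
A rough sketch, again assuming Python 3's html.parser; the class name
LinkTextParser and its attributes are invented for this example:

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of each <a> element whose href equals a given URL."""

    def __init__(self, href):
        super().__init__()
        self.href = href        # the href value to look for
        self.in_match = False   # True while inside a matching <a> element
        self.current = []       # text fragments of the link being read
        self.results = []       # finished link texts, one per matching link

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; dict() makes lookup easy.
        if tag == "a" and dict(attrs).get("href") == self.href:
            self.in_match = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_match:
            self.results.append("".join(self.current))
            self.in_match = False

    def handle_data(self, data):
        if self.in_match:
            self.current.append(data)
```

After feed() and close(), the results attribute holds one string per
matching link. The comparison is an exact string match on the href, so a
trailing slash or different capitalisation would not match.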