Help with parsing web page

Miki Tebeka miki.tebeka at zoran.com
Tue Jun 15 06:18:17 EDT 2004


Hello RiGGa,

> Anyone?, I have found out I can use sgmllib but find the documentation is
> not that clear, if anyone knows of a tutorial or howto it would be
> appreciated.
I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAG>, end_<TAG> (where <TAG> is the tag name,
e.g. start_title) and handle_data.
Each callback only sees its own event and you don't know *when* it will
be called, so you need to keep an internal state. It can be a simple
variable, or a stack if you want to deal with nested tags.

A short example:
#!/usr/bin/env python

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, NullFormatter())
        self.state = ""
        self.data = ""
    
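    def start_title(self, attrs):
        # Entering <title>: start collecting text
        self.state = "title"
        self.data = ""

    def end_title(self):
        # Leaving <title>: report the text and stop collecting,
        # otherwise handle_data keeps appending the rest of the page
        print "Title:", self.data.strip()
        self.state = ""

    def handle_data(self, data):
        # Called for every run of text; keep it only while inside <title>
        if self.state == "title":
            self.data += data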

if __name__ == "__main__":
    from sys import argv

    parser = TitleParser()
    parser.feed(open(argv[1]).read())
    parser.close()  # force processing of any data still buffered
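
For the nested-tag case mentioned above, here is a rough sketch of the
stack idea. It is not the htmllib code from the example: it assumes
Python 3, where htmllib and sgmllib no longer exist, and uses
html.parser.HTMLParser with its handle_starttag / handle_endtag /
handle_data callbacks instead. The names NestedTextParser, parse_paths
and VOID_TAGS are made up for illustration, and the list of void tags is
a deliberately short simplification.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag (illustrative, incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NestedTextParser(HTMLParser):
    """Record each piece of text with the path of tags it was found in."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.found = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Void tags have no closing tag, so they must not stay on the stack.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.append(("/".join(self.stack), text))


def parse_paths(html):
    """Return a list of (tag path, text) pairs for the given HTML string."""
    parser = NestedTextParser()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return parser.found


if __name__ == "__main__":
    sample = ("<html><head><title>Demo</title></head>"
              "<body><ul><li>one</li><li>two <b>bold</b></li></ul></body></html>")
    for path, text in parse_paths(sample):
        print(path, "->", text)
```

The stack plays the same role as self.state in the title example, except
that it remembers every enclosing tag, so handle_data can tell text in
ul/li apart from text in ul/li/b.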

HTH.
--
-------------------------------------------------------------------------
Miki Tebeka <miki.tebeka at zoran.com>
The only difference between children and adults is the price of the toys.



