Help with parsing web page

RiGGa rigga at hasnomail.com
Sat Jun 19 04:01:03 EDT 2004


RiGGa wrote:

> Miki Tebeka wrote:
> 
>> Hello RiGGa,
>> 
>>> Anyone? I have found out I can use sgmllib, but I find the documentation
>>> is not that clear. If anyone knows of a tutorial or howto, it would be
>>> appreciated.
>> I'm not an expert but this is how I work:
>> 
>> You make a subclass of HTMLParser and override the callback functions.
>> Usually I use only start_<TAG>, end_<TAG> and handle_data.
>> Since you don't know *when* each callback function is called you need to
>> keep an internal state. It can be a simple variable or a stack if you
>> want to deal with nested tags.
>> 
>> A short example:
>> #!/usr/bin/env python
>> 
>> from htmllib import HTMLParser
>> from formatter import NullFormatter
>> 
>> class TitleParser(HTMLParser):
>>     def __init__(self):
>>         HTMLParser.__init__(self, NullFormatter())
>>         self.state = ""
>>         self.data = ""
>>     
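>>     def start_title(self, attrs):
>>         # Entering <title>: start collecting text from scratch.
>>         self.state = "title"
>>         self.data = ""
>> 
>>     def end_title(self):
>>         # Leaving <title>: reset the state so later text is not appended.
>>         self.state = ""
>>         print "Title:", self.data.strip()
>> 
>>     def handle_data(self, data):
>>         # Only keep text that arrives while inside <title>.
>>         if self.state:
>>             self.data += data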
>> 
>> if __name__ == "__main__":
>>     from sys import argv
>> 
>>     parser = TitleParser()
>>     parser.feed(open(argv[1]).read())
>> 
>> HTH.
>> --
>> -------------------------------------------------------------------------
>> Miki Tebeka <miki.tebeka at zoran.com>
>> The only difference between children and adults is the price of the toys.
> Thanks for taking the time to help, it's appreciated. I am new to Python, so
> I am a little confused by what you have posted; however, I will go through it
> again and see if it makes more sense.
> 
> Many thanks
> 
> Rigga
Said I would be back :)

How do I get my current position (offset) in the file while parsing?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R
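
For reference, a minimal sketch of reading the parser position. It assumes a
current Python 3, where html.parser.HTMLParser stands in for the htmllib and
sgmllib classes used in this thread (those modules are not available there).
getpos() is a method on the parser instance, called with empty parentheses,
and it returns a (line, column) pair rather than a single offset. The names
TitlePositionParser and line_col_to_offset are hypothetical, chosen only for
this illustration.

```python
from html.parser import HTMLParser


class TitlePositionParser(HTMLParser):
    """Collect the <title> text and the position where the tag started."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.title_pos = None  # (line, column) of the <title> start tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            # getpos() takes no arguments and returns (line, column);
            # the line is 1-based and the column is 0-based.
            self.title_pos = self.getpos()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that arrives while inside <title>.
        if self.in_title:
            self.title += data


def line_col_to_offset(text, line, col):
    """Convert a 1-based line and 0-based column to a character offset."""
    lines = text.splitlines(keepends=True)
    return sum(len(chunk) for chunk in lines[:line - 1]) + col


# Illustrative input; any HTML string would do.
sample = "<html>\n<head>\n  <title>Example page</title>\n</head>\n</html>\n"

parser = TitlePositionParser()
parser.feed(sample)
parser.close()

line, col = parser.title_pos
offset = line_col_to_offset(sample, line, col)
print(parser.title.strip(), (line, col), offset)
```

Since getpos() reports a line and a column, the helper above is one way to
turn that pair into a single offset into the text that was fed to the parser.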


