HTMLParser handle_starttag misses lots of tags!

Peter Otten __peter__ at web.de
Sat Nov 22 10:01:05 EST 2003


Matthew Wilson wrote:

> I want to parse an html file and extract my router's IP address.  I
> wrote this code and I have python 2.3 installed:
> 
> #! /usr/bin/env python
> 
> import HTMLParser
> 
> class HP(HTMLParser.HTMLParser):
> 
>     def handle_starttag(self, tag, data):
>         print "tag is %s." % (tag)
> 
>     def handle_comment(self, data):
>         print "caught a comment: %s." % (data)
> 
>     def handle_data(self, data):
>         if "IP" in data:
>             print "Caught %s." % data
> 
> hp = HP()
> out = open('routerstatus.html')
> for line in out:
>     hp.feed(line)
> 
> 
> I figured that when I ran this on the html code at the bottom of this
> file, it would print every tag, but instead, this is what I got:
> 
> tag is html.
> tag is head.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is title.
> tag is link.
> tag is script.
> tag is body.
> tag is form.
> 
> The program seems to take a vacation after the opening form tag.  What
> am I doing wrong?
> 

Nothing is wrong with your code, but your input file is not valid HTML,
which seems to confuse the parser. I recommend running it through tidy
before you feed it to the parser.
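
For what it's worth, here is a minimal sketch of that workflow, not a
definitive fix. It assumes a current Python 3, where the module is
html.parser rather than the HTMLParser module of Python 2.3, and it
assumes the tidy command-line tool may or may not be on your PATH. The
names tidy_if_available, TagLogger, parse_html and the SAMPLE markup are
all made up for illustration; SAMPLE only stands in for the router page.

```python
import shutil
import subprocess
from html.parser import HTMLParser  # Python 2 spelling: import HTMLParser


def tidy_if_available(text):
    """Return text cleaned by the external tidy tool, or text unchanged."""
    if shutil.which("tidy") is None:
        return text  # assumption: fall back silently when tidy is missing
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml"],  # quiet mode, emit well-formed XHTML
        input=text,
        capture_output=True,
        text=True,
    )
    # tidy exits non-zero on mere warnings, so judge by the output instead.
    return result.stdout or text


class TagLogger(HTMLParser):
    """Record every start tag, plus text chunks that mention "IP"."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.ip_chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if "IP" in data:
            self.ip_chunks.append(data.strip())


def parse_html(text):
    parser = TagLogger()
    parser.feed(text)
    parser.close()  # force processing of anything still buffered
    return parser


# Hypothetical stand-in for the router status page.
SAMPLE = (
    "<html><body><form><table><tr>"
    "<td>IP Address: 192.168.0.1</td>"
    "</tr></table></form></body></html>"
)

if __name__ == "__main__":
    parser = parse_html(tidy_if_available(SAMPLE))
    print(parser.tags)
    print(parser.ip_chunks)
```

Note the close() call after the last feed(): it tells the parser that no
more input is coming, so it has to deal with whatever it was still
holding back instead of waiting silently.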

Peter



