HTMLParser handler_starttag misses lots of tags!
Peter Otten
__peter__ at web.de
Sat Nov 22 10:01:05 EST 2003
Matthew Wilson wrote:
> I want to parse an html file and extract my router's IP address. I
> wrote this code and I have python 2.3 installed:
>
> #! /usr/bin/env python
>
> import HTMLParser
>
> class HP(HTMLParser.HTMLParser):
>
>     def handle_starttag(self, tag, data):
>         print "tag is %s." % (tag)
>
>     def handle_comment(self, data):
>         print "caught a comment: %s." % (data)
>
>     def handle_data(self, data):
>         if "IP" in data:
>             print "Caught %s." % data
>
> hp = HP()
> out = open('routerstatus.html')
> for line in out:
>     hp.feed(line)
>
>
> I figured that when I ran this on the html code at the bottom of this
> file, it would print every tag, but instead, this is what I got:
>
> tag is html.
> tag is head.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is title.
> tag is link.
> tag is script.
> tag is body.
> tag is form.
>
> The program seems to take a vacation after the opening form tag. What
> am I doing wrong?
>
Nothing, but your input file is not valid HTML, and that seems to confuse
the parser. I recommend running it through tidy before you feed it to the
parser.
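
Here is a rough sketch of that tidy-then-parse idea, in case it helps. It
makes several assumptions: it is written for Python 3 (html.parser rather
than the Python 2.3 HTMLParser module used above), it assumes the tidy
command-line tool is on your PATH, and the names TagCollector, tidy_html and
start_tags are made up for illustration. If tidy is not installed, the
markup is passed through unchanged.

```python
import shutil
import subprocess
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tidy_html(markup):
    """Return markup cleaned by tidy, or unchanged if tidy is unavailable."""
    exe = shutil.which("tidy")  # assumption: tidy is installed and on PATH
    if exe is None:
        return markup
    # tidy reads stdin when no file is given. It exits with 1 on warnings
    # and 2 on errors, so a nonzero status is not treated as fatal here;
    # if nothing was written to stdout, fall back to the original markup.
    result = subprocess.run(
        [exe, "-q", "-asxhtml"],
        input=markup,
        capture_output=True,
        text=True,
    )
    return result.stdout or markup


def start_tags(markup):
    """Feed the whole document at once and return the start tags seen."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()  # flush anything the parser is still buffering
    return parser.tags
```

With those helpers, something like
start_tags(tidy_html(open('routerstatus.html').read())) should list the tags
in the cleaned-up document, so you can compare it with what the parser
reports for the raw file.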
Peter