HTMLParser handle_starttag misses lots of tags!

Sat Nov 22 09:06:11 EST 2003

In article <mailman.985.1069473518.702.python-list at python.org>, Robert Brewer wrote:
> When I try it like this:
> 
> import HTMLParser
> 
> class HP(HTMLParser.HTMLParser):
> 
>     def handle_starttag(self, tag, data):
>         print "tag is %s." % (tag)
> 
>     def handle_comment(self, data):
>         print "caught a comment: %s." % (data)
> 
>     def handle_data(self, data):
>         if "IP" in data:
>             print "Caught %s." % data
> 
> toParse = """
> <html>
> 
> <head>
> 	<meta http-equiv="content-type"
> content="text/html;charset=ISO-8859-1">
> 	<meta name="generator" content="Adobe GoLive 5">
> ------- 8< Much html snipped here for this email ---------
> </TABLE>
> </form>
> </body>
> 
> </html>"""
> 
> for line in toParse.split(u'\n'):
>     HP().feed(line)
> 
> 
> 
> I get:
> 
> tag is html.
> tag is head.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is title.
> tag is link.
> tag is script.
> tag is body.
> tag is form.
> tag is table.
> tag is tr.
> tag is td.
> tag is h1.
> caught a comment:  RULE //.
> tag is tr.
> tag is td.
> tag is img.
> caught a comment:  END RULE //.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> caught a comment:  RULE //.
> tag is tr.
> tag is td.
> tag is img.
> caught a comment:  END RULE //.
> tag is tr.
> tag is td.
> tag is span.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Address .
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Subnet Mask .
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> caught a comment:  RULE //.
> tag is tr.
> tag is td.
> tag is img.
> caught a comment:  END RULE //.
> tag is tr.
> tag is td.
> tag is span.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Address .
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Subnet Mask .
> tag is td.
> tag is table.
> tag is tr.
> tag is td.
> tag is img.
> tag is tr.
> tag is td.
> tag is span.
> tag is table.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is table.
> tag is td.
> tag is b.
> tag is td.
> tag is td.
> tag is b.
> tag is td.
> tag is td.
> tag is b.
> tag is td.
> tag is table.
> tag is tr.
> tag is td.
> tag is img.
> tag is tr.
> tag is td.
> tag is input.
> tag is input.
> 
> My guess is the problem lies in your line-separation logic, not
> HTMLParser. IIRC, open() doesn't split by line automatically. Note that
> this doesn't answer the entire question. My guess is that HTMLParser,
> once it encounters the form tag, treats everything inside that form tag
> (even other tags) as data to be consumed by handle_data(). Once it
> encounters the closing form tag, it might stop. Either re-feed() that or
> get the line-splitting right.
> 
> Just some out-loud thoughts.
> 
> 
> Robert Brewer
> MIS
> Amor Ministries
> fumanchu at amor.org
>

Thanks for the help. I tried running the program exactly as you wrote it,
and I still get the same results. I also tried feeding the whole file
at once:

hp = HP()
hp.feed(infile.read())

and this gave me the same results.
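
For comparison, here is a minimal sketch of the two feeding strategies
discussed above: one parser instance fed the whole document, versus a
fresh parser per line as in the quoted loop. It is written for Python 3,
where the module is html.parser rather than HTMLParser; the names
TagCollector, collect_tags, collect_tags_per_line and the sample markup
are made up for illustration. It only shows how the feeding strategy
affects tags that span lines, and does not claim to explain the missing
tags in the original page.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(html_text):
    """Feed the whole document to a single parser instance."""
    parser = TagCollector()
    parser.feed(html_text)
    parser.close()  # force processing of any data still buffered
    return parser.tags


def collect_tags_per_line(html_text):
    """Create a new parser for every line, as the quoted loop does."""
    tags = []
    for line in html_text.split("\n"):
        parser = TagCollector()  # buffered state from earlier lines is lost
        parser.feed(line)
        tags.extend(parser.tags)
    return tags


# Illustrative markup only; the meta tag deliberately spans two lines.
sample = """<html>
<head>
<meta http-equiv="content-type"
 content="text/html;charset=ISO-8859-1">
</head>
<body><form><table><tr><td><b>IP Address</b></td></tr></table></form></body>
</html>"""

if __name__ == "__main__":
    print(collect_tags(sample))
    print(collect_tags_per_line(sample))
```

In this sketch the single-instance version reports the meta tag, while
the per-line version does not, because each new parser only ever sees
half of the tag. Tags that fit on one line are reported either way.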