HTMLParser handle_starttag misses lots of tags!
Matthew Wilson
mwilson at sarcastic-horse.com
Sat Nov 22 09:06:11 EST 2003
In article <mailman.985.1069473518.702.python-list at python.org>, Robert Brewer wrote:
> When I try it like this:
>
> import HTMLParser
>
> class HP(HTMLParser.HTMLParser):
>
>     def handle_starttag(self, tag, data):
>         print "tag is %s." % (tag)
>
>     def handle_comment(self, data):
>         print "caught a comment: %s." % (data)
>
>     def handle_data(self, data):
>         if "IP" in data:
>             print "Caught %s." % data
>
> toParse = """
><html>
>
><head>
> <meta http-equiv="content-type"
> content="text/html;charset=ISO-8859-1">
> <meta name="generator" content="Adobe GoLive 5">
> ------- 8< Much html snipped here for this email ---------
></TABLE>
></form>
></body>
>
></html>"""
>
> for line in toParse.split(u'\n'):
>     HP().feed(line)
>
>
>
> I get:
>
> tag is html.
> tag is head.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is meta.
> tag is title.
> tag is link.
> tag is script.
> tag is body.
> tag is form.
> tag is table.
> tag is tr.
> tag is td.
> tag is h1.
> caught a comment: RULE //.
> tag is tr.
> tag is td.
> tag is img.
> caught a comment: END RULE //.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> caught a comment: RULE //.
> tag is tr.
> tag is td.
> tag is img.
> caught a comment: END RULE //.
> tag is tr.
> tag is td.
> tag is span.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Address .
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Subnet Mask .
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> caught a comment: RULE //.
> tag is tr.
> tag is td.
> tag is img.
> caught a comment: END RULE //.
> tag is tr.
> tag is td.
> tag is span.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Address .
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is tr.
> tag is td.
> tag is b.
> Caught IP Subnet Mask .
> tag is td.
> tag is table.
> tag is tr.
> tag is td.
> tag is img.
> tag is tr.
> tag is td.
> tag is span.
> tag is table.
> tag is tr.
> tag is td.
> tag is b.
> tag is td.
> tag is table.
> tag is td.
> tag is b.
> tag is td.
> tag is td.
> tag is b.
> tag is td.
> tag is td.
> tag is b.
> tag is td.
> tag is table.
> tag is tr.
> tag is td.
> tag is img.
> tag is tr.
> tag is td.
> tag is input.
> tag is input.
>
> My guess is the problem lies in your line-separation logic, not
> HTMLParser. IIRC, open() doesn't split by line automatically. Note that
> this doesn't answer the entire question. My guess is that HTMLParser,
> once it encounters the form tag, treats everything inside that form tag
> (even other tags) as data to be consumed by handle_data(). Once it
> encounters the closing form tag, it might stop. Either re-feed() that or
> get the line-splitting right.
>
> Just some out-loud thoughts.
>
>
> Robert Brewer
> MIS
> Amor Ministries
> fumanchu at amor.org
>
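The line-separation point in the quoted message can be checked on its own. Below is a minimal sketch, assuming Python 3's html.parser module in place of the Python 2 HTMLParser module used above. The TagCollector class, the two helper functions and the sample document are made up for illustration. It compares a fresh parser per line (as in the quoted loop) with a single parser that sees the whole text, using a meta tag that spans two lines like the one in the quoted page.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: appends every start tag name to a shared list."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def handle_starttag(self, tag, attrs):
        self.sink.append(tag)


# Made-up sample; the meta tag is deliberately split across two lines.
DOC = (
    "<html>\n"
    "<head>\n"
    '<meta http-equiv="content-type"\n'
    ' content="text/html;charset=ISO-8859-1">\n'
    "</head>\n"
    "<body><p>IP Address</p></body>\n"
    "</html>"
)


def tags_fresh_parser_per_line(doc):
    """Mimic the quoted loop: a brand-new parser for every line."""
    seen = []
    for line in doc.split("\n"):
        TagCollector(seen).feed(line)
    return seen


def tags_single_parser(doc):
    """One parser instance sees the whole document, then close() flushes it."""
    seen = []
    parser = TagCollector(seen)
    parser.feed(doc)
    parser.close()
    return seen


if __name__ == "__main__":
    print("fresh parser per line:", tags_fresh_parser_per_line(DOC))
    print("single parser:        ", tags_single_parser(DOC))
```

With a fresh parser per line, the two-line meta tag is never seen as a whole, so it is not reported; a single parser does report it. This only covers tags split across lines and may not explain every missing tag.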
Thanks for the help. I tried running the program just as you wrote it,
and I still get the same results. I also tried feeding the whole file
at once:

    hp = HP()
    hp.feed(infile.read())

and this gave me the same results.
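
One way to narrow down where tags stop being reported is to record the parser position at each start tag. Below is a rough sketch, assuming Python 3's html.parser rather than the Python 2 module. PositionTracker, last_tag_position and the sample text are hypothetical names, not anything from the original program. It feeds the whole text to one parser, calls close() so any buffered input is processed, and keeps the getpos() value of the last start tag seen.

```python
from html.parser import HTMLParser


class PositionTracker(HTMLParser):
    """Hypothetical helper: remembers each start tag and where the last one was."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.last_pos = None  # (line number, column offset) of the last start tag

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self.last_pos = self.getpos()


def last_tag_position(text):
    """Parse text with a single parser and report the tags and last position."""
    parser = PositionTracker()
    parser.feed(text)
    parser.close()  # force processing of anything still buffered
    return parser.tags, parser.last_pos


# Made-up sample with a tag nested inside a form.
SAMPLE = (
    "<html>\n"
    "<body>\n"
    "<form>\n"
    "<input name='a'>\n"
    "</form>\n"
    "</body>\n"
    "</html>"
)

if __name__ == "__main__":
    tags, pos = last_tag_position(SAMPLE)
    print("start tags seen:", tags)
    print("last start tag at (line, column):", pos)
```

If the last reported position falls well short of the end of the file, the markup just after that line is the place to look. In this small sample the input tag inside the form is reported like any other; the 2003-era Python 2 module may behave differently.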