HTMLParser handle_starttag misses lots of tags!

Matthew Wilson mwilson at sarcastic-horse.com
Fri Nov 21 21:44:04 EST 2003


I want to parse an HTML file and extract my router's IP address.  I
wrote this code, and I have Python 2.3 installed:

#! /usr/bin/env python

import HTMLParser

class HP(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        print "tag is %s." % (tag)

    def handle_comment(self, data):
        print "caught a comment: %s." % (data)

    def handle_data(self, data):
        # Report any text node that mentions "IP".
        if "IP" in data:
            print "Caught %s." % data

hp = HP()
infile = open('routerstatus.html')
for line in infile:
    hp.feed(line)
infile.close()
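
As a side note on the handler signature: the second argument passed to
handle_starttag is the tag's attribute list, not text data.  A minimal
sketch of that, written for Python 3's html.parser module (an assumption;
the script above targets the Python 2.3 HTMLParser module).  The class name
AttrLogger is hypothetical and only serves the illustration:

```python
from html.parser import HTMLParser


class AttrLogger(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased, whatever case the page used.
        self.seen.append((tag, attrs))


logger = AttrLogger()
logger.feed("<META http-equiv='Refresh' CONTENT='20'>")
logger.close()
for tag, attrs in logger.seen:
    print("tag is %s, attrs are %r" % (tag, attrs))
```

The same pairs could be searched for a particular attribute, for example
with dict(attrs).get("content").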


I figured that when I ran this on the HTML at the bottom of this
message, it would print every tag, but instead this is what I got:

tag is html.
tag is head.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is title.
tag is link.
tag is script.
tag is body.
tag is form.

The program seems to take a vacation after the opening form tag: no
later tag is printed, and handle_data never reports the "IP Address"
text.  What am I doing wrong?

Finally, this is the HTML I am trying to parse:



<html>

<head>
	<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
	<meta name="generator" content="Adobe GoLive 5">
	<META http-equiv='Pragma' CONTENT='no-cache'>
	<META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">	
	<META http-equiv='Refresh' CONTENT='20'>
	<title>router form</title> 
	<link rel="stylesheet" href="form.css">
<script language="javascript" type="text/javascript">
<!-- hide script from old browsers
function loadhelp(num) {

parent.helpframe.document.location.href="help/help"+num+".html"

	}
function newwindow(F)
{
if((F.status.value =="checked")||(F.EncapPTelstra.value=="checked")||(F.EncapAolDhcp.value=="checked"))
	window.open('enatherstatus.htm', 'enstatherstatus', 'width=380,height=450,status=yes');
else if((F.EncapPPTP.value =="checked"))
	window.open('pptpstatus.htm', 'pptpstatus', 'width=380,height=320,status=yes');
else
  window.open('pppoestatus.htm', 'pppoestatus', 'width=380,height=320,status=yes');

  

}
//-->
</script> 
</head>
<body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="loadhelp('_SysStatus')">
<form method="POST">
<input type=hidden name=status value=>
<input type="hidden" name=EncapPTelstra  value=>
<input type="hidden" name=EncapPPTP  value=>
<input type="hidden" name=EncapAolDhcp  value=>
	<table border="0" cellpadding="0" cellspacing="3" width="100%">
		<tr>
			<td colspan="2">
				<h1>Router Status</h1> 
			</td>
		</tr>
<!-- RULE //-->
		<tr>
			<td colspan="2">
				<img src="img/liteblue.gif" width="100%" height="2" border="0"> 
			</td>
		</tr>
<!-- END RULE //-->
		<tr>
			<td width="60%">
				<b>Account Name</b> 
			</td>
			<td width="40%">
				
			</td>
		</tr>
		
		<tr>
			<td width="60%">
				<b>Firmware Version </b> 
			</td>
			<td width="40%">
				4.13 Aug 20 2003
			</td>
		</tr>
		
<!-- RULE //-->
		<tr>
			<td colspan="2">
				<img src="img/liteblue.gif" width="100%" height="2" border="0"> 
			</td>
		</tr>
<!-- END RULE //-->
		<tr>
			<td colspan="2">
				<span class="subhead">Internet Port </span> 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>MAC Address </b> 
			</td>
			<td width="40%">
				00:09:5b:29:3d:b4
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>IP Address </b> 
			</td>
			<td width="40%">
				66.72.206.129 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>DHCP </b> 
			</td>
			<td width="40%">
				None 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>IP Subnet Mask </b> 
			</td>
			<td width="40%">
				None 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>Domain Name Server</b>
			</td>
			<td width="40%">
				66.73.20.40 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b></b> 
			</td>
			<td width="40%">
				206.141.193.55 
			</td>
		</tr>
<!-- RULE //-->
		<tr>
			<td colspan="2">
				<img src="img/liteblue.gif" width="100%" height="2" border="0"> 
			</td>
		</tr>
<!-- END RULE //-->
		<tr>
			<td colspan="2">
				<span class="subhead">LAN Port </span> 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>MAC Address </b> 
			</td>
			<td width="40%">
				00:09:5b:29:3d:b3
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>IP Address </b> 
			</td>
			<td width="40%">
				192.168.0.1 
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>DHCP </b> 
			</td>
			<td width="40%">
				Server  
			</td>
		</tr>
		<tr>
			<td width="60%">
				<b>IP Subnet Mask </b> 
			</td>
			<td width="40%">
				255.255.255.0  
			</td>
		</tr>

	</table>
	<TABLE border=0 width="100%">
   <tr width="100%">
			<td>
				<img src="img/liteblue.gif" width="100%" height="2" border="0"> 
			</td>
  </tr>  
  <TR width="100%">
    <TD>
    
    <span class="subhead">Wireless Port </span>
    </TD>
  </TR>
 
  </TABLE>

<TABLE width="100%" border=0>
  
  <TR>
    <TD width="60%"><b>MAC Address 	 
      (BSSID) </b></TD>
    <TD width="40%">00:09:5b:29:3d:b3</TD></TR>
  </table>  
 <TABLE width="100%" cellSpacing=2 border=0>   
     <TD width="60%"><b>Name (SSID)</b></TD>
    <TD width="40%">natchieland</TD></tr>
    <TD width="60%"><b>Region</b></TD>
    <TD width="40%">USA</TD></tr>
    <TD width="60%"><b>Channel</b></TD>
    <TD width="40%">1</TD></tr>

</table>
 <TABLE width="100%" cellSpacing=2 border=0>

		<tr>
			<td colspan="2">
				<img src="img/liteblue.gif" width="100%" height="2" border="0"> 
			</td>
		</tr>

		<tr>
			<td align='center'>
				<input type="BUTTON" value="Show Statistics" onclick="window.open('mtenSysStatistics.htm','static','width=500,height=200,status=yes, resizable=yes');"> 
			    <INPUT onclick="newwindow(this.form);" type=button value="Connection Status"> 
            </TD>
		</tr>
</TABLE>
</form>
</body>

</html>
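
For reference, a minimal sketch of the extraction this page is meant to
feed: pair each bold label with the text of the table cell that follows it,
grouped under the nearest "subhead" heading.  It assumes Python 3's
html.parser module rather than the Python 2.3 HTMLParser module, and it runs
on a shortened excerpt of the page above, not the full file.  The names
RouterStatusParser and SAMPLE are hypothetical, and the sketch is not a
diagnosis of the stall described earlier:

```python
from html.parser import HTMLParser

# Shortened excerpt of the router page, kept to the rows that matter here.
SAMPLE = """
<form method="POST">
<input type=hidden name=status value=>
<table>
<tr><td colspan="2"><span class="subhead">Internet Port </span></td></tr>
<tr><td width="60%"><b>IP Address </b></td><td width="40%">66.72.206.129 </td></tr>
<tr><td colspan="2"><span class="subhead">LAN Port </span></td></tr>
<tr><td width="60%"><b>IP Address </b></td><td width="40%">192.168.0.1 </td></tr>
</table>
</form>
"""


class RouterStatusParser(HTMLParser):
    """Hypothetical helper: maps (section, label) to the value cell text."""

    def __init__(self):
        super().__init__()
        self.section = None   # text of the most recent "subhead" span
        self.label = None     # text of the most recent <b> label
        self.capture = None   # kind of text currently being read, if any
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "subhead") in attrs:
            self.capture = "section"
        elif tag == "b":
            self.capture = "label"
        elif tag == "td" and self.label is not None:
            # The cell after a label cell holds that label's value.
            self.capture = "value"

    def handle_endtag(self, tag):
        if tag in ("span", "b", "td"):
            self.capture = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.capture is None:
            return
        if self.capture == "section":
            self.section = text
        elif self.capture == "label":
            self.label = text
        else:
            self.values[(self.section, self.label)] = text
            self.label = None


parser = RouterStatusParser()
parser.feed(SAMPLE)
parser.close()
print(parser.values[("Internet Port", "IP Address")])
```

Keying the result on both the section and the label keeps the Internet
port address apart from the LAN port address, since both rows use the label
"IP Address".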



