[Tutor] HTML Parser woes

Alan Gauld alan.gauld at btinternet.com
Tue Mar 4 17:26:01 CET 2014


My turn to ask a question.
This has me pulling my hair out. Hopefully it's something obvious...

I'm trying to pull some dates out of an HTML web page generated
from an Excel spreadsheet.

I've simplified things somewhat, so the file (sample.htm) looks like:

<html>
<body link=blue vlink=purple>

<table border=0 cellpadding=0 cellspacing=0 width=752
 style='border-collapse:collapse;table-layout:fixed;width:564pt'>
  <tr class=xl66 height=21 style='height:15.75pt'>
   <td height=21 class=xl66 width=64 style='height:15.75pt;width:48pt'>ItemID</td>
   <td class=xl66 width=115 style='width:86pt'>Name</td>
   <td class=xl66 width=99 style='width:74pt'>DateLent</td>
   <td class=xl66 width=121 style='width:91pt'>DateReturned</td>
  </tr>
  <tr height=20 style='height:15.0pt'>
   <td height=20 align=right style='height:15.0pt'>1</td>
   <td>LawnMower</td>
   <td>Small Hover mower</td>
   <td>Fred</td>
   <td>Joe</td>
   <td class=xl65 align=right>4/1/2012</td>
   <td class=xl65 align=right>4/26/2012</td>
  </tr>
</table>
</body>
</html>

The code looks like:

import html.parser

class SampleParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.isDate = False

    def handle_starttag(self, name, attributes):
        if name == 'td':
            for key, value in attributes:
                if key == 'class':
                    print('Class Value: ', repr(value))
                    if value.endswith('165'):
                        print('We got a date')
                        self.isDate = True
                    break

    def handle_endtag(self, name):
        self.isDate = False

    def handle_data(self, data):
        if self.isDate:
            print('Date: ', data)

if __name__ == '__main__':
    print('start test')
    with open('sample.htm') as f:
        htm = f.read()
    parser = SampleParser()
    parser.feed(htm)
    print('end test')

And the output looks like:

start test
Class Value:  'xl66'
Class Value:  'xl66'
Class Value:  'xl66'
Class Value:  'xl66'
Class Value:  'xl65'
Class Value:  'xl65'
end test

As you can see, I'm picking up the class attribute and
its value, but the conditional test for x165 is failing.

I've tried

if value == 'x165'
if 'x165' in value

and every other test I can think of.

Why am I not seeing the "We got a date" message?
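
One likely explanation, offered as a sketch rather than a confirmed
diagnosis: the output above prints the class value as 'xl65' (x, lowercase
letter L, 6, 5), while the tests look for '165' and 'x165' with the digit
one. The two are easy to confuse in many fonts. The example below still
uses only html.parser; the class name DateParser, the dates list and the
inline SAMPLE string are made up for illustration and stand in for
sample.htm.

```python
import html.parser

# Cut-down stand-in for the date cells in sample.htm (illustrative only).
SAMPLE = """<table>
 <tr>
  <td>Joe</td>
  <td class=xl65 align=right>4/1/2012</td>
  <td class=xl65 align=right>4/26/2012</td>
 </tr>
</table>"""


class DateParser(html.parser.HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.is_date = False
        self.dates = []

    def handle_starttag(self, name, attributes):
        if name == 'td':
            # Compare against 'xl65' spelled with a lowercase letter L,
            # not with the digit one.
            self.is_date = dict(attributes).get('class') == 'xl65'

    def handle_endtag(self, name):
        self.is_date = False

    def handle_data(self, data):
        if self.is_date:
            self.dates.append(data)


parser = DateParser()
parser.feed(SAMPLE)
print(parser.dates)

# The same string tested both ways: digit one first, then letter L.
print('xl65'.endswith('165'), 'xl65'.endswith('l65'))
```

If that is the cause, changing the literal in the original test to end in
'l65' (letter L) should be enough to make the "We got a date" message
appear.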

PS.
Please don't suggest other modules/packages etc,
I'm using html.parser for a reason.

Frustratedly,
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos


