[Tutor] HTML Parser woes

Mark Lawrence breamoreboy at yahoo.co.uk
Tue Mar 4 21:08:19 CET 2014


On 04/03/2014 16:26, Alan Gauld wrote:
> My turn to ask a question.
> This has me pulling my hair out. Hopefully it's something obvious...
>
> I'm trying to pull some dates out of an HTML web page generated
> from an Excel spreadsheet.
>
> I've simplified things somewhat so the file (sample.htm) looks like:
>
> <html>
> <body link=blue vlink=purple>
>
> <table border=0 cellpadding=0 cellspacing=0 width=752
> style='border-collapse:
>   collapse;table-layout:fixed;width:564pt'>
>   <tr class=xl66 height=21 style='height:15.75pt'>
>    <td height=21 class=xl66 width=64
> style='height:15.75pt;width:48pt'>ItemID</td>
>    <td class=xl66 width=115 style='width:86pt'>Name</td>
>    <td class=xl66 width=99 style='width:74pt'>DateLent</td>
>    <td class=xl66 width=121 style='width:91pt'>DateReturned</td>
>   </tr>
>   <tr height=20 style='height:15.0pt'>
>    <td height=20 align=right style='height:15.0pt'>1</td>
>    <td>LawnMower</td>
>    <td>Small Hover mower</td>
>    <td>Fred</td>
>    <td>Joe</td>
>    <td class=xl65 align=right>4/1/2012</td>
>    <td class=xl65 align=right>4/26/2012</td>
>   </tr>
> </table>
> </body>
> </html>
>
> The code looks like:
>
> import html.parser
>
> class SampleParser(html.parser.HTMLParser):
>      def __init__(self):
>          super().__init__()
>          self.isDate = False
>
>      def handle_starttag(self, name, attributes):
>          if name == 'td':
>              for key, value in attributes:
>                  if key == 'class':
>                     print ('Class Value: ',repr(value))
>                     if value.endswith('165'):
>                        print ('We got a date')
>                        self.isDate = True
>                     break
>
>      def handle_endtag(self,name):
>          self.isDate = False
>
>      def handle_data(self, data):
>          if self.isDate:
>              print('Date: ', data)
>
> if __name__ == '__main__':
>      print('start test')
>      htm = open('sample.htm').read()
>      parser = SampleParser()
>      parser.feed(htm)
>      print('end test')
>
> And the output looks like:
>
> start test
> Class Value:  'xl66'
> Class Value:  'xl66'
> Class Value:  'xl66'
> Class Value:  'xl66'
> Class Value:  'xl65'
> Class Value:  'xl65'
> end test
>
> As you can see I'm picking up the class attribute and
> its value but the conditional test for x165 is failing.
>
> I've tried
>
> if value == 'x165'
> if 'x165' in value
>
> and every other test I can think of.
>
> Why am I not seeing the "We got a date" message?
>
> PS.
> Please don't suggest other modules/packages etc,
> I'm using html.parser for a reason.
>
> Frustratedly,

Steven has pointed out the symptoms.  The cause: the class value is 'xl65'
(x, lowercase L, 6, 5) but the code tests for '165' (digit one).  Should
have gone to Specsavers. :)
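
A minimal sketch of the corrected test, assuming the class value really is
'xl65' as the parser output above reports. DateParser, the dates list and
the inline SAMPLE string are illustrative names, not from the original post:

```python
import html.parser

# Inline stand-in for sample.htm so the sketch runs on its own (assumption:
# trimmed from the sample in the question, same unquoted class attributes).
SAMPLE = """<table>
 <tr><td class=xl66>DateLent</td><td class=xl66>DateReturned</td></tr>
 <tr><td>LawnMower</td>
  <td class=xl65 align=right>4/1/2012</td>
  <td class=xl65 align=right>4/26/2012</td></tr>
</table>"""


class DateParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_date = False
        self.dates = []

    def handle_starttag(self, name, attributes):
        if name == 'td':
            # 'xl65' is x, lowercase L, 6, 5; not x, one, six, five
            self.is_date = dict(attributes).get('class') == 'xl65'

    def handle_endtag(self, name):
        # Any closing tag ends the current date cell
        self.is_date = False

    def handle_data(self, data):
        if self.is_date:
            self.dates.append(data)


parser = DateParser()
parser.feed(SAMPLE)
parser.close()
print(parser.dates)   # ['4/1/2012', '4/26/2012']
```

Copying the value straight from the repr() output into the comparison,
rather than retyping it, avoids the l/1 mix-up altogether.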

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence
