[Tutor] HTML Parser woes
Mark Lawrence
breamoreboy at yahoo.co.uk
Tue Mar 4 21:08:19 CET 2014
On 04/03/2014 16:26, Alan Gauld wrote:
> My turn to ask a question.
> This has me pulling my hair out. Hopefully it's something obvious...
>
> I'm trying to pull some dates out of an HTML web page generated
> from an Excel spreadsheet.
>
> I've simplified things somewhat so the file(sample.htm) looks like:
>
> <html>
> <body link=blue vlink=purple>
>
> <table border=0 cellpadding=0 cellspacing=0 width=752
> style='border-collapse:
> collapse;table-layout:fixed;width:564pt'>
> <tr class=xl66 height=21 style='height:15.75pt'>
> <td height=21 class=xl66 width=64
> style='height:15.75pt;width:48pt'>ItemID</td>
> <td class=xl66 width=115 style='width:86pt'>Name</td>
> <td class=xl66 width=99 style='width:74pt'>DateLent</td>
> <td class=xl66 width=121 style='width:91pt'>DateReturned</td>
> </tr>
> <tr height=20 style='height:15.0pt'>
> <td height=20 align=right style='height:15.0pt'>1</td>
> <td>LawnMower</td>
> <td>Small Hover mower</td>
> <td>Fred</td>
> <td>Joe</td>
> <td class=xl65 align=right>4/1/2012</td>
> <td class=xl65 align=right>4/26/2012</td>
> </tr>
> </table>
> </body>
> </html>
>
> The code looks like:
>
> import html.parser
>
> class SampleParser(html.parser.HTMLParser):
>     def __init__(self):
>         super().__init__()
>         self.isDate = False
>
>     def handle_starttag(self, name, attributes):
>         if name == 'td':
>             for key, value in attributes:
>                 if key == 'class':
>                     print ('Class Value: ',repr(value))
>                     if value.endswith('165'):
>                         print ('We got a date')
>                         self.isDate = True
>                     break
>
>     def handle_endtag(self,name):
>         self.isDate = False
>
>     def handle_data(self, data):
>         if self.isDate:
>             print('Date: ', data)
>
> if __name__ == '__main__':
>     print('start test')
>     htm = open('sample.htm').read()
>     parser = SampleParser()
>     parser.feed(htm)
>     print('end test')
>
> And the output looks like:
>
> start test
> Class Value: 'xl66'
> Class Value: 'xl66'
> Class Value: 'xl66'
> Class Value: 'xl66'
> Class Value: 'xl65'
> Class Value: 'xl65'
> end test
>
> As you can see I'm picking up the class attribute and
> its value but the conditional test for x165 is failing.
>
> I've tried
>
> if value == 'x165'
> if 'x165' in value
>
> and every other test I can think of.
>
> Why am I not seeing the "We got a date" message?
>
> PS.
> Please don't suggest other modules/packages etc,
> I'm using html.parser for a reason.
>
> Frustratedly,
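Steven has pointed out the symptoms. As for the cause: should have gone to
Specsavers. :) The class value in the page is 'xl65', with a lowercase
letter L, while the tests look for '165' and 'x165', with a digit one.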
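
For what it's worth, here is a minimal sketch of the parser with the
comparison spelled using the letter L. It is not Alan's exact code: the
names DateParser, is_date, dates and sample are made up for illustration,
the inline sample string is a cut-down stand-in for sample.htm, and it
collects the dates in a list rather than printing them.

```python
import html.parser


class DateParser(html.parser.HTMLParser):
    """Collect the text of <td> cells whose class attribute is 'xl65'."""

    def __init__(self):
        super().__init__()
        self.is_date = False
        self.dates = []

    def handle_starttag(self, name, attributes):
        # attributes arrives as a list of (key, value) pairs.
        # Note the lowercase letter L in 'xl65', not the digit one.
        if name == 'td' and dict(attributes).get('class') == 'xl65':
            self.is_date = True

    def handle_endtag(self, name):
        # Any closing tag ends the current date cell.
        self.is_date = False

    def handle_data(self, data):
        if self.is_date:
            self.dates.append(data)


# Hypothetical cut-down version of the table row in sample.htm.
sample = (
    "<table><tr>"
    "<td class=xl66>DateLent</td>"
    "<td class=xl65 align=right>4/1/2012</td>"
    "<td class=xl65 align=right>4/26/2012</td>"
    "</tr></table>"
)

parser = DateParser()
parser.feed(sample)
parser.close()
print(parser.dates)
```

Comparing with == rather than endswith() is a design choice here; either
works once the string itself is spelled correctly.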
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence