HTMLParser skipping HTML? [newbie]

Thu Sep 6 04:46:34 EDT 2012

On Wednesday, 5 September 2012 14:57:05 UTC+2, BobAalsma wrote:
> I'm trying to understand the HTMLParser so I've copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried that on my LinkedIn page.
> 
> No errors, but some of the tags seem to go missing for no apparent reason - any advice?
> 
> I have searched extensively for this, but seem to be the only one with missing data from HTMLParser :(
> 
> Code:
> 
> import urllib2
> from HTMLParser import HTMLParser
> 
> from GetHttpFileContents import getHttpFileContents
> 
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
> 	def handle_starttag(self, tag, attrs):
> 		print "Start tag:\n\t", tag
> 		for attr in attrs:
> 			print "\t\tattr:", attr
> 		# end for attr in attrs:
> 	#
> 	def handle_endtag(self, tag):
> 		print "End tag :\n\t", tag
> 	#
> 	def handle_data(self, data):
> 		if data != '\n\n':
> 			if data != '\n':
> 				print "Data :\t\t", data
> 			# end if 1
> 		# end if 2
> 	#
> #
> # ---------------------------------------------------------------------
> #
> def removeHtmlFromFileContents():
> 	TextOut = ''
> 
> 	parser = MyHTMLParser()
> 	parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())
> 
> 	return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
> 	TextOut = removeHtmlFromFileContents()
> 
> Part of the output:
> 
> End tag :
> 	script
> Start tag:
> 	title
> Data :		Bob Aalsma - Nederland | LinkedIn
> End tag :
> 	title
> Start tag:
> 	script
> 		attr: ('type', 'text/javascript')
> 		attr: ('src', 'http://www.linkedin.com/uas/authping?url=http%3A%2F%2Fnl%2Elinkedin%2Ecom%2Fin%2Fbobaalsma')
> End tag :
> 	script
> Start tag:
> 	link
> 		attr: ('rel', 'stylesheet')
> 		attr: ('type', 'text/css')
> 		attr: ('href', 'http://s3.licdn.com/scds/concat/common/css?h=5v4lkweptdvona6w56qelodrj-7pfvsr76gzb22ys278pbj80xm-b1io9ndljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
> Start tag:
> 	script
> 		attr: ('type', 'text/javascript')
> 		attr: ('src', 'http://s4.licdn.com/scds/concat/common/js?h=7nhn6ycbvnz80dydsu88wbuk-1kjdwxpxv0c3z97afuz9dlr9g-dlsf699o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
> End tag :
> 	script
> End tag :
> 	head
> 
> But the source text for this is as follows [and all of the "<meta ...>" tags seem to go missing]:
> 
> </script>
> <title>Bob Aalsma | LinkedIn</title>
> <link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
> <link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/common/css?h=b1io9ndljf1bvpack85gyxhv4-6qrj4gxbwq8loasfnyfmyuphe-dhog2e5h8scik4whkpqccnzou-dmo1gwj6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
> <meta name="LinkedInBookmarkType" content="profile">
> <meta name="ShortTitle" content="Bob Aalsma">
> <meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
> <meta name="UniqueID" content="24198692">
> <meta name="SaveURL" content="/profile/view?id=24198692&authType=name&authToken=KhOG">
> </head>

No offense taken, and thanks for the reminder.
My background is in software packages written in 3GLs, where different platforms mean different editors, which sometimes makes it hard to recognize the end of a block, especially a nested one.
No, there is no need for that here.
I think it also means I'm still not really satisfied with my commenting in Python...
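
For what it's worth, here is a minimal sketch of the same kind of parser that relies on indentation alone, without end-of-block comments. It is written for Python 3, where the module is called html.parser (in Python 2 it is HTMLParser, as in the code above). Instead of fetching the page it feeds a shortened copy of the head fragment quoted above; the class name TagCollector and the SAMPLE string are made up for illustration and are not part of my original script.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag and its attributes instead of printing them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.start_tags.append((tag, dict(attrs)))


# Shortened copy of the <head> fragment quoted above (illustrative sample only).
SAMPLE = """
<title>Bob Aalsma | LinkedIn</title>
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="UniqueID" content="24198692">
</head>
"""

parser = TagCollector()
parser.feed(SAMPLE)
parser.close()

# Names of all <meta> tags the parser reported
meta_names = [attrs.get("name") for tag, attrs in parser.start_tags if tag == "meta"]
print(meta_names)
```

If the meta tags are listed when the fragment is fed in directly like this, the parser is probably not the part dropping them, and it may be worth comparing what the script actually downloads with what the browser shows (the two title lines quoted above already differ).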