HTMLParser skipping HTML? [newbie]

Peter Otten __peter__ at web.de
Wed Sep 5 09:54:43 EDT 2012


BobAalsma wrote:

> I'm trying to understand the HTMLParser so I've copied some code from
> http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser
> and tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent reason
> - any advice?
> I have searched extensively for this, but seem to be the only one with
> missing data from HTMLParser :(
> 
> Code:
> import urllib2
> from HTMLParser import HTMLParser
> 
> from GetHttpFileContents import getHttpFileContents
> 
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>         def handle_starttag(self, tag, attrs):
>                 print "Start tag:\n\t", tag
>                 for attr in attrs:
>                         print "\t\tattr:", attr
>                 # end for attr in attrs:
>         #
>         def handle_endtag(self, tag):
>                 print "End tag :\n\t", tag
>         #
>         def handle_data(self, data):
>                 if data != '\n\n':
>                         if data != '\n':
>                                 print "Data :\t\t", data
>                         # end if 1
>                 # end if 2

Please no! A kitten dies every time you write one of those comments ;)
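
The indentation already shows where each block ends, so the markers add nothing. As a rough sketch (the names DataCollector and chunks are made up for illustration, and unlike your version this skips every whitespace-only chunk, not just '\n' and '\n\n'), the two nested checks could collapse into one:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this thread


class DataCollector(HTMLParser):
    """Hypothetical helper: collect non-blank text chunks."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # strip() leaves an empty (false) string for '\n', '\n\n'
        # and any other whitespace-only run, so one test is enough.
        if data.strip():
            self.chunks.append(data)


parser = DataCollector()
parser.feed("<p>Hello</p>\n\n<p>world</p>\n")
parser.close()  # flush any text still buffered by the parser
print(parser.chunks)
```

Collecting the chunks in a list rather than printing them also gives your function something real to return.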

> def removeHtmlFromFileContents():
>         TextOut = ''
> 
>         parser = MyHTMLParser()
>         parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
> 
>         return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>         TextOut = removeHtmlFromFileContents()


After removing 

> from GetHttpFileContents import getHttpFileContents

from your script I get the following output (using Python 2.7):

$ python parse_orig.py | grep meta -C2
        script
Start tag:
        meta
                attr: ('http-equiv', 'content-type')
                attr: ('content', 'text/html; charset=UTF-8')
Start tag:
        meta
                attr: ('http-equiv', 'X-UA-Compatible')
                attr: ('content', 'IE=8')
Start tag:
        meta
                attr: ('name', 'description')
                attr: ('content', 'Bekijk het (Nederland) professionele 
profiel van Bob Aalsma  op LinkedIn. LinkedIn is het grootste zakelijke 
netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne 
connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners 
vinden.')
Start tag:
        meta
                attr: ('name', 'pageImpressionID')
                attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
        meta
                attr: ('name', 'pageKey')
                attr: ('content', 'nprofile-public-success')
Start tag:
        meta
                attr: ('name', 'analyticsURL')
                attr: ('content', '/analytics/noauthtracker')
$ 

So there definitely are some meta tags. 
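
If you would rather check this from Python than with grep, something along these lines might do. It is only a sketch: MetaCollector and the SAMPLE string are invented for the example, and SAMPLE stands in for the page so that nothing has to be downloaded.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this thread


class MetaCollector(HTMLParser):
    """Hypothetical helper: remember the attributes of every <meta> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "meta":
            self.metas.append(dict(attrs))


# Stand-in for the downloaded page (an assumption, not the real markup).
SAMPLE = """<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="pageKey" content="nprofile-public-success">
<title>Example</title>
</head><body><p>Hi</p></body></html>"""

collector = MetaCollector()
collector.feed(SAMPLE)
collector.close()

for meta in collector.metas:
    print(meta)
```

To run it against the real page, pass the string you got from urlopen(...).read() to feed() instead of SAMPLE.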

Note that if you are logged in to a site, the HTML your browser is "seeing"
may differ from the HTML you retrieve via urllib2.urlopen(...).read(): the
script does not send your browser's session cookies, so the server treats it
as an anonymous visitor. Perhaps that is why you don't get what you expect.
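
One way to test that theory is to save the page from the browser ("Save as...") and compare it with what the script downloads. The sketch below counts start tags in two HTML strings and reports the tags whose counts differ. TagCounter, count_tags, tag_differences and both sample strings are made up for illustration; in practice the two strings would be the downloaded page and the file saved from the browser.

```python
from collections import Counter

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this thread


class TagCounter(HTMLParser):
    """Hypothetical helper: count how often each start tag occurs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def tag_differences(fetched_html, browser_html):
    """Map each tag whose count differs to (fetched count, browser count)."""
    fetched = count_tags(fetched_html)
    browser = count_tags(browser_html)
    diffs = {}
    for tag in set(fetched) | set(browser):
        # A Counter returns 0 for tags it has never seen.
        if fetched[tag] != browser[tag]:
            diffs[tag] = (fetched[tag], browser[tag])
    return diffs


# Invented samples: the "browser" copy has an extra logged-in-only section.
FETCHED = "<html><body><div><p>Public profile</p></div></body></html>"
BROWSER = ("<html><body><div><p>Public profile</p></div>"
           "<div><ul><li>Connections</li></ul></div></body></html>")

differences = tag_differences(FETCHED, BROWSER)
for tag in sorted(differences):
    print(tag + ": " + str(differences[tag]))
```

If the counts differ, the parser is not skipping anything; the server is simply sending the script a different page.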



