HTMLParser skipping HTML? [newbie]
Peter Otten
__peter__ at web.de
Wed Sep 5 09:54:43 EDT 2012
BobAalsma wrote:
> I'm trying to understand the HTMLParser so I've copied some code from
> http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and
> tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent reason
> - any advice?
> I have searched extensively for this, but seem to be the only one with
> missing data from HTMLParser :(
>
> Code:
> import urllib2
> from HTMLParser import HTMLParser
>
> from GetHttpFileContents import getHttpFileContents
>
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>     def handle_starttag(self, tag, attrs):
>         print "Start tag:\n\t", tag
>         for attr in attrs:
>             print "\t\tattr:", attr
>         # end for attr in attrs:
>     #
>     def handle_endtag(self, tag):
>         print "End tag :\n\t", tag
>     #
>     def handle_data(self, data):
>         if data != '\n\n':
>             if data != '\n':
>                 print "Data :\t\t", data
>             # end if 1
>         # end if 2
Please no! A kitten dies every time you write one of those comments ;)
> def removeHtmlFromFileContents():
>     TextOut = ''
>
>     parser = MyHTMLParser()
>     parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
>
>     return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>     TextOut = removeHtmlFromFileContents()
After removing
> from GetHttpFileContents import getHttpFileContents
from your script I get the following output (using Python 2.7):
$ python parse_orig.py | grep meta -C2
    script
Start tag:
    meta
        attr: ('http-equiv', 'content-type')
        attr: ('content', 'text/html; charset=UTF-8')
Start tag:
    meta
        attr: ('http-equiv', 'X-UA-Compatible')
        attr: ('content', 'IE=8')
Start tag:
    meta
        attr: ('name', 'description')
        attr: ('content', 'Bekijk het (Nederland) professionele profiel van Bob Aalsma op LinkedIn. LinkedIn is het grootste zakelijke netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners vinden.')
Start tag:
    meta
        attr: ('name', 'pageImpressionID')
        attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
    meta
        attr: ('name', 'pageKey')
        attr: ('content', 'nprofile-public-success')
Start tag:
    meta
        attr: ('name', 'analyticsURL')
        attr: ('content', '/analytics/noauthtracker')
$
So there definitely are some meta tags.
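If you would rather not grep the printed output, the parser can collect the meta tags itself. The following is only a rough sketch, written for Python 3 (where the module is html.parser; on 2.7 the import is "from HTMLParser import HTMLParser"). The class name MetaCollector and the SAMPLE string are made up for illustration; in your script the input would be the downloaded page.

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: remember the attributes of every <meta> start tag."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "meta":
            self.metas.append(dict(attrs))


# Made-up sample input, standing in for the page fetched over HTTP.
SAMPLE = (
    '<html><head>'
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
    '<meta name="pageKey" content="nprofile-public-success">'
    '</head><body><p>Hello</p></body></html>'
)

collector = MetaCollector()
collector.feed(SAMPLE)
collector.close()

for meta in collector.metas:
    print(meta)
```

Keeping the results in a list makes it easy to check afterwards whether a particular tag was seen at all.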
Note that if you're logged in to a site, the HTML the browser is "seeing"
may differ from the HTML you are retrieving via urllib2.urlopen(...).read().
Perhaps that is the reason why you don't get what you expect.
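One way to test that guess is to save the page from the browser, then compare how often each tag occurs in the two versions. Here is a minimal sketch, again assuming Python 3; TagCounter, count_tags and both HTML strings are invented for the example, so substitute the text you actually downloaded and the file you saved from the browser.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: count how often each start tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of start tags in html."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Invented stand-ins for "what the script fetched" and "what the browser saved".
fetched_html = "<html><body><p>Please sign in</p></body></html>"
browser_html = "<html><body><div><p>Profile</p><p>Details</p></div></body></html>"

fetched = count_tags(fetched_html)
browser = count_tags(browser_html)

# Counter subtraction keeps only the tags the browser copy has more of.
missing = browser - fetched
print(missing)
```

If the difference is non-empty, the tags are not being skipped by the parser; they were never in the HTML the script received.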