Ask how to use HTMLParser

h0uk vardan.pogosyan at gmail.com
Fri Jan 8 03:24:00 EST 2010


On Jan 8, 11:44, Water Lin <Water... at ymail.invalid> wrote:
> h0uk <vardan.pogos... at gmail.com> writes:
> > On Jan 8, 08:44, Water Lin <Water... at ymail.invalid> wrote:
> >> I am new to Python, and I want to parse an HTML page. I
> >> tried to use HTMLParser. Here is my sample code:
> >> ----------------------
> >> from HTMLParser import HTMLParser
> >> from urllib2 import urlopen
>
> >> class MyParser(HTMLParser):
> >>     title = ""
> >>     is_title = ""
> >>     def __init__(self, url):
> >>         HTMLParser.__init__(self)
> >>         req = urlopen(url)
> >>         self.feed(req.read())
>
> >>     def handle_starttag(self, tag, attrs):
> >>         if tag == 'div' and attrs[0][1] == 'articleTitle':
> >>             print "Found link => %s" % attrs[0][1]
> >>             self.is_title = 1
>
> >>     def handle_data(self, data):
> >>         if self.is_title:
> >>             print "here"
> >>             self.title = data
> >>             print self.title
> >>             self.is_title = 0
> >> -----------------------
>
> >> For the tag
> >> -------
> >> <div class="articleTitle">open article title</div>
> >> -------
>
> >> I use my code to parse it. I can locate the div tag but I don't know how
> >> to get the text for the tag which is "open article title" in my example.
>
> >> How can I get the html content? What's wrong in my handle_data function?
>
> >> Thanks
>
> >> Water Lin
>
> >> --
> >> Water Lin's notes and pencils:http://en.waterlin.org
> >> Email: Water... at ymail.com
>
> > I want to say your code works well
>
> But in handle_data I can't print self.title. I don't know why I can't
> set self.title in handle_data.
>
> Thanks
>
> Water Lin
>
> --
> Water Lin's notes and pencils:http://en.waterlin.org
> Email: Water... at ymail.com

I tested your code as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    title = ""
    is_title = ""
    def __init__(self, data):
        HTMLParser.__init__(self)
        self.feed(data)

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs[0][1] == 'articleTitle':
            print "Found link => %s" % attrs[0][1]
            self.is_title = 1

    def handle_data(self, data):
        if self.is_title:
            print "here"
            self.title = data
            print self.title
            self.is_title = 0


if __name__ == "__main__":
    m = MyParser(""" <div class="secttlbarwrap">
					  <table cellpadding=0 cellspacing=0 width="100%"><tr><td>
					  <div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
					  <td bgcolor="#999999" width="100%" height="4"><img alt=""
width=1 height=1><td>
					  <div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) -4px 0px; width: 4px; height: 4px">
					  </div></table></div>
					<div class="articleTitle">open article title</div>
					  <div class="secttlbar">
					  <div class="lf secttl">
					  <span id="thread_subject_site">
					  Ask how to use HTMLParser
					  </span>
					  </div>
					  <div class="rf secmsg frtxt padt2">
					  <a class="uitl" id="showoptions_lnk2" href="#"
onclick="TH_ToggleOptionsPane(); return false;">Parametrs</a>
					  </div>
					  <div class="hght0 clear" style="font-size:0;"></div>
					  </div>""")



Everything was printed and handled fine, including the
'print self.title' statement.
Try running my code.
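
If the real page still gives trouble, a slightly more defensive variant
may help. This is only a sketch under my own assumptions, not a
diagnosis of your page: the class name TitleParser and the attributes
titles, _in_title and _chunks are made up for illustration. It looks up
the class attribute by name instead of relying on attrs[0], so a div
with no attributes or with class in a different position does not
break it, and it collects text until the closing </div>, so titles
split by inline tags are joined. It does not handle nested divs inside
the title.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class TitleParser(HTMLParser):
    """Collect the text inside <div class="articleTitle"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []        # finished titles, one per matching div
        self._in_title = False  # True while inside a matching div
        self._chunks = []       # text pieces of the current title

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() allows lookup
        # by name and returns None when "class" is absent.
        if tag == 'div' and dict(attrs).get('class') == 'articleTitle':
            self._in_title = True
            self._chunks = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # The closing div ends the title (nested divs are not handled).
        if tag == 'div' and self._in_title:
            self.titles.append(''.join(self._chunks).strip())
            self._in_title = False


parser = TitleParser()
parser.feed('<div id="x" class="articleTitle">open <b>article</b> '
            'title</div><div>other</div>')
print(parser.titles)
```

Keeping the state on the instance (set in __init__) rather than on the
class also avoids sharing it between several parser objects.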

Vardan.


