Ask how to use HTMLParser

Dave Angel davea at ieee.org
Fri Jan 8 05:34:08 EST 2010


Water Lin wrote:
> h0uk <vardan.pogosyan at gmail.com> writes:
>
>   
>> On 8 Jan, 08:44, Water Lin <Water... at ymail.invalid> wrote:
>>     
>>> I am new to Python, but I want to parse an HTML page now. I
>>> tried to use HTMLParser. Here is my sample code:
>>> ----------------------
>>> from HTMLParser import HTMLParser
>>> from urllib2 import urlopen
>>>
>>> class MyParser(HTMLParser):
>>>     title = ""
>>>     is_title = ""
>>>     def __init__(self, url):
>>>         HTMLParser.__init__(self)
>>>         req = urlopen(url)
>>>         self.feed(req.read())
>>>
>>>     def handle_starttag(self, tag, attrs):
>>>         if tag == 'div' and attrs[0][1] == 'articleTitle':
>>>             print "Found link => %s" % attrs[0][1]
>>>             self.is_title = 1
>>>
>>>     def handle_data(self, data):
>>>         if self.is_title:
>>>             print "here"
>>>             self.title = data
>>>             print self.title
>>>             self.is_title = 0
>>> -----------------------
>>>
>>> For the tag
>>> -------
>>> <div class="articleTitle">open article title</div>
>>> -------
>>>
>>> I use my code to parse it. I can locate the div tag but I don't know how
>>> to get the text for the tag which is "open article title" in my example.
>>>
>>> How can I get the html content? What's wrong in my handle_data function?
>>>
>>> Thanks
>>>
>>> Water Lin
>>>
>>> --
>>> Water Lin's notes and pencils:http://en.waterlin.org
>>> Email: Water... at ymail.com
>>>       
>> I want to say your code works well
>>     
>
> But in handle_data I can't print self.title. I don't know why I can't
> set self.title in handle_data.
>
> Thanks
>
> Water Lin
>
>   
I don't know HTMLParser, but I see a possible confusion point in your 
class definition.

You have both class-attributes and instance-attributes of the same names 
(title and is_title). So if you have more than one instance of MyParser, 
then they won't see each other's changes. Normally, I'd move the 
initialization of such attributes into the __init__() method, so the 
behavior is clear.
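
A minimal sketch of what moving the initialization into __init__() might 
look like. It goes only by the handle_starttag/handle_data callbacks 
shown in your post, so treat it as an illustration rather than a tested 
fix. Two things here are my assumptions and not in your code: it feeds a 
literal string instead of fetching a URL, and it looks up the "class" 
attribute by name instead of using attrs[0][1], so a div without 
attributes doesn't raise an IndexError.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as in the original post


class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Per-instance state, set up here instead of in the class body.
        self.title = ""
        self.is_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() lets us look up
        # "class" by name instead of assuming it is the first attribute.
        if tag == 'div' and dict(attrs).get('class') == 'articleTitle':
            self.is_title = True

    def handle_data(self, data):
        # Only keep the text that directly follows the matching <div>.
        if self.is_title:
            self.title = data
            self.is_title = False


parser = MyParser()
parser.feed('<div class="articleTitle">open article title</div>')
print(parser.title)
```

With the state created in __init__(), every MyParser instance clearly 
owns its own title and is_title from the start.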

When an instance-attribute has the same name as a class-attribute, the 
instance-attribute takes precedence, and "hides" the class-attribute, 
for further processing in that same instance. So effectively, the 
class-attribute acts as a default value.
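
Here is a small sketch of that hiding behavior. Demo and set_title are 
made-up names for illustration only; they have nothing to do with 
HTMLParser.

```python
class Demo(object):
    title = ""                  # class attribute: acts as the default

    def set_title(self, text):
        # Assigning through self creates an instance attribute on this
        # one object; the class attribute itself is left untouched.
        self.title = text


a = Demo()
b = Demo()
a.set_title("first")

print(repr(a.title))            # 'first'  (instance attribute hides the default)
print(repr(b.title))            # ''       (still reading the class attribute)
print(repr(Demo.title))         # ''       (class attribute unchanged)
print('title' in a.__dict__)    # True
print('title' in b.__dict__)    # False
```

So an assignment through self never changes what other instances see; 
only an assignment to Demo.title would do that.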




