Ask how to use HTMLParser
Dave Angel
davea at ieee.org
Fri Jan 8 05:34:08 EST 2010
Water Lin wrote:
> h0uk <vardan.pogosyan at gmail.com> writes:
>
>
>> On 8 Jan, 08:44, Water Lin <Water... at ymail.invalid> wrote:
>>
>>> I am new to Python, and I want to parse an HTML page. I
>>> tried to use HTMLParser. Here is my sample code:
>>> ----------------------
>>> from HTMLParser import HTMLParser
>>> from urllib2 import urlopen
>>>
>>> class MyParser(HTMLParser):
>>>     title = ""
>>>     is_title = ""
>>>     def __init__(self, url):
>>>         HTMLParser.__init__(self)
>>>         req = urlopen(url)
>>>         self.feed(req.read())
>>>
>>>     def handle_starttag(self, tag, attrs):
>>>         if tag == 'div' and attrs[0][1] == 'articleTitle':
>>>             print "Found link => %s" % attrs[0][1]
>>>             self.is_title = 1
>>>
>>>     def handle_data(self, data):
>>>         if self.is_title:
>>>             print "here"
>>>             self.title = data
>>>             print self.title
>>>             self.is_title = 0
>>> -----------------------
>>>
>>> For the tag
>>> -------
>>> <div class="articleTitle">open article title</div>
>>> -------
>>>
>>> I use my code to parse it. I can locate the div tag but I don't know how
>>> to get the text for the tag which is "open article title" in my example.
>>>
>>> How can I get the html content? What's wrong in my handle_data function?
>>>
>>> Thanks
>>>
>>> Water Lin
>>>
>>> --
>>> Water Lin's notes and pencils:http://en.waterlin.org
>>> Email: Water... at ymail.com
>>>
>> I want to say your code works well
>>
>
> But in handle_data I can't print self.title. I don't know why I can't
> set self.title in handle_data.
>
> Thanks
>
> Water Lin
>
>
I don't know HTMLParser, but I see a possible confusion point in your
class definition.
You have both class-attributes and instance-attributes of the same names
(title and is_title). Your methods assign through self, which creates
instance-attributes, so if you have more than one instance of MyParser,
they won't see each other's changes. Normally, I'd move the
initialization of such attributes into the __init__() method, so the
behavior is clear.
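As a rough sketch of that suggestion: the class below is a made-up stand-in for MyParser (the names TitleTracker and mark_title are only for illustration, and the HTMLParser base class is left out so the snippet runs on its own). It creates both attributes per instance in __init__():

```python
class TitleTracker(object):
    """Hypothetical stand-in for MyParser, without the HTMLParser base."""

    def __init__(self):
        # Each instance gets its own attributes, set up in one visible place.
        self.title = ""
        self.is_title = False

    def mark_title(self, text):
        # Changes only this instance's attributes.
        self.title = text
        self.is_title = True


first = TitleTracker()
second = TitleTracker()
first.mark_title("open article title")

print(first.title)      # open article title
print(second.is_title)  # False
```

With this layout there is no class-level title or is_title at all, so there is nothing for an instance to hide.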
When an instance-attribute has the same name as a class-attribute, the
instance-attribute takes precedence, and "hides" the class-attribute,
for further processing in that same instance. So effectively, the
class-attribute acts as a default value.
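A small illustration of that hiding, using a throwaway class (Example is a made-up name, not part of the original code):

```python
class Example(object):
    title = "default"   # class-attribute, shared by every instance


a = Example()
b = Example()

a.title = "changed"     # creates an instance-attribute on a only

print(a.title)          # changed  (instance-attribute hides the class one)
print(b.title)          # default  (b still falls back to the class-attribute)
print(Example.title)    # default  (the class-attribute itself is untouched)

# vars() shows what is stored on each instance.
print("title" in vars(a))  # True
print("title" in vars(b))  # False
```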