HTMLParser error

alex23 wuwei23 at gmail.com
Wed May 21 21:40:08 EDT 2008


On May 22, 8:18 am, jonbutle... at googlemail.com wrote:
> Sorry, I'm new to both python and newsgroups, this is all pretty
> confusing. So I need a line in my __init__ function of my class? The
> spider class I made inherits from HTMLParser. It's just using the
> feed() function that produces errors though, the rest seems to work
> fine.

Let me repeat: it would make this a lot easier if you would paste
actual code.

As you say, your Spider class inherits from HTMLParser, so you need to
make sure the parent class is set up properly for the functionality
you've inherited to work. If you've added your own __init__ to Spider,
then the __init__ on HTMLParser is no longer called unless you
*explicitly* call it yourself.
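
To see this in isolation, here is a minimal sketch using made-up
classes. Base, Silent, Explicit and feed_works are hypothetical names
for illustration only; Base merely imitates the way a parser keeps
state that its __init__ has to create first.

```python
class Base:
    def __init__(self):
        self.rawdata = ''  # state that feed() relies on

    def feed(self, data):
        self.rawdata = self.rawdata + data


class Silent(Base):
    def __init__(self):
        pass  # overrides Base.__init__, so rawdata is never created


class Explicit(Base):
    def __init__(self):
        Base.__init__(self)  # explicitly run the parent's setup


def feed_works(parser):
    """Return True if feed() succeeds, False if it hits missing state."""
    try:
        parser.feed('<p>hi</p>')
    except AttributeError:
        return False
    return True
```

feed_works(Silent()) fails on the missing attribute, while
feed_works(Explicit()) succeeds because the parent's __init__ ran.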

Unfortunately, my earlier advice wasn't totally correct... HTMLParser
is an old-style class, whereas super() only works for new-style
classes, I believe. (If you don't know about old- vs. new-style
classes, see http://docs.python.org/ref/node33.html.) Below is a broken
version followed by two approaches that should work for you:

    class SpiderBroken(HTMLParser):
        def __init__(self):
            pass # don't do any ancestral setup

    class SpiderOldStyle(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)

    class SpiderNewStyle(HTMLParser, object):
        def __init__(self):
            super(SpiderNewStyle, self).__init__()

Python 2.5.1 (r251:54863, May  1 2007, 17:47:05) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> html = open('temp.html','r').read()
>>> from spider import *
>>> sb = SpiderBroken()
>>> sb.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\lib\HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: SpiderBroken instance has no attribute 'rawdata'
>>> so = SpiderOldStyle()
>>> so.feed(html)
>>> sn = SpiderNewStyle()
>>> sn.feed(html)
>>>

The old-style version is probably easiest, so putting this line in
your __init__ should fix your issue:

    HTMLParser.__init__(self)
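
For context, here is a rough sketch of how that line might sit in a
spider class. Since I haven't seen your code, the Spider class, the
links attribute and the link-collecting handle_starttag are all my
guesses rather than what you actually have. The import fallback is only
there so the sketch also runs on later Pythons, where the module is
named html.parser.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # run the parent's setup first
        self.links = []  # hypothetical attribute: hrefs seen so far

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


spider = Spider()
spider.feed('<a href="http://example.com/">example</a>')
print(spider.links)
```

The parent call comes first so that the parser's own state exists
before the class adds anything of its own.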

If this still isn't clear, please let me know.

- alex23


