HTMLParser error

alex23 wuwei23 at gmail.com
Wed May 21 06:04:37 EDT 2008


On May 21, 6:58 pm, jonbutle... at googlemail.com wrote:
> It's not a variable I set, it's one of HTMLParser's built-in attributes. I
> am using it with urlopen to get the source of a website and feed it to
> HTMLParser.
>
> def parse(self, page):
>     try:
>         self.feed(urlopen('http://' + page).read())
>     except HTTPError:
>         print 'Error getting page source'
>
> This is the code I am using. I have tested the other modules and they
> work fine, but I haven't got a clue how to fix this one.

You're not providing enough information. Try to post a minimal code
fragment that demonstrates your error; it gives us all a common basis
for discussion.

Is your Spider class a subclass of HTMLParser? Is it overriding
__init__? If so, does it call the parent's __init__, with something like:

    super(Spider, self).__init__()

If a missing parent call is your issue, then looking at the HTMLParser
code you could get away with just doing the following in __init__:

    self.reset()

This appears to be the function that adds the .rawdata attribute.

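Ideally, you should call the parent's __init__ rather than reset()
directly...you're less reliant on the implementation of HTMLParser
that way. Note that super() only works on new-style classes: under
Python 2, HTMLParser is an old-style class and super() raises a
TypeError, so there the call has to be spelled
HTMLParser.__init__(self).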
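
To make that concrete, here is a minimal sketch of the suspected problem and
the fix. The class bodies are my guess at what a Spider might look like, not
the original poster's code: the links attribute, the handle_starttag logic and
the BrokenSpider name are all hypothetical. The explicit
HTMLParser.__init__(self) call is used because it works whether or not
HTMLParser is a new-style class.

```python
try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name


class Spider(HTMLParser):
    """Hypothetical subclass that collects href values from <a> tags."""

    def __init__(self):
        # Run the parent initialiser first; it calls reset(), which is
        # what creates the .rawdata attribute that feed() relies on.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


class BrokenSpider(HTMLParser):
    """Same idea, but __init__ is overridden without calling the parent."""

    def __init__(self):
        # No parent call and no reset(), so .rawdata is never created.
        self.links = []


spider = Spider()
spider.feed('<a href="http://example.com/">example</a>')
```

With this sketch, spider.links holds the one URL, while
BrokenSpider().feed(...) fails with an AttributeError on rawdata, which would
match the symptom I am guessing at.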

- alex23


