HTMLParser error

Thu May 22 04:59:28 EDT 2008

On May 22, 6:22 pm, jonbutle... at googlemail.com wrote:
> Still getting very odd errors though, this being the latest:
>
> Traceback (most recent call last):
>   File "spider.py", line 38, in <module>
> [...snip...]
>     raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
> httplib.InvalidURL: nonnumeric port: ''

Okay. What I did was put some output in your Spider.parse method:

    def parse(self, page):
        try:
            print 'http://' + page
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

And here's the output:

    >python spider.py
    What site would you like to scan? http://www.google.com
    http://www.google.com
    http://http://images.google.com.au/imghp?hl=en&tab=wi

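The links you're finding on each page already have the protocol
specified, so parse ends up requesting 'http://http://...'. httplib
then reads 'http:' as the host with nothing after the colon, hence
"nonnumeric port: ''". I'd remove the 'http://' addition from parse,
and just add it to 'site' in the main section.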

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        site = 'http://' + site
        s.crawl(site)
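
If you'd rather guard against the doubled prefix in one place, here's a
rough sketch. ensure_scheme is a hypothetical helper of mine, not
something from your script, and the import fallback is only there so
the snippet runs on either Python 2 or Python 3:

```python
try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3


def ensure_scheme(url, default='http://'):
    """Hypothetical helper: prepend default only if url lacks a scheme."""
    if url.startswith(('http://', 'https://')):
        return url
    return default + url


# What the old parse() effectively asked urlopen to fetch:
doubled = 'http://' + 'http://images.google.com.au/imghp?hl=en&tab=wi'
# The "host:port" part is parsed as 'http:', i.e. host 'http', empty port.
broken_netloc = urlparse(doubled).netloc

fixed_site = ensure_scheme('www.google.com')
fixed_link = ensure_scheme('http://images.google.com.au/imghp?hl=en&tab=wi')
```

With that, the bare 'www.google.com' typed at the prompt gets a prefix
and the links scraped from the page are passed through untouched.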

> Also could you explain why I needed to add that
> HTMLParser.__init__(self) line? Does it matter that I have overwritten
> the __init__ function of spider?

You've overridden HTMLParser.__init__ by giving Spider its own
__init__, and that's fine, but it means Python no longer runs the
HTMLParser version for you. By calling it yourself, what you're doing
every time you create a Spider object is first getting HTMLParser to
initialise it as it would any other HTMLParser object - which is what
adds the .rawdata attribute to each HTMLParser instance - *and then*
doing the Spider-specific initialisation you need.

Here's an abbreviated copy of the actual HTMLParser class featuring
only its __init__ and reset methods:

    class HTMLParser(markupbase.ParserBase):
        def __init__(self):
            """Initialize and reset this instance."""
            self.reset()

        def reset(self):
            """Reset this instance.  Loses all unprocessed data."""
            self.rawdata = ''
            self.lasttag = '???'
            self.interesting = interesting_normal
            markupbase.ParserBase.reset(self)

When you initialise an instance of HTMLParser, __init__ calls its reset
method, which sets rawdata to an empty string, creating the attribute
on the instance if it doesn't already exist. So when you call
HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
method, which Spider inherits from HTMLParser, on the Spider instance.
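
To make that concrete, here's a minimal sketch of the two cases. The
class bodies are my own guess at the shape of your code, not a copy of
it: BrokenSpider and the links attribute are made up for illustration,
and the import fallback is only so it runs on Python 2 or Python 3.

```python
try:
    from HTMLParser import HTMLParser     # Python 2
except ImportError:
    from html.parser import HTMLParser    # Python 3


class BrokenSpider(HTMLParser):
    def __init__(self):
        # HTMLParser.__init__ is never called, so reset() never runs
        # and the instance never gets a rawdata attribute.
        self.links = []


class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)   # runs reset(), which creates rawdata
        self.links = []             # Spider-specific setup (hypothetical)

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag fed to the parser.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


good = Spider()
good.feed('<a href="http://example.com/">x</a>')
bad = BrokenSpider()
```

Here good has a rawdata attribute and can be fed a page, while bad has
none, so calling feed on it fails as soon as the parser tries to append
to self.rawdata.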

Are you familiar with object oriented design at all? If you're not,
let me know and I'll track down some decent intro docs. Inheritance is
a pretty fundamental concept but I don't think I'm doing it justice.