HTMLParser error

jonbutler88 at googlemail.com
Thu May 22 15:06:18 EDT 2008


On May 22, 9:59 am, alex23 <wuwe... at gmail.com> wrote:
> On May 22, 6:22 pm, jonbutle... at googlemail.com wrote:
>
> > Still getting very odd errors though, this being the latest:
>
> > Traceback (most recent call last):
> >   File "spider.py", line 38, in <module>
> > [...snip...]
> >     raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
> > httplib.InvalidURL: nonnumeric port: ''
>
> Okay. What I did was put some output in your Spider.parse method:
>
>     def parse(self, page):
>         try:
>             print 'http://' + page
>             self.feed(urlopen('http://' + page).read())
>         except HTTPError:
>             print 'Error getting page source'
>
> And here's the output:
>
>     >python spider.py
>     What site would you like to scan?http://www.google.com
>     http://www.google.com
>     http://http://images.google.com.au/imghp?hl=en&tab=wi
>
> The links you're finding on each page already have the protocol
> specified. I'd remove the 'http://' addition from parse, and just add
> it to 'site' in the main section.
>
>     if __name__ == '__main__':
>         s = Spider()
>         site = raw_input("What site would you like to scan? http://")
>         site = 'http://' + site
>         s.crawl(site)
>
> > Also could you explain why I needed to add that
> > HTMLParser.__init__(self) line? Does it matter that I have overwritten
> > the __init__ function of spider?
>
> You've overridden HTMLParser.__init__ in Spider, but you haven't lost
> it. What you're doing every time you create a Spider object is first
> getting HTMLParser to initialise it as it would any other HTMLParser
> object - which is what adds the .rawdata attribute to each HTMLParser
> instance - *and then* doing the Spider-specific initialisation you need.
>
> Here's an abbreviated copy of the actual HTMLParser class featuring
> only its __init__ and reset methods:
>
>     class HTMLParser(markupbase.ParserBase):
>         def __init__(self):
>             """Initialize and reset this instance."""
>             self.reset()
>
>         def reset(self):
>             """Reset this instance.  Loses all unprocessed data."""
>             self.rawdata = ''
>             self.lasttag = '???'
>             self.interesting = interesting_normal
>             markupbase.ParserBase.reset(self)
>
> When you initialise an instance of HTMLParser, it calls its reset
> method, which sets rawdata to an empty string, or adds it to the
> instance if it doesn't already exist. So when you call
> HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
> method on the Spider instance, which it inherits from HTMLParser...
>
> Are you familiar with object oriented design at all? If you're not,
> let me know and I'll track down some decent intro docs. Inheritance is
> a pretty fundamental concept but I don't think I'm doing it justice.

Nope, this is my first experience with object-oriented programming. I've
only been learning Python for a few weeks, but it seemed simple enough
to inspire me to be a bit ambitious. If you could hook me up with some
good docs, that would be great. I was about to buy a book on Python,
specifically an OO-based one, but I'll look at these docs first. I
understand most of the concepts of inheritance; I've just never used
them before.
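
To check my understanding of the HTMLParser.__init__(self) point, here is
a minimal sketch. It assumes Python 3, where the class lives in
html.parser (in the Python 2 used above it is the HTMLParser module).
LinkCollector, Broken and the links attribute are made-up names for
illustration, not the Spider class from this thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass, for illustration only (not the thread's Spider)."""

    def __init__(self):
        # Let the base class initialise itself first; its __init__ calls
        # reset(), which is what creates self.rawdata on the instance.
        HTMLParser.__init__(self)
        # Then do the subclass-specific setup.
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


class Broken(HTMLParser):
    """Same idea, but forgets to call the base class __init__."""

    def __init__(self):
        self.links = []


collector = LinkCollector()
collector.feed('<a href="http://www.example.com/">example</a>')
print(collector.links)

try:
    # feed() appends to self.rawdata, which was never created here.
    Broken().feed('<p>hello</p>')
except AttributeError as error:
    print('Broken parser failed:', error)
```

If I have this right, the second class fails because nothing ever ran
reset() on it, which seems to be the same kind of missing-attribute
problem that adding the HTMLParser.__init__(self) line fixed for me.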

Thanks


