Having problems with urlparser concatenation

i80and i80and at gmail.com
Thu Nov 9 22:50:57 EST 2006


Thank you!  Fixed my problem perfectly!
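
A minimal sketch of the fix suggested in the quoted reply below, for anyone reading this in the archive. It assumes the goal is a robots.txt URL for every link on a page. The names LinkCollector, robots_url and find_robots_urls are hypothetical and do not come from the original script, and the standard library's HTMLParser stands in for Beautiful Soup here so the example has no third-party dependency.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
except ImportError:  # Python 2, as used in the thread
    from HTMLParser import HTMLParser
    from urlparse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def robots_url(link):
    """Return the robots.txt URL for the site an absolute link points to."""
    parts = urlparse(link)
    if not parts.scheme or not parts.netloc:
        raise ValueError("need an absolute URL, got %r" % (link,))
    # A leading '/' makes urljoin replace the whole path while keeping
    # the scheme and host, so 'http://' is preserved.
    return urljoin(link, '/robots.txt')


def find_robots_urls(page_url, html):
    """Return one robots.txt URL per http(s) link found in html."""
    collector = LinkCollector()
    collector.feed(html)
    result = []
    for href in collector.links:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).scheme in ('http', 'https'):
            result.append(robots_url(absolute))
    return result


print(robots_url('http://en.wikipedia.org/wiki/Main_Page'))
# prints http://en.wikipedia.org/robots.txt
```

The resulting string still carries its scheme, so passing it to robotparser's set_url() should not be mistaken for a local file path.
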
Gabriel Genellina wrote:
> At Thursday 9/11/2006 20:23, i80and wrote:
>
> >I'm working on a basic web spider, and I'm having problems with the
> >urlparser.
> >[...]
> >             SpliceStart = Website.find('<a href="', (i+1))
> >             SpliceEnd = (Website.find('">', SpliceStart))
> >
> >             ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
> >             robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
> >-----
> >Traceback (most recent call last):
> >   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
> >     FindLinks(Website)
> >   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
> >     robotparser.read()
> >   File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
> >     f = opener.open(self.url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
> >     return getattr(self, name)(url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
> >     return self.open_local_file(url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
> >     raise IOError(e.errno, e.strerror, e.filename)
> >IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
> >
> >Note the last line 'en.wikipedia.org\\robots.txt'.  I want
> >'en.wikipedia.org/robots.txt'!  What am I doing wrong?
>
> No, you don't want 'en.wikipedia.org/robots.txt'; you want
> 'http://en.wikipedia.org/robots.txt'.
> Without the http:// scheme, urllib treats the former as a file:
> request, hence the \\ in the normalized path.
> You are parsing the link and then building a new URI using ONLY the
> hostname part; that's wrong. urljoin expects a URL string, not the
> parsed result, so use urljoin(ParsedURL.geturl(), '/robots.txt') instead.
>
> You might also try Beautiful Soup for more robust HTML parsing.
>
> --
> Gabriel Genellina
> Softlab SRL