Having problems with urlparser concatenation
Gabriel Genellina
gagsl-py at yahoo.com.ar
Thu Nov 9 18:53:51 EST 2006
At Thursday 9/11/2006 20:23, i80and wrote:
>I'm working on a basic web spider, and I'm having problems with the
>urlparser.
>[...]
> SpliceStart = Website.find('<a href="', (i+1))
> SpliceEnd = (Website.find('">', SpliceStart))
>
> ParsedURL =
>urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
> robotparser.set_url(ParsedURL.hostname + '/' +
>'robots.txt')
>-----
>Traceback (most recent call last):
> File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
>line 120, in <module>
> FindLinks(Website)
> File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py",
>line 84, in FindLinks
> robotparser.read()
> File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
> f = opener.open(self.url)
> File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
> return getattr(self, name)(url)
> File "C:\Program Files\Python25\lib\urllib.py", line 451, in
>open_file
> return self.open_local_file(url)
> File "C:\Program Files\Python25\lib\urllib.py", line 465, in
>open_local_file
> raise IOError(e.errno, e.strerror, e.filename)
>IOError: [Errno 2] The system cannot find the path specified:
>'en.wikipedia.org\\robots.txt'
>
>Note the last line 'en.wikipedia.org\\robots.txt'. I want
>'en.wikipedia.org/robots.txt'! What am I doing wrong?
No, you don't want 'en.wikipedia.org/robots.txt'; you want
'http://en.wikipedia.org/robots.txt'.
Without a scheme, urllib treats the former as a local file: request,
hence the \\ in the normalized path.
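You are parsing the link and then building a new URI using ONLY the
hostname part, so the scheme is lost; that's wrong. Use
urljoin(url, '/robots.txt') instead, where url is the link string you
extracted (urljoin expects a URL string, not the result of urlparse).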
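A minimal sketch of that suggestion, assuming a current Python where these functions live in urllib.parse (in Python 2.5 they are in the urlparse module). The helper name robots_url and the example link are made up for illustration:

```python
from urllib.parse import urljoin, urlparse  # Python 2.5: from urlparse import ...

def robots_url(link):
    """Return the robots.txt URL for the site that `link` points to."""
    # A reference starting with '/' replaces the whole path of the base URL,
    # while the scheme and host of the base are kept.
    return urljoin(link, '/robots.txt')

link = 'http://en.wikipedia.org/wiki/Main_Page'  # assumed example link
print(robots_url(link))

# For comparison: without a scheme the whole string is parsed as a path,
# so there is no network location for urllib to connect to.
bare = urlparse('en.wikipedia.org/robots.txt')
print(repr(bare.scheme), repr(bare.netloc), bare.path)
```

The result of robots_url(link) is what should be handed to robotparser.set_url().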
You may also want to try Beautiful Soup for more robust HTML parsing
than searching for '<a href="' by hand.
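Beautiful Soup is a third-party package, so as a rough illustration of the same idea here is a sketch that uses the standard library's html.parser instead (this is a swapped-in technique, not Beautiful Soup itself). The names LinkCollector and find_links are hypothetical:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def find_links(html):
    """Return all link targets found in the given HTML text, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

page = '<p><a href="http://en.wikipedia.org/wiki/Python">Python</a></p>'
print(find_links(page))
```

Unlike the find('<a href="') approach, this does not depend on the attribute order or on the kind of quotes used in the page.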
--
Gabriel Genellina
Softlab SRL