Problem with Python's "robots.txt" file parser in module robotparser
Nikita the Spider
NikitaTheSpider at gmail.com
Wed Jul 11 16:35:25 EDT 2007
In article <0T7li.5316$rL1.2716 at newssvr19.news.prodigy.net>,
John Nagle <nagle at animats.com> wrote:
> Python's "robots.txt" file parser may be misinterpreting a
> special case. Given a robots.txt file like this:
>
> User-agent: *
> Disallow: //
> Disallow: /account/registration
> Disallow: /account/mypro
> Disallow: /account/myint
> ...
>
> the python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed. Apparently "Disallow: //" is being interpreted as
> "Disallow: /". Even the home page of the site is locked out. This may be
> incorrect.
>
> This is the robots.txt file for "http://ibm.com".
Hi John,
Are you sure you're not confusing your sites? The robots.txt file at
www.ibm.com contains the double-slashed path, but the robots.txt file
at ibm.com is different; it contains this, which would explain why you
think all URLs are denied:
User-agent: *
Disallow: /
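As a sketch of why those two lines deny everything, you can feed them
straight to urllib.robotparser (the Python 3 descendant of the old
robotparser module); the bot name and URLs below are just illustrative:

```python
import urllib.robotparser

# "Disallow: /" under "User-agent: *" is a prefix rule matching every
# path, so the whole site -- including the home page -- is denied.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("WhateverBot", "http://ibm.com/"))          # False
print(rp.can_fetch("WhateverBot", "http://ibm.com/foo.html"))  # False
```

So a crawler that fetched ibm.com's robots.txt rather than
www.ibm.com's would see exactly the "everything disallowed" behavior
John describes.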
I don't see the bug to which you're referring:
>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>>
I'll use this opportunity to shamelessly plug an alternate robots.txt
parser that I wrote to address some small bugs in the standard
library's parser:
http://NikitaTheSpider.com/python/rerp/
Cheers
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more