Problem with Python's "robots.txt" file parser in module robotparser

Nikita the Spider NikitaTheSpider at gmail.com
Wed Jul 11 16:35:25 EDT 2007


In article <0T7li.5316$rL1.2716 at newssvr19.news.prodigy.net>,
 John Nagle <nagle at animats.com> wrote:

>    Python's "robots.txt" file parser may be misinterpreting a
> special case.  Given a robots.txt file like this:
> 
> 	User-agent: *
> 	Disallow: //
> 	Disallow: /account/registration
> 	Disallow: /account/mypro
> 	Disallow: /account/myint
> 	...
> 
> the python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed.  Apparently  "Disallow: //" is being interpreted as
> "Disallow: /".  Even the home page of the site is locked out. This may be 
> incorrect.
> 
> This is the robots.txt file for "http://ibm.com".

Hi John,
Are you sure you're not confusing two different sites? The robots.txt file 
at www.ibm.com contains the double-slashed path. The robots.txt file at 
ibm.com is different: it contains the following, which would explain why 
you think all URLs are denied:
User-agent: *
Disallow: /

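For what it's worth, here's a quick sketch that fetches both files and asks 
about the site root. Nothing fancy, and the results obviously depend on what 
each host is serving when you run it:

import robotparser

# Compare the rules served by the two hosts mentioned above.
for host in ("http://www.ibm.com", "http://ibm.com"):
    rp = robotparser.RobotFileParser()
    rp.set_url(host + "/robots.txt")
    rp.read()
    # Ask whether a generic bot may fetch the site root.
    print host, rp.can_fetch("WhateverBot", host + "/")
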
I don't see the bug to which you're referring:
>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>> 
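
You can also feed the double-slash rule to the parser in isolation. A minimal 
sketch (example.com is just a stand-in host) showing that only URLs whose 
path actually begins with "//" get blocked:

import robotparser

# A robots.txt consisting of nothing but the rule in question.
lines = [
    "User-agent: *",
    "Disallow: //",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# The home page and ordinary single-slash paths stay fetchable;
# only paths starting with a double slash are denied.
print rp.can_fetch("WhateverBot", "http://example.com/")           # allowed
print rp.can_fetch("WhateverBot", "http://example.com/foo.html")   # allowed
print rp.can_fetch("WhateverBot", "http://example.com//foo.html")  # denied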

I'll use this opportunity to shamelessly plug an alternate robots.txt 
parser that I wrote to address some small bugs in the standard library's 
parser:
http://NikitaTheSpider.com/python/rerp/

Cheers

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more


