Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

Nikita the Spider NikitaTheSpider at gmail.com
Thu Oct 4 12:14:40 EDT 2007


In article <ActMi.30614$eY.11375 at newssvr13.news.prodigy.net>,
 John Nagle <nagle at animats.com> wrote:

> Filip Salomonsson wrote:
> > On 02/10/2007, John Nagle <nagle at animats.com> wrote:
> >> But there's something in there now that robotparser doesn't like.
> >> Any ideas?
> > 
> > Wikipedia denies _all_ access for the standard urllib user agent, and
> > when robotparser gets a 401 or 403 response while trying to fetch
> > robots.txt, it treats that as equivalent to "Disallow: *".
> > 
> > http://infix.se/2006/05/17/robotparser
> 
>      That explains it.  It's an undocumented feature of "robotparser",
> as is the 'errcode' variable.  The documentation of "robotparser" is
> silent on error handling (can it raise an exception?) and should be
> updated.

Hi John,
Robotparser is probably following the never-approved RFC for robots.txt, 
which is the closest thing there is to a standard. It says, "On server 
response indicating access restrictions (HTTP Status Code 401 or 403) a 
robot should regard access to the site completely restricted."
http://www.robotstxt.org/wc/norobots-rfc.html
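
For what it's worth, here's a minimal sketch of one way around that 
behavior: fetch robots.txt yourself with a non-default User-Agent and 
hand the lines to parse(), so robotparser never sees the 403 that 
Wikipedia sends to urllib's default agent. (The "MyBot/0.1" agent 
string below is just a placeholder; use whatever identifies your bot.)

import urllib2
import robotparser

ROBOTS_URL = "http://en.wikipedia.org/robots.txt"

# Fetch robots.txt with an explicit User-Agent. Wikipedia answers 403
# to urllib's default agent, and read() then flags the whole site as
# disallowed.
request = urllib2.Request(ROBOTS_URL,
                          headers={"User-Agent": "MyBot/0.1 (example)"})
lines = urllib2.urlopen(request).read().splitlines()

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.parse(lines)    # parse() skips read(), so no 401/403 handling

print parser.can_fetch("MyBot/0.1", "http://en.wikipedia.org/wiki/Python")

Of course, the rules in robots.txt (and the site's terms) still apply; 
this only avoids tripping over the 403 on the robots.txt fetch itself.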

If you're interested, I have a replacement for the robotparser module 
that works a little better (IMHO) and that you might also find better 
documented. I'm using it in production code:
http://nikitathespider.com/python/rerp/

Happy spidering

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more


