Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

John Nagle nagle at animats.com
Mon Oct 1 23:31:51 EDT 2007


    For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:

 >>> import robotparser
 >>> url = 'http://wikipedia.org/robots.txt'
 >>> chk = robotparser.RobotFileParser()
 >>> chk.set_url(url)
 >>> chk.read()
 >>> testurl = 'http://wikipedia.org'
 >>> chk.can_fetch('Mozilla', testurl)
False
 >>>
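
    A quick way to see what the parser actually ended up with after
read() (the attribute names here come from the robotparser source, so
this is a debugging probe rather than a documented interface):

    import robotparser

    chk = robotparser.RobotFileParser()
    chk.set_url('http://wikipedia.org/robots.txt')
    chk.read()

    # Internal state set by read(); attribute names taken from the
    # stdlib source, so treat this as a guess, not a stable API.
    print 'disallow_all:', getattr(chk, 'disallow_all', 'n/a')
    print 'allow_all:   ', getattr(chk, 'allow_all', 'n/a')
    print 'entries:     ', len(getattr(chk, 'entries', []))
    print chk    # __str__ dumps whatever rules were actually parsed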

    The Wikipedia robots.txt file passes robots.txt validation, and
it doesn't disallow unrecognized user agents.  But the Python parser
doesn't see it that way: no matter what user agent or URL is
specified, the only answer for that robots.txt file is "False".
It fails the same way in Python 2.4 on Windows and Python 2.5 on
Fedora Core.
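
    One possible explanation: robotparser fetches robots.txt itself,
using urllib's default "Python-urllib" user agent, and as far as I can
tell from the robotparser source, a 401 or 403 response on robots.txt
makes it treat the whole site as disallowed.  Something along these
lines (urllib2 here, which sends a similar default user agent) shows
what the server actually returns to a plain Python client:

    import urllib2

    # Fetch robots.txt the way a default Python client would, to see
    # whether the server refuses the "Python-urllib" user agent.
    try:
        f = urllib2.urlopen('http://wikipedia.org/robots.txt')
        print 'fetched OK from', f.geturl()
        print f.read()[:300]
    except urllib2.HTTPError, e:
        # A 401 or 403 here would make robotparser answer False
        # for every user agent and URL.
        print 'server refused the request:', e.code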

    I use "robotparser" on lots of other robots.txt files, and it
normally works.  It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
Any ideas?

				John Nagle


