Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?
John Nagle
nagle at animats.com
Mon Oct 1 23:31:51 EDT 2007
For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:
>>> import robotparser
>>> url = 'http://wikipedia.org/robots.txt'
>>> chk = robotparser.RobotFileParser()
>>> chk.set_url(url)
>>> chk.read()
>>> testurl = 'http://wikipedia.org'
>>> chk.can_fetch('Mozilla', testurl)
False
>>>
The Wikipedia robots.txt file passes robots.txt validation,
and it doesn't disallow unknown user agents. But the Python
parser doesn't see it that way: no matter what user agent or URL
is specified, the only answer it gives for that file is "False".
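For comparison, feeding the parser a permissive robots.txt directly (skipping the network fetch entirely) gives the expected answers. A minimal offline sketch; on current Python the module lives at urllib.robotparser, and the rules below are illustrative, in Wikipedia's style, not its actual file:

```python
import urllib.robotparser

# A robots.txt in Wikipedia's style: it blocks a few named crawlers
# outright but has no blanket "Disallow: /" under "User-agent: *".
rules = [
    "User-agent: wget",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /trap/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # parse the lines directly; no HTTP fetch involved

print(rp.can_fetch("Mozilla", "http://wikipedia.org/"))        # True
print(rp.can_fetch("Mozilla", "http://wikipedia.org/trap/x"))  # False
```

So the parsing logic itself handles an unknown agent correctly once it actually has the rules in hand, which suggests the problem is in the fetch step rather than the parse step.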
It's failing in Python 2.4 on Windows and 2.5 on Fedora Core.
I use "robotparser" on lots of other robots.txt files, and it
normally works. It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
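One possibility worth checking (an assumption on my part, from reading the robotparser source rather than anything confirmed about Wikipedia's server): read() swallows HTTP errors, and a 401 or 403 on the robots.txt request sets an internal disallow_all flag, after which every can_fetch() call returns False for any agent and any URL — exactly the blanket "False" symptom. If Wikipedia's server is rejecting urllib's default user agent, that would explain it. A sketch of that failure mode, offline:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()

# This is what read() does internally when the server answers the
# robots.txt request with 401 or 403: no rules get parsed at all,
# and the whole site is simply marked off-limits.
rp.disallow_all = True

# Every query now answers False, regardless of agent or URL.
print(rp.can_fetch("Mozilla", "http://wikipedia.org/"))  # False
```

Fetching http://wikipedia.org/robots.txt by hand with a browser-style User-Agent header and comparing against urllib's default agent would confirm or rule this out.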
Any ideas?
John Nagle