Problem with Python's "robots.txt" file parser in module robotparser
John Nagle
nagle at animats.com
Wed Jul 11 12:57:56 EDT 2007
Python's "robots.txt" file parser may be misinterpreting a
special case. Given a robots.txt file like this:
User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
...
the Python library "robotparser.RobotFileParser()" considers all pages of the
site to be disallowed. Apparently "Disallow: //" is being interpreted as
"Disallow: /". Even the home page of the site is locked out. This may be incorrect.
This is the robots.txt file for "http://ibm.com".
Some IBM operating systems recognize filenames starting with "//"
as a special case like a network root, so IBM may be trying to
handle some problem like that.
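
The behavior described above can be checked with a short script. This is only a sketch: it feeds the rules quoted above (the elided remainder of the file is left out) straight to the parser instead of fetching them from the site. It assumes Python 3, where the module is named "urllib.robotparser" rather than "robotparser"; the helper name check() is made up for illustration. The result for the home page may differ between Python versions, so no expected output is shown for it.

```python
# Python 3 module name; in Python 2 this was the top-level "robotparser" module.
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the post; the rest of the file is elided there.
ROBOTS_TXT = """\
User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
"""


def check(url, agent="*"):
    """Hypothetical helper: parse the rules above and ask whether url may be fetched."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# If "Disallow: //" is treated like "Disallow: /", the first line prints False.
print("home page allowed:", check("http://ibm.com/"))
print("registration page allowed:", check("http://ibm.com/account/registration"))
```

If the first line prints False, the home page is being locked out as described.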
The spec for "robots.txt", at
http://www.robotstxt.org/wc/norobots.html
says "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved." That suggests that "//" should only disallow
paths beginning with "//".
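
Read literally, the sentence quoted from the spec is a plain prefix test. The sketch below shows that reading only; it is not how robotparser is implemented, and the names is_disallowed() and RULES are invented for this example. Empty Disallow values are skipped, on the assumption that an empty value disallows nothing.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# The Disallow values quoted in the post (the elided rest of the file is omitted).
RULES = ["//", "/account/registration", "/account/mypro", "/account/myint"]

print(is_disallowed("/", RULES))                          # False
print(is_disallowed("//foo", RULES))                      # True
print(is_disallowed("/account/registration/new", RULES))  # True
print(is_disallowed("/products", RULES))                  # False
```

Under this reading the home page "/" stays fetchable, and only paths that really begin with "//" are caught by the first rule.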
John Nagle
SiteTruth