robotparser behavior on 403 (Forbidden) robot.txt files

"Martin v. Löwis" martin at v.loewis.de
Mon Jun 2 17:01:45 EDT 2008


>   I just discovered that the "robotparser" module interprets
> a 403 ("Forbidden") status on a "robots.txt" file as meaning
> "all access disallowed". That's unexpected behavior.

That's specified in the "norobots RFC":

http://www.robotstxt.org/norobots-rfc.txt

- On server response indicating access restrictions (HTTP Status
  Code 401 or 403) a robot should regard access to the site
  completely restricted.

So if a site returns 403 for its robots.txt, we should assume that it
did so deliberately and doesn't want to be indexed.
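For what it's worth, you can watch the module do exactly that without
touching anybody's real site. A minimal sketch (written for Python 3,
where the module has moved to urllib.robotparser; the throwaway local
server that answers everything with 403 and the "MyBot" agent name are
just for the demonstration):

import threading
import urllib.robotparser
from http.server import BaseHTTPRequestHandler, HTTPServer

class ForbiddenHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every request, including /robots.txt, with 403.
        self.send_error(403)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet

# Port 0 lets the OS pick a free port; serve in a background thread.
server = HTTPServer(("localhost", 0), ForbiddenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://localhost:%d/robots.txt" % server.server_port)
rp.read()  # sees the 403 and marks the whole site as restricted

# Every path is now disallowed, for any user agent.
print(rp.can_fetch("MyBot", "/some/page"))  # -> False

server.shutdown()

Because read() got a 403 back, the parser treats the entire site as
off-limits, so can_fetch() returns False for every path - which is
the RFC behaviour quoted above.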

>  A major site ("http://www.aplus.net/robot.txt") has their
> "robots.txt" file set up that way.

You should try "http://www.aplus.net/robots.txt" instead,
which can be accessed just fine.

Regards,
Martin


