Python "robots.txt" parser broken since 2003
Nikita the Spider
NikitaTheSpider at gmail.com
Sun Apr 22 11:56:38 EDT 2007
In article <FvtWh.11824$YL5.6282 at newssvr29.news.prodigy.net>,
John Nagle <nagle at animats.com> wrote:
> This bug, "[ 813986 ] robotparser interactively prompts for username and
> password", has been open since 2003. It killed a big batch job of ours
> last night.
>
> Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input. Which is rarely what you
> want. You can demonstrate this with:
>
> import robotparser
> url = 'http://mueblesmoraleda.com' # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read() # Prompts for password
>
> That's the standard, although silly, "urllib" behavior.
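
One workaround for the prompting problem (a sketch, not John's proposed fix): skip read() entirely, fetch robots.txt yourself with whatever auth and error handling you want, and hand the text to parse(). The helper name below is illustrative, and I'm using the Python 3 module name, urllib.robotparser:

```python
# Workaround sketch: avoid RobotFileParser.read() (which does its own
# urlopen and, in the old urllib, could prompt on stdin for a password)
# by feeding pre-fetched text to parse() instead.
from urllib.robotparser import RobotFileParser

def parser_from_text(text, url="http://example.com/robots.txt"):
    # url is only used to resolve relative paths; no network access here
    rp = RobotFileParser(url)
    rp.parse(text.splitlines())   # no fetch, so no interactive prompt
    return rp

robots_txt = """User-agent: *
Disallow: /private/
"""
rp = parser_from_text(robots_txt)
print(rp.can_fetch("MyBot", "http://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "http://example.com/index.html"))    # True
```

Since you control the fetch, a 401 can become "disallow all" (or whatever policy you prefer) instead of a blocked batch job.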
John,
robotparser is (IMO) suboptimal in a few other ways, too.
- It doesn't handle non-ASCII characters. (They're infrequent, but when
writing a spider that sees thousands of robots.txt files in a short
time, "infrequent" can become "daily".)
- It doesn't account for BOMs in robots.txt files (which are rare).
- It ignores any Expires header sent with the robots.txt file.
- It handles some ambiguous return codes (e.g. 503) itself that it
ought to pass up to the caller.
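
To make the BOM and return-code points concrete, here's a rough sketch of what handling them might look like. The helper, exception, and the set of "ambiguous" codes are all my own hypothetical choices, and I'm again using the Python 3 module name:

```python
# Sketch: decode a robots.txt response defensively before parsing.
# RobotsUnavailable and AMBIGUOUS are hypothetical names for this example.
import codecs
from urllib.robotparser import RobotFileParser

AMBIGUOUS = {500, 502, 503, 504}  # server errors: caller should decide

class RobotsUnavailable(Exception):
    """Raised for responses the caller should handle (e.g. retry a 503)."""

def robots_from_response(status, raw_bytes, url):
    # Don't silently map a 503 to "allow all" or "disallow all";
    # surface it so the spider can retry later.
    if status in AMBIGUOUS:
        raise RobotsUnavailable(status)
    # Strip a UTF-8 BOM if present, then decode leniently so stray
    # non-ASCII bytes can't kill a long spidering run.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        raw_bytes = raw_bytes[len(codecs.BOM_UTF8):]
    text = raw_bytes.decode("utf-8", errors="replace")
    rp = RobotFileParser(url)
    rp.parse(text.splitlines())
    return rp

body = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /x/\n"
rp = robots_from_response(200, body, "http://example.com/robots.txt")
print(rp.can_fetch("Bot", "http://example.com/x/a"))  # False
```

Without the BOM strip, the first line parses as "\ufeffUser-agent" and the whole record is ignored, which is exactly the kind of silent failure a busy spider never notices.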
I wrote my own parser to address these problems. It probably suffers
from the same urllib hang that you've found (I have not encountered it
myself) and I appreciate you posting a fix. Here's the code &
documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/
Cheers
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more