Python "robots.txt" parser broken since 2003
Nikita the Spider
NikitaTheSpider at gmail.com
Sun Apr 22 11:56:38 EDT 2007
In article <FvtWh.11824$YL5.6282 at newssvr29.news.prodigy.net>,
John Nagle <nagle at animats.com> wrote:
> This bug, "[ 813986 ] robotparser interactively prompts for username and
> password", has been open since 2003. It killed a big batch job of ours
> last night.
>
> Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input. Which is rarely what you
> want. You can demonstrate this with:
>
> import robotparser
> url = 'http://mueblesmoraleda.com' # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read() # Prompts for password
>
> That's the standard, although silly, "urllib" behavior.
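
One workaround for the prompting problem (a sketch, not John's proposed fix): skip read() entirely, fetch robots.txt yourself with whatever auth and error handling you want, and hand the text to parse(). The helper name below is illustrative, and I'm using the Python 3 module name, urllib.robotparser:

```python
# Workaround sketch: avoid RobotFileParser.read() (which does its own
# urlopen and, in the old urllib, could prompt on stdin for a password)
# by feeding pre-fetched text to parse() instead.
from urllib.robotparser import RobotFileParser

def parser_from_text(text, url="http://example.com/robots.txt"):
    # url is only used to resolve relative paths; no network access here
    rp = RobotFileParser(url)
    rp.parse(text.splitlines())   # no fetch, so no interactive prompt
    return rp

robots_txt = """User-agent: *
Disallow: /private/
"""
rp = parser_from_text(robots_txt)
print(rp.can_fetch("MyBot", "http://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "http://example.com/index.html"))    # True
```

Since you control the fetch, a 401 can become "disallow all" (or whatever policy you prefer) instead of a blocked batch job.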
John,
robotparser is (IMO) suboptimal in a few other ways, too.
- It doesn't handle non-ASCII characters. (They're infrequent, but when
writing a spider that sees thousands of robots.txt files in a short
time, "infrequent" can become "daily".)
- It doesn't account for BOMs in robots.txt files (which are rare).
- It ignores any Expires header sent with the robots.txt file.
- It handles some ambiguous return codes (e.g. 503) itself that it
ought to pass up to the caller.
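
To make the BOM and return-code points concrete, here's a rough sketch of what handling them might look like. The helper, exception, and the set of "ambiguous" codes are all my own hypothetical choices, and I'm again using the Python 3 module name:

```python
# Sketch: decode a robots.txt response defensively before parsing.
# RobotsUnavailable and AMBIGUOUS are hypothetical names for this example.
import codecs
from urllib.robotparser import RobotFileParser

AMBIGUOUS = {500, 502, 503, 504}  # server errors: caller should decide

class RobotsUnavailable(Exception):
    """Raised for responses the caller should handle (e.g. retry a 503)."""

def robots_from_response(status, raw_bytes, url):
    # Don't silently map a 503 to "allow all" or "disallow all";
    # surface it so the spider can retry later.
    if status in AMBIGUOUS:
        raise RobotsUnavailable(status)
    # Strip a UTF-8 BOM if present, then decode leniently so stray
    # non-ASCII bytes can't kill a long spidering run.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        raw_bytes = raw_bytes[len(codecs.BOM_UTF8):]
    text = raw_bytes.decode("utf-8", errors="replace")
    rp = RobotFileParser(url)
    rp.parse(text.splitlines())
    return rp

body = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /x/\n"
rp = robots_from_response(200, body, "http://example.com/robots.txt")
print(rp.can_fetch("Bot", "http://example.com/x/a"))  # False
```

Without the BOM strip, the first line parses as "\ufeffUser-agent" and the whole record is ignored, which is exactly the kind of silent failure a busy spider never notices.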
I wrote my own parser to address these problems. It probably suffers
from the same urllib hang that you've found (I have not encountered it
myself) and I appreciate you posting a fix. Here's the code &
documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/
Cheers
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more