[Python-bugs-list] [ python-Bugs-813986 ] robotparser interactively prompts for username and password

SourceForge.net noreply at sourceforge.net
Mon Sep 29 09:24:08 EDT 2003


Bugs item #813986, was opened at 2003-09-28 15:06
Message generated for change (Comment added) made by calvin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=813986&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: robotparser interactively prompts for username and password

Initial Comment:
This is a rare occurrence, but if a /robots.txt file is
password-protected on an HTTP server, robotparser
interactively prompts (via raw_input) for a username
and password, because that is urllib's default
behavior.  One example of such a URL, at least at the
time of this writing, is

http://www.cosc.canterbury.ac.nz/robots.txt

Given that robotparser and robots.txt are all about
*robots* (not interactive users), I don't think this
interactive behavior is terribly appropriate.  Attached
is a simple patch to robotparser.py to fix this
behavior, forcing urllib to return the 401 error that
it ought to.



Another issue is whether a 401 (Authorization Required)
response for /robots.txt means that everything should be
allowed or everything should be disallowed.  I'm not sure
what's "right".  The spec says 'This file must be
accessible via HTTP on the local URL "/robots.txt"',
which I would read to mean it should be accessible
without a username/password, so a protected file would
count as missing and everything would be allowed.  On the
other hand, the current robotparser.py code says "if
self.errcode == 401 or self.errcode == 403:
self.disallow_all = 1", which has the opposite effect.
I'll leave deciding which is most appropriate to the
powers that be.



----------------------------------------------------------------------

Comment By: Bastian Kleineidam (calvin)
Date: 2003-09-29 15:24

Message:
Logged In: YES 
user_id=9205

http://www.robotstxt.org/wc/norobots-rfc.html specifies that
401 and 403 return codes restrict the whole site
(i.e. disallow_all = 1).
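
A minimal sketch of that rule, for illustration only. It is not the
robotparser.py source: the function name robots_policy and the string
labels are hypothetical, and the handling of other 4xx codes (treat
the file as absent, so allow everything) is an assumption that is not
stated in this thread.

```python
def robots_policy(status):
    """Map the HTTP status of a /robots.txt fetch to a crawl policy.

    Hypothetical helper; the return values are illustrative labels.
    """
    if status in (401, 403):
        # Access to robots.txt itself is restricted: treat the whole
        # site as off limits (the disallow_all = 1 case).
        return "disallow_all"
    if 400 <= status < 500:
        # Assumption: any other client error means "no robots.txt",
        # so nothing is restricted.
        return "allow_all"
    if status == 200:
        # The file was served; its rules still have to be parsed.
        return "parse"
    # Anything else is left undecided in this sketch.
    return "unspecified"
```

With this mapping, a password-protected robots.txt (401) and a
forbidden one (403) both end up in the same "disallow_all" branch.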



For the password input, the patch looks good to me. In the
long term, robotparser.py should switch to urllib2.py
anyway, and it should handle Transfer-Encoding: gzip.
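
To illustrate why a urllib2-style API avoids the prompt, here is a
rough sketch written against Python 3's urllib.request (the successor
of urllib2), where a 401 is raised as an HTTPError instead of
triggering interactive input. The helper name fetch_robots and its
open_url parameter are hypothetical and exist only to keep the example
self-contained; this is not the attached patch.

```python
import urllib.error
import urllib.request


def fetch_robots(url, open_url=urllib.request.urlopen):
    """Return (status, body) for url without asking for credentials.

    open_url is a hypothetical injection point so the function can be
    exercised without network access.
    """
    try:
        with open_url(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # 401, 403, 404 and friends arrive here as exceptions; the
        # caller decides what the status code means for crawling.
        return err.code, b""
```

A caller would then inspect the returned status (for example 401 or
403) and apply whatever allow/disallow policy it considers correct.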

----------------------------------------------------------------------



