[ python-Bugs-813986 ] robotparser interactively prompts for username and password

SourceForge.net noreply at sourceforge.net
Sun Apr 22 23:12:05 CEST 2007


Bugs item #813986, was opened at 2003-09-28 13:06
Message generated for change (Comment added) made by nagle
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=813986&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Martin v. Löwis (loewis)
Summary: robotparser interactively prompts for username and password

Initial Comment:
This is a rare occurrence, but if a /robots.txt file is
password-protected on an http server, robotparser
interactively prompts (via raw_input) for a username
and password, because that is urllib's default
behavior.  One example of such a URL, at least at the
time of this writing, is

http://www.cosc.canterbury.ac.nz/robots.txt

Given that robotparser and robots.txt are all about
*robots* (not interactive users), I don't think this
interactive behavior is terribly appropriate.  Attached
is a simple patch to robotparser.py to fix this
behavior, forcing urllib to return the 401 error that
it ought to.
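
For context, a minimal sketch of the kind of override such a
patch adds to robotparser's internal URLopener (a
urllib.FancyURLopener subclass in the 2.x module); the actual
attached patch may differ in detail:

import urllib

class URLopener(urllib.FancyURLopener):
    def prompt_user_passwd(self, host, realm):
        # FancyURLopener's default implementation prompts on the
        # terminal via raw_input.  Returning (None, None) makes
        # urllib give up on Basic auth, as if no password had
        # been entered, so nothing ever blocks waiting for input.
        return None, None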

Another issue is whether a 401 (Authorization Required)
URL means that everything should be allowed or
everything should be disallowed.  I'm not sure what's
"right".  Reading the spec, it says 'This file must be
accessible via HTTP on the local URL "/robots.txt"'
which I would read to mean it should be accessible
without username/password.  On the other hand, the
current robotparser.py code says "if self.errcode ==
401 or self.errcode == 403: self.disallow_all = 1"
which has the opposite effect.  I'll leave deciding
which is most appropriate to the powers that be.
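
For reference, the quoted check lives in RobotFileParser.read();
paraphrased (not verbatim) from the 2.x module, the whole
error-code policy looks roughly like this:

def read(self):   # method of RobotFileParser, paraphrased
    opener = URLopener()          # robotparser's internal opener
    f = opener.open(self.url)
    lines = f.readlines()
    self.errcode = opener.errcode
    if self.errcode == 401 or self.errcode == 403:
        self.disallow_all = 1     # auth required: disallow everything
    elif self.errcode >= 400:
        self.allow_all = 1        # e.g. 404: no robots.txt, allow all
    elif self.errcode == 200 and lines:
        self.parse(lines)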


----------------------------------------------------------------------

Comment By: John Nagle (nagle)
Date: 2007-04-22 21:12

Message:
Logged In: YES 
user_id=5571
Originator: NO

I tried the patch by doing this:

import robotparser

def prompt_user_passwd(self, host, realm):
    return None, None

robotparser.URLopener.prompt_user_passwd = prompt_user_passwd  # temporary patch


This dealt with the problem effectively; robots.txt files are being
processed normally, and if reading one causes an authentication request,
it's handled as if no password was input, without any interaction.  

So this could probably go in.

----------------------------------------------------------------------

Comment By: John Nagle (nagle)
Date: 2007-04-21 16:53

Message:
Logged In: YES 
user_id=5571
Originator: NO

The attached patch was never integrated into the distribution.  This is
still broken in Python 2.4 (Win32), Python 2.5 (Win32), and Python 2.5
(Linux).  

This stalled an overnight batch job for us.  Very annoying.

Reproduce with:

import robotparser

url = 'http://mueblesmoraleda.com'  # whole site is password-protected
parser = robotparser.RobotFileParser()
parser.set_url(url + '/robots.txt')  # set_url() expects the robots.txt URL
parser.read()  # prompts for username and password


----------------------------------------------------------------------

Comment By: Wummel (calvin)
Date: 2003-09-29 13:24

Message:
Logged In: YES 
user_id=9205

http://www.robotstxt.org/wc/norobots-rfc.html specifies the
401 and 403 return code consequences as restricting the
whole site (i.e. disallow_all = 1).

For the password input, the patch looks good to me. In the
long term, robotparser.py should switch to urllib2.py
anyway, and it should handle Transfer-Encoding: gzip.
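
As a sketch of that urllib2 direction (fetch_robots is a
hypothetical helper, not part of the attached patch): urllib2
raises HTTPError on 401/403 instead of prompting, so the status
code can be mapped straight onto disallow_all/allow_all:

import urllib2

def fetch_robots(url):
    # urllib2 never prompts for credentials by default; an
    # unauthenticated 401 (or a 403) surfaces as HTTPError.
    try:
        return 200, urllib2.urlopen(url).read()
    except urllib2.HTTPError, e:
        return e.code, None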

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=813986&group_id=5470

