[ python-Bugs-918368 ] urllib doesn't correct server returned urls

Wed Mar 31 00:44:33 EST 2004

Bugs item #918368, was opened at 2004-03-17 15:41
Message generated for change (Comment added) made by mike_j_brown
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=918368&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Rob Probin (robzed)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib doesn't correct server returned urls

Initial Comment:
On a URL request where the server returns a URL with spaces in, 
urllib doesn't correct it before requesting the new page.

I think this is technically a server error, however, it does work 
from web browsers (Mozilla, Safari) but not from Python urllib.

I would suggest that when urllib is following "moved temporarily" 
links (or similar) from a server it translates spaces to %20.

See example program file for more including detailed server/client 
transactions text.

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-03-30 22:44

Message:
Logged In: YES 
user_id=371366

In my previous example, I forgot to add a tilde to the "safe" 
characters. Tilde is an "unreserved" character: one of a class of 
characters that are allowed in a URI and for which percent-encoding 
is discouraged.

    newurl = quote(newurl, safe="/:=&?~#+!$,;'@()*[]")

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-03-30 21:45

Message:
Logged In: YES 
user_id=371366

I agree that it is a server error to put something that doesn't 
meet the syntactic definition of a URI in the Location header 
of a response. I don't see any harm in correcting obvious 
errors, though, in the interest of usability.

As for your proposed fix, instead of just correcting spaces, I 
would do

    newurl = quote(newurl, safe="/:=&?#+!$,;'@()*[]")

quote() is a urllib function that does percent-encoding. It is 
way out of date and does poorly with Unicode strings, but if 
called with the above arguments, it should safely clean up 
most mistakes. The set of additional "safe" characters I am 
passing in is the complete set of "reserved" characters 
according to the latest draft of RFC 2396bis; these are 
characters that definitely do or might possibly have special 
meaning in a URL and thus should not be percent-encoded 
blindly.
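
As a rough illustration of the call above, here is a minimal sketch. It 
assumes Python 3, where quote() lives in urllib.parse rather than in 
urllib; the helper name clean_redirect_url and the example.com URLs are 
hypothetical and not part of urllib.

```python
from urllib.parse import quote

# The "reserved" characters listed in the comment above, passed as "safe"
# so that quote() leaves them untouched.
RESERVED = "/:=&?#+!$,;'@()*[]"

def clean_redirect_url(newurl):
    """Hypothetical helper: percent-encode everything quote() does not
    consider safe, while keeping the reserved characters as they are."""
    return quote(newurl, safe=RESERVED)

# Spaces are encoded; the scheme, path separators and query syntax survive.
print(clean_redirect_url("http://example.com/my docs/report 1.html?id=3&x=a b"))
# -> http://example.com/my%20docs/report%201.html?id=3&x=a%20b
```

One thing to watch: "%" is not in the safe set, so a Location value that 
is already correctly encoded would be encoded a second time 
("a%20b" becomes "a%2520b").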

I am currently working on a urllib.quote() replacement for 
4Suite's Ft.Lib.Uri library, and will then see about getting the 
improvements folded back into urllib.

----------------------------------------------------------------------

Comment By: Rob Probin (robzed)
Date: 2004-03-18 15:38

Message:
Logged In: YES 
user_id=1000470

I've tested a change to redirect_internal(self, url, fp, errcode, errmsg, 
headers, data) in urllib.py that adds a single line,

    newurl = newurl.replace(" ", "%20")

after the basejoin() function call, and it appears to fix the problem.

This information is placed in the public domain.
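
The one-line change described in this comment can be sketched outside of 
urllib as follows. This is only an illustration, assuming Python 3 and 
using urllib.parse.urljoin as a stand-in for the basejoin() call; the 
function name resolve_redirect and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin

def resolve_redirect(base_url, location):
    """Hypothetical helper: resolve a Location header value against the
    URL that was requested, then escape any spaces left in the result."""
    newurl = urljoin(base_url, location)
    # The proposed fix: only spaces are rewritten, nothing else is touched.
    return newurl.replace(" ", "%20")

print(resolve_redirect("http://example.com/old/page.html",
                       "/new dir/my page.html"))
# -> http://example.com/new%20dir/my%20page.html
```

Because only spaces are replaced, a Location value that is already 
percent-encoded passes through unchanged.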

----------------------------------------------------------------------
