[Python-bugs-list] [ python-Bugs-548176 ] urlparse doesn't handle host?bla

noreply@sourceforge.net noreply@sourceforge.net
Sun, 17 Nov 2002 08:56:22 -0800


Bugs item #548176, was opened at 2002-04-24 10:36
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=548176&group_id=5470

Category: Python Library
Group: Python 2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Markus Demleitner (msdemlei)
Assigned to: Nobody/Anonymous (nobody)
Summary: urlparse doesn't handle host?bla

Initial Comment:
The urlparse module (at least in 2.2 and 2.1, Linux) doesn't
handle URLs of the form
http://www.maerkischeallgemeine.de?loc_id=49 correctly:
the query string ("?loc_id=49") ends up in the host.  I
didn't check the RFC, but in the real world URLs like
this do show up.  urlparse works fine when there's a
trailing slash on the host name:
http://www.maerkischeallgemeine.de/?loc_id=49

Example:
<pre>
>>> import urlparse
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de?loc_id=49")
('http', 'www.maerkischeallgemeine.de?loc_id=49', '', '', '', '')
</pre>

This has serious implications for urllib, since
urllib.urlopen will fail for URLs like the second one,
and with a pretty mysterious exception ("host not
found") at that.
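
A quick way to check whether a given interpreter is affected is
sketched below.  The helper name host_swallows_query is made up for
this illustration and is not part of urlparse or urllib; the import
fallback assumes the function lives in urllib.parse on newer Pythons
and in urlparse on older ones.

```python
try:
    from urllib.parse import urlsplit  # newer Pythons
except ImportError:
    from urlparse import urlsplit      # older Pythons


def host_swallows_query(url):
    """Hypothetical helper: True if the parsed host still contains '?'.

    A '?' inside the network location means the parser did not stop
    the host at the question mark, which is the behaviour reported here.
    """
    return "?" in urlsplit(url).netloc


if __name__ == "__main__":
    for url in ("http://www.maerkischeallgemeine.de/?loc_id=49",
                "http://www.maerkischeallgemeine.de?loc_id=49"):
        print(url, host_swallows_query(url))
```

On an affected interpreter the second URL should report True; on one
that stops the host at "?" both should report False.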

----------------------------------------------------------------------

Comment By: Jeff Epler (jepler)
Date: 2002-11-17 10:56

Message:

URLs of this form (no slash between the host and the "?") actually
appear to be permitted by RFC 2396
[http://www.ietf.org/rfc/rfc2396.txt].  See section 3.2:


3.2. Authority Component

   Many URI schemes include a top hierarchical element for a
naming authority, such that the namespace defined by the
remainder of the URI is governed by that authority.  This
authority component is typically defined by an
Internet-based server or a scheme-specific registry of
naming authorities.

      authority     = server | reg_name

   The authority component is preceded by a double slash
"//" and is terminated by the next slash "/", question-mark
"?", or by the end of the URI.  Within the authority
component, the characters ";", ":", "@", "?", and "/" are
reserved.
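
The termination rule quoted above could be sketched roughly as
follows.  This is only an illustration of the quoted sentence, not
the code in urlparse and not a proposed patch; the function name
split_authority is made up, and fragments ("#") and everything else
outside the quoted rule are deliberately ignored.

```python
def split_authority(uri):
    """Split uri into (prefix, authority, rest) per the quoted rule.

    The authority starts after the first "//" and ends at the next
    "/" or "?", or at the end of the string.  If there is no "//",
    the authority is returned as None.
    """
    prefix, sep, tail = uri.partition("//")
    if not sep:
        return prefix, None, ""
    end = len(tail)
    for terminator in "/?":
        pos = tail.find(terminator)
        if pos != -1:
            end = min(end, pos)
    return prefix, tail[:end], tail[end:]


if __name__ == "__main__":
    print(split_authority("http://www.maerkischeallgemeine.de/?loc_id=49"))
    print(split_authority("http://www.maerkischeallgemeine.de?loc_id=49"))
```

Under this rule both URLs from the report yield the same authority,
www.maerkischeallgemeine.de, with "/?loc_id=49" and "?loc_id=49"
respectively left over as the remainder.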

----------------------------------------------------------------------
