[ python-Bugs-1457264 ] urllib.splithost parses incorrectly

SourceForge.net noreply at sourceforge.net
Sun Mar 26 23:00:34 CEST 2006


Bugs item #1457264, was opened at 2006-03-23 20:49
Message generated for change (Comment added) made by gbrandl
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1457264&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.3
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Steven Willis (onlynone)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.splithost parses incorrectly

Initial Comment:
urllib.splithost(url) requires that the URL passed in
be of the form '//host[:port]/path'. Yet I've run
across some URLs that are of the form
'//host[:port]?querystring'. For these, splithost
returns everything after the '//' as the host and an
empty string as the path.


Section 3.2 of RFC 2396 (Uniform Resource Identifiers:
Generic Syntax) states that 'The authority component is
preceded by a double slash "//" and is terminated by
the next slash "/", question-mark "?", or by the end of
the URI.'

Also, this is how it defines a URI:

absoluteURI   = scheme ":" ( hier_part | opaque_part )
hier_part     = ( net_path | abs_path ) [ "?" query ]
net_path      = "//" authority [ abs_path ]
abs_path      = "/"  path_segments

Based on the above, you could certainly have
'http://authority?query' as a valid URL.


In Python 2.3 you would just need to change line 939 in
urllib.py from:

        _hostprog = re.compile('^//([^/]*)(.*)$')

to:

        _hostprog = re.compile('^//([^/?]*)(.*)$')

This appears to affect all Python versions; I just
happened to be using 2.3.
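
To illustrate the difference between the two patterns quoted above, here is a
minimal sketch that applies each one to the same input. The helper name
split_host and the constant names are hypothetical and exist only for this
comparison; this is not the code of urllib itself, only the two regular
expressions from the report used side by side.

```python
import re

# The two patterns quoted in the report. The names are illustrative only.
OLD_HOSTPROG = re.compile('^//([^/]*)(.*)$')
NEW_HOSTPROG = re.compile('^//([^/?]*)(.*)$')


def split_host(pattern, url):
    """Hypothetical helper: return (host, rest) if url starts with '//',
    otherwise (None, url)."""
    match = pattern.match(url)
    if match:
        return match.group(1), match.group(2)
    return None, url


# With the old pattern the query string is swallowed into the host part.
print(split_host(OLD_HOSTPROG, '//host.com?a=b:3b'))
# With the proposed pattern the host stops at the '?'.
print(split_host(NEW_HOSTPROG, '//host.com?a=b:3b'))
# URLs that do contain a path are split the same way by both patterns.
print(split_host(OLD_HOSTPROG, '//host.com:80/path?x=1'))
print(split_host(NEW_HOSTPROG, '//host.com:80/path?x=1'))
```

Adding '?' to the excluded character class only changes the result for inputs
where a '?' appears before the first '/', so URLs that already had a path
should be unaffected.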

----------------------------------------------------------------------

>Comment By: Georg Brandl (gbrandl)
Date: 2006-03-26 21:00

Message:
Logged In: YES 
user_id=849994

Fixed in rev. 43330.

----------------------------------------------------------------------

Comment By: Steven Willis (onlynone)
Date: 2006-03-24 17:12

Message:
Logged In: YES 
user_id=1299996

The problem I was having specifically was that the url had a
colon in the query string. Since the query string was being
parsed as part of the host, the text after the colon was
being treated as the port when urllib.splitport was called
later. The following is a simple testcase:

import urllib2
webpage = urllib2.urlopen("http://host.com?a=b:3b")

You will then get a "httplib.InvalidURL: nonnumeric port: '3b'"
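
The chain of events described in this comment can be sketched roughly as
follows. The function naive_host_port is a hypothetical, simplified stand-in
for the host/port handling, not the actual urllib or httplib code, and it
raises a plain ValueError rather than httplib.InvalidURL. It only shows why a
colon inside a mis-parsed "host" ends up being read as a port separator.

```python
def naive_host_port(hostport, default_port=80):
    """Hypothetical stand-in: split at the last colon and require the
    text after it to be a numeric port."""
    host, sep, port = hostport.rpartition(':')
    if not sep:
        # No colon at all: the whole string is the host.
        return hostport, default_port
    try:
        return host, int(port)
    except ValueError:
        raise ValueError("nonnumeric port: %r" % port)


# A correctly split host works as expected.
print(naive_host_port('host.com'))
print(naive_host_port('host.com:8080'))

# The mis-parsed "host" from the report still contains the query string,
# so the text after its colon is taken for a port.
try:
    naive_host_port('host.com?a=b:3b')
except ValueError as exc:
    print(exc)
```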

----------------------------------------------------------------------
