urlparse.urlparse bug - misparses long URL - FIX

Fri Dec 14 12:49:05 EST 2007

John Nagle wrote:
> John Nagle wrote:
>> Matt Nordhoff wrote:
>>> John Nagle wrote:
>>>> Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
>>>> ====
>> ...
>>
>>>
>>> It's breaking on the first slash, which just happens to be very late in
>>> the URL.
>>>
>>> >>> urlparse('http://example.com?blahblah=http://example.net')
>>> ('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
>>
>> That's what it seems to be doing:
>>
>> sa1 = 'http://example.com?blahblah=/foo'
>> sa2 = 'http://example.com?blahblah=foo'
>> print urlparse.urlparse(sa1)
>> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
>> print urlparse.urlparse(sa2)
>> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT
>>
>> That's wrong. RFC 3986 ("Uniform Resource Identifier (URI): Generic
>> Syntax"), page 23 says
>>
>>    "The characters slash ("/") and question mark ("?") may represent data
>>    within the query component.  Beware that some older, erroneous
>>    implementations may not handle such data correctly when it is used as
>>    the base URI for relative references (Section 5.1), apparently
>>    because they fail to distinguish query data from path data when
>>    looking for hierarchical separators."
>>
>> So "urlparse" is an "older, erroneous implementation".  Looking
>> at the code for "urlparse", it references RFC1808 (1995), which
>> was a long time ago, three revisions back.
>>
>> Here's the bad code:
>>
>> def _splitnetloc(url, start=0):
>>     for c in '/?#': # the order is important!
>>         delim = url.find(c, start)
>>         if delim >= 0:
>>             break
>>     else:
>>         delim = len(url)
>>     return url[start:delim], url[delim:]
>>
>> That's just wrong.  The domain ends at the first appearance of
>> any character in '/?#', but that code returns the text before the
>> first '/' even if there's an earlier '?'.  A URL/URI doesn't
>> have to have a path, even when it has query parameters. 

OK, here's a fix to "urlparse", replacing _splitnetloc.  I didn't use
a regular expression because "urlparse" doesn't import "re", and I didn't
want to change that.

def _splitnetloc(url, start=0):
    delim = len(url)  # end of the domain part of url; default is end of string
    for c in '/?#':  # look for delimiters; the order is NOT important
        wdelim = url.find(c, start)  # first occurrence of this delimiter
        if wdelim >= 0:  # if found
            delim = min(delim, wdelim)  # keep the earliest delimiter position
    return url[start:delim], url[delim:]  # return (domain, rest)
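
A quick way to see the difference is to run both versions side by side on
the part of the URL that follows the scheme. This is only a sketch: the
names splitnetloc_old and splitnetloc_new are made up for the comparison,
and the start=2 argument assumes the caller skips the leading "//", as
"urlparse" appears to do.

```python
def splitnetloc_old(url, start=0):
    # Original logic: uses the first '/' if one exists anywhere,
    # and only falls back to '?' and then '#' when there is no '/'.
    for c in '/?#':
        delim = url.find(c, start)
        if delim >= 0:
            break
    else:
        delim = len(url)
    return url[start:delim], url[delim:]


def splitnetloc_new(url, start=0):
    # Replacement logic: uses whichever delimiter appears earliest.
    delim = len(url)
    for c in '/?#':
        wdelim = url.find(c, start)
        if wdelim >= 0:
            delim = min(delim, wdelim)
    return url[start:delim], url[delim:]


rest = '//example.com?blahblah=/foo'  # the URL with 'http:' already removed
print(splitnetloc_old(rest, 2))  # ('example.com?blahblah=', '/foo')
print(splitnetloc_new(rest, 2))  # ('example.com', '?blahblah=/foo')
```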

I'll put this in the tracker once I can get back in; password changes still go
through SourceForge, even though the tracker isn't there.

Note: the unit test in "urlparse" fails in the standard Python 2.4 version.
So the unit test needs fixing.  Also, some of the bad cases above should
be added to the unit test.

					John Nagle
					SiteTruth