urlparse.urlparse bug - misparses long URL

Fri Dec 14 11:43:42 EST 2007

John Nagle wrote:
> Matt Nordhoff wrote:
>> John Nagle wrote:
>>> Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
>>> ====
> ...
> 
>>
>> It's breaking on the first slash, which just happens to be very late in
>> the URL.
>>
>> >>> urlparse('http://example.com?blahblah=http://example.net')
>> ('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
> 
> That's what it seems to be doing:
> 
> sa1 = 'http://example.com?blahblah=/foo'
> sa2 = 'http://example.com?blahblah=foo'
> print urlparse.urlparse(sa1)
> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
> print urlparse.urlparse(sa2)
> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT
> 
> That's wrong. RFC 3986 ("Uniform Resource Identifier (URI): Generic
> Syntax"), page 23 says
> 
>    "The characters slash ("/") and question mark ("?") may represent data
>    within the query component.  Beware that some older, erroneous
>    implementations may not handle such data correctly when it is used as
>    the base URI for relative references (Section 5.1), apparently
>    because they fail to distinguish query data from path data when
>    looking for hierarchical separators."
> 
> So "urlparse" is an "older, erroneous implementation".  Looking
> at the code for "urlparse", it references RFC1808 (1995), which
> was a long time ago, three revisions back.
> 
> Here's the bad code:
> 
> def _splitnetloc(url, start=0):
>     for c in '/?#': # the order is important!
>         delim = url.find(c, start)
>         if delim >= 0:
>             break
>     else:
>         delim = len(url)
>     return url[start:delim], url[delim:]
> 
> That's just wrong.  The domain ends at the first appearance of
> any character in '/?#', but that code returns the text before the
> first '/' even if there's an earlier '?'.  A URL/URI doesn't
> have to have a path, even when it has query parameters. 
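
     For illustration, here is a minimal sketch of a fix that needs no
regular expressions: take the earliest position of any of the three
delimiters instead of stopping at the first one found in loop order.
The name "split_netloc" is hypothetical, and the call below assumes the
function is handed the URL with the scheme already stripped and start=2
to skip the leading '//'.

```python
def split_netloc(url, start=0):
    # Hypothetical replacement for _splitnetloc: the netloc ends at the
    # EARLIEST occurrence of '/', '?' or '#', whichever comes first.
    delim = len(url)
    for c in '/?#':
        pos = url.find(c, start)
        if pos >= 0:
            delim = min(delim, pos)
    return url[start:delim], url[delim:]

# The '?' comes before the '/', so the netloc must stop at the '?'.
print(split_netloc('//example.com?blahblah=/foo', 2))
```

With this version the order of the characters in '/?#' no longer
matters, since every delimiter is checked before a position is chosen.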

     "urlparse" doesn't use regular expressions.  Is there some good
reason for that?  It would be easy to fix the code above with a
regular expression to break on any char in '/?#'.  But
urlparse would have to import "re".  Is that undesirable?
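
     A sketch of what the regular-expression version might look like,
assuming that importing "re" is acceptable.  The names "_netloc_end"
and "split_netloc_re" are made up for this example and are not part of
urlparse.

```python
import re

# Matches the first character that can terminate the network location.
_netloc_end = re.compile(r'[/?#]')

def split_netloc_re(url, start=0):
    # Search from 'start'; match positions are absolute indexes into url.
    m = _netloc_end.search(url, start)
    if m is None:
        delim = len(url)
    else:
        delim = m.start()
    return url[start:delim], url[delim:]

print(split_netloc_re('//example.com?blahblah=/foo', 2))
```

The character class breaks on whichever of '/', '?' or '#' appears
first, so a query with no preceding path is split off correctly.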

					John Nagle