urlparse.urlparse bug - misparses long URL
John Nagle
nagle at animats.com
Fri Dec 14 11:43:42 EST 2007
John Nagle wrote:
> Matt Nordhoff wrote:
>> John Nagle wrote:
>>> Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
>>> ====
> ...
>
>>
>> It's breaking on the first slash, which just happens to be very late in
>> the URL.
>>
>> >>> urlparse('http://example.com?blahblah=http://example.net')
>> ('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
>
> That's what it seems to be doing:
>
> sa1 = 'http://example.com?blahblah=/foo'
> sa2 = 'http://example.com?blahblah=foo'
> print urlparse.urlparse(sa1)
> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
> print urlparse.urlparse(sa2)
> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT
>
> That's wrong. RFC 3986 ("Uniform Resource Identifier (URI): Generic
> Syntax"), page 23 says
>
> "The characters slash ("/") and question mark ("?") may represent data
> within the query component. Beware that some older, erroneous
> implementations may not handle such data correctly when it is used as
> the base URI for relative references (Section 5.1), apparently
> because they fail to distinguish query data from path data when
> looking for hierarchical separators."
>
> So "urlparse" is an "older, erroneous implementation". Looking
> at the code for "urlparse", it references RFC1808 (1995), which
> was a long time ago, three revisions back.
>
> Here's the bad code:
>
> def _splitnetloc(url, start=0):
>     for c in '/?#': # the order is important!
>         delim = url.find(c, start)
>         if delim >= 0:
>             break
>     else:
>         delim = len(url)
>     return url[start:delim], url[delim:]
>
> That's just wrong. The domain ends at the first appearance of
> any character in '/?#', but that code returns the text before the
> first '/' even if there's an earlier '?'. A URL/URI doesn't
> have to have a path, even when it has query parameters.
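The rule described above (the network location ends at the earliest occurrence of any of '/', '?' or '#') can be sketched without regular expressions. This is only an illustration, not the library's code: the name split_netloc_earliest is hypothetical, and the snippet assumes Python 3 syntax for print.

```python
def split_netloc_earliest(url, start=0):
    """Split url at the earliest of '/', '?' or '#' found at or after start.

    Hypothetical helper for illustration; not part of urlparse.
    """
    delim = len(url)  # default: no delimiter found, netloc runs to the end
    for c in '/?#':   # order does not matter here
        pos = url.find(c, start)
        if pos >= 0:
            delim = min(delim, pos)  # keep the earliest hit, whichever character it is
    return url[start:delim], url[delim:]


# The netloc starts right after the 'http://' prefix.
url = 'http://example.com?blahblah=/foo'
print(split_netloc_earliest(url, len('http://')))
# -> ('example.com', '?blahblah=/foo')
```

Taking the minimum over all three positions means a '?' that appears before the first '/' ends the domain, which is the case the quoted loop gets wrong.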
"urlparse" doesn't use regular expressions. Is there some good
reason for that? It would be easy to fix the code above with a
regular expression to break on any char in '/?#'. But
urlparse would have to import "re". Is that undesirable?
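A regular-expression version of that idea might look like the sketch below. It is an assumption about what such a fix could be, not the actual urlparse code: the names _NETLOC_END and split_netloc_re are hypothetical, and the snippet uses Python 3 syntax for print.

```python
import re

# Any one of the three characters that can end the network location.
_NETLOC_END = re.compile(r'[/?#]')


def split_netloc_re(url, start=0):
    """Split url at the first '/', '?' or '#' found at or after start.

    Hypothetical helper for illustration; not part of urlparse.
    """
    m = _NETLOC_END.search(url, start)  # search begins at index start
    delim = m.start() if m else len(url)
    return url[start:delim], url[delim:]


print(split_netloc_re('http://example.com?blahblah=/foo', len('http://')))
# -> ('example.com', '?blahblah=/foo')
```

The character class matches whichever delimiter comes first, so no ordering of '/', '?' and '#' is involved. The cost is the extra import of "re" in the module.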
John Nagle