[Python-Dev] urlparse brokenness

Mike Brown mike at skew.org
Mon Nov 28 06:07:08 CET 2005


Guido van Rossum wrote:
> IIRC I did it this way because the RFC about parsing urls specifically
> prescribed it had to be done this way.

That was true as of RFC 1808 (1995-1998), although the grammar actually 
allowed for a more generic interpretation. 

Such an interpretation was suggested in RFC 2396 (1998-2004) via a regular 
expression for parsing URI 'references' (a formal abstraction introduced in 
2396) into 5 components (not six, since 'params' were moved into 'path'
and eventually became an option on every path segment, not just the end
of the path). The 5 components are:

  scheme, authority (formerly netloc), path, query, fragment.

Parsing could result in some components being undefined, which is distinct 
from being empty (e.g., 'mailto:foo@bar?' would have an undefined authority 
and fragment, and a defined, but empty, query).
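
To make the undefined-vs-empty distinction concrete, here is a minimal sketch
in Python that applies the regular expression from RFC 3986 Appendix B (the
same one suggested in RFC 2396). The function name split_uri_ref is just an
illustrative name of mine, not an existing API, and this is not how urlparse
works today. Groups that do not participate in the match come back as None,
while a matched-but-empty component comes back as ''.

```python
import re

# Regular expression from RFC 3986, Appendix B (unchanged from RFC 2396).
# Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query and fragment.
URI_REF = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri_ref(uri_ref):
    """Split a URI reference into its 5 generic components.

    Hypothetical helper for illustration only. A component that is
    absent is returned as None; one that is present but empty is ''.
    """
    m = URI_REF.match(uri_ref)
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))

# Authority and fragment are undefined (None); query is defined but empty.
print(split_uri_ref('mailto:foo@bar?'))
# The scheme plays no role in how the rest is split.
print(split_uri_ref('foo://host/a;x/b;y?q=1#frag'))
```

Note that the path is never None with this expression; it can only be empty.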

RFC 3986 / STD 66 (2005-) did not change the regular expression, but makes 
several references to these '5 major components' of a URI, and says that these 
components are scheme-independent; parsers that operate at the generic syntax
level "can parse any URI reference into its major components. Once the scheme
is determined, further scheme-specific parsing can be performed on the
components."

> You have to know what the scheme means before you can
> parse the rest -- there is (by design!) no standard parsing for
> anything that follows the scheme and the colon.

Not since 1998, IMHO. It was implicit, at least since RFC 2396, that all URI 
references can be interpreted as having the 5 components; this was made 
explicit in RFC 3986 / STD 66.

> I don't even think
> that you can trust that if the colon is followed by two slashes that
> what follows is a netloc for all schemes.

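You can. In the generic syntax, a '//' right after the scheme's colon always 
introduces the authority component, whatever the scheme.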

> But if there's an RFC that says otherwise I'll gladly concede;
> urlparse's main goal in life is to be RFC compliant.

Its intent seems to be to split a URI into its major components, which are now 
by definition scheme-independent (and have been, implicitly, for a long time), 
so the function shouldn't distinguish between schemes.

Do you want to keep returning that 6-tuple, or can we make it return a 
5-tuple? If we keep returning 'params' for backward compatibility, then that 
means the 'path' we are returning is not the 'path' that people would expect 
(they'll have to concatenate path+params to get what the generic syntax calls 
a 'path' nowadays). It's also deceptive because params are now allowed on all 
path segments, and the current function only takes them from the last segment.
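
A small sketch of the problem, assuming Python 3's urllib.parse (the successor
to the urlparse module; I am assuming its urlparse() still splits params the
same way for http URLs). The helper generic_path is a made-up name for
illustration. It shows that params are taken only from the last segment, and
that the caller has to glue path and params back together to get the generic
'path'.

```python
from urllib.parse import urlparse

def generic_path(url):
    """Rebuild the generic-syntax 'path' from urlparse()'s path and params.

    Hypothetical helper. Caveat: a trailing ';' with empty params cannot
    be told apart from no params at all, so it is lost here.
    """
    parts = urlparse(url)
    if parts.params:
        return parts.path + ';' + parts.params
    return parts.path

parts = urlparse('http://example.org/a;x/b;y?q=1')
print(parts.path)    # ';x' on the first segment stays inside the path
print(parts.params)  # only the last segment's params are split off
print(generic_path('http://example.org/a;x/b;y?q=1'))
```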

Also for backward compatibility, should an absent component continue to 
manifest in the result as an empty string? I think a compliant parser should 
make a distinction between absent and empty (it could make a difference, in 
theory).

If a regular expression were used for parsing, it would produce None for 
absent components and empty-string for empty ones. I implemented it this
way in 4Suite's Ft.Lib.Uri and it works nicely.

Mike

