[Python-Dev] Some more comments re new uriparse module, patch 1462525

Sat Jun 3 01:19:01 CEST 2006

[Not sure whether this kind of thing is best posted as tracker comments 
(but then the tracker gets terribly long and is mailed out every time a 
change happens) or posted here.  Feel free to tell me I'm posting in the 
wrong place...]

Some comments on this patch (a new module, submitted by Paul Jimenez, 
implementing the rules set out in RFC 3986 for URI parsing, joining URI 
references with a base URI etc.)

http://python.org/sf/1462525

Sorry for the pause, Paul.  I finally read RFC 3986 -- which I must say is 
probably the best-written RFC I've read (and there was much rejoicing).

I still haven't read 3987 and got to grips with the unicode issues 
(whatever they are), but I have just implemented the same stuff you did, 
so have some comments on non-unicode aspects of your implementation (the 
version labelled "v23" on the tracker):

Your urljoin implementation seems to pass the tests (the tests taken from 
the RFC), but I have to I admit I don't understand it :-)  It doesn't seem 
to take account of the distinction between undefined and empty URI 
components.  For example, the authority of the URI reference may be empty 
but still defined.  Anyway, if you're taking advantage of some subtle 
identity that implies that you can get away with truth-testing in place of 
"is None" tests, please don't ;-) It's slower than "is [not] None" tests 
both for the computer and (especially!) the reader.

I don't like the use of module posixpath to implement the algorithm 
labelled "remove_dot_segments".  URIs are not POSIX filesystem paths, and 
shouldn't depend on code meant to implement the latter.  But my own 
implementation is exceedingly ugly ATM, so I'm in no position to grumble 
too much :-)

Normalisation of the base URI is optional, and your urljoin function
never normalises.  Instead, it parses the base and reference, then
follows the algorithm of section 5.2 of the RFC.  Parsing is required
before normalisation takes place.  So urljoin forces people who need
to normalise the URI before to parse it twice, which is annoying.
There should be some way to parse 5-tuples in instead of URIs.  E.g.,
from my implementation:

def urljoin(base_uri, uri_reference):
     return urlunsplit(urljoin_parts(urlsplit(base_uri),
                                     urlsplit(uri_reference)))

It would be nice to have a 5-tuple-like class (I guess implemented as a 
subclass of tuple) that also exposes attributes (.authority, .path, etc.) 
-- the same way module time does it.

The path component is required, though may be empty.  Your parser
returns None (meaning "undefined") where it should return an empty
string.

Nit: Your tests involving ports contain non-digit characters in the
port (viz. "port"), which is not valid by section 3.2.3 of the RFC.

Smaller nit: the userinfo component was never allowed in http URLs,
but you use them in your tests.  This issue is outside of RFC 3986, of
course.

Particularly because the userinfo component is deprecated, I'd rather
that userinfo-splitting and joining were separate functions, with the
other functions dealing only with the standard RFC 3986 5-tuples.

DefaultSchemes should be a class attribute of URIParser

The naming of URLParser / URIParser is still insane :-)  I suggest
naming them _URIParser and URIParser respectively.

I guess there should be no mention of "URL" anywhere in the module --
only "URI" (even though I hate "URI", as a mostly-worthless
distinction from "URL", consistency inside the module is more
important, and URI is technically correct and fits with all the
terminology used in the RFC).  I'm still heavily -1 on calling it
"uriparse" though, because of the highly misleading comparison with
the name "urlparse" (the difference between the modules isn't the
difference between URIs and URLs).

Re your comment on "mailto:" in the tracker: sure, I understand it's not 
meant to be public, but the interface is!  .parse() will return a 4-tuple 
for mailto: URLs.  For everything else, it will return a 7-tuple.  That's 
silly.

The documentation should explain that the function of URIParser is
hiding scheme-dependent URI normalisation.

Method names and locals are still likeThis, contrary to PEP 8.

docstrings and other whitespace are still non-standard -- follow PEP 8
(and PEP 257, which PEP 8 references) Doesn't have to be totally rigid
of course -- e.g. lining up the ":" characters in the tests is fine.

Standard stdlib form documentation is still missing.  I'll be told off
if I don't read you your rights: you don't have to submit in LaTeX
markup -- apparently there are hordes of eager LaTeX markers-up
lurking ready to pounce on poorly-formatted documentation <wink>

Test suite still needs tweaking to put it in standard stdlib form

John