[XML-SIG] file urls in urllib

Mark D. Anderson mda@discerning.com
Wed, 7 Mar 2001 04:22:33 -0800


i'm definitely getting academic here. i think the appropriate handling for windows file:
urls is fairly clear, and urllib does not handle them properly. urllib's handling of unix-style
paths, while not what i consider the "right thing", is what everyone else does.

but....

suppose we agree that file:///c:/autoexec.bat should work (this is the case of a collapsed localhost).
then the processing model is: if a url starts with file:///, remove that prefix and use
the remainder as the path (removing only file:// would leave /c:/autoexec.bat, which is not a proper local file).
ok, now do that to file:///etc/passwd and you get etc/passwd.
so that means a parser has to look at c:/autoexec.bat and etc/passwd and conclude that because
the first segment looks like a drive letter, it is ok, while etc/passwd needs a leading slash.
if the slash after the host part were treated purely as a separator, then this heuristic would not be necessary.
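
the heuristic above can be sketched in python roughly as follows. this is only an illustration of the processing model, not what urllib does; the function name and the drive-letter regex are made up for the example.

```python
import re

# hypothetical pattern: one letter, a colon, then a slash or end of string
_DRIVE = re.compile(r"^[A-Za-z]:(/|$)")

def file_url_to_path_heuristic(url):
    """Strip 'file:///' and guess whether the leading slash must be restored."""
    prefix = "file:///"
    if not url.startswith(prefix):
        raise ValueError("not a file:/// url: %r" % url)
    rest = url[len(prefix):]
    if _DRIVE.match(rest):
        # first segment looks like a drive letter, so the remainder is ok as-is
        return rest
    # otherwise assume a unix-style path that needs its leading slash back
    return "/" + rest

print(file_url_to_path_heuristic("file:///c:/autoexec.bat"))
print(file_url_to_path_heuristic("file:///etc/passwd"))
```

the point is that the function has to inspect the content of the first segment to decide what to do, which is exactly the part i would rather avoid.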

i think it is fair to say that rfc1738 is ambiguous, since it only gives a vms example.
but nfs urls are clearly defined to match my "cleaner" notion of purely lexical url processing,
as per http://www.faqs.org/rfcs/rfc2396.html :
   Note that the initial "/" that introduces the <url-path> of an NFS
   URL must not be passed to the server for multi-component lookup since
   the pathname is to be evaluated relative to the public filehandle
   directory.  For example, if the public filehandle is associated with
   the server's directory "/a/b/c" then the URL:
        nfs://server/d/e/f
   will be evaluated with a multi-component lookup of the path
   "d/e/f" relative to the server's directory "/a/b/c" while
   the URL:
        nfs://server//a/b/c/d/e/f
   will locate the same file with an absolute multi-component lookup of
   the path "/a/b/c/d/e/f" relative to the server's filesystem root.
   Notice that a double slash is required at the beginning of the path.
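
as a rough sketch of that purely lexical reading (the function name is hypothetical, and applying the nfs rule to file: urls is my proposal, not something any rfc says), the single slash after the host is consumed as a separator and whatever follows is the path:

```python
def split_after_host(url, scheme):
    """Purely lexical split: the one '/' after the host is only a separator."""
    prefix = scheme + "://"
    if not url.startswith(prefix):
        raise ValueError("expected a %s:// url: %r" % (scheme, url))
    # everything up to the first '/' is the host (possibly empty);
    # the '/' itself is dropped; the rest is the path, untouched
    host, _sep, path = url[len(prefix):].partition("/")
    return host, path

for u, s in [("nfs://server/d/e/f", "nfs"),
             ("nfs://server//a/b/c/d/e/f", "nfs"),
             ("file:///c:/autoexec.bat", "file"),
             ("file:////etc/passwd", "file")]:
    print(u, "->", split_after_host(u, s))
```

under this reading no drive-letter guessing is needed: an absolute unix path simply needs the extra slash, just as in the nfs example.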

but wait, it gets worse.

we'd like certain functions to "just work" and handle either
a url or a local host path -- this is certainly what we'd like when we specify an
xml source on a command line. 
if so, then we'd also sometimes like to specify *relative urls* in some of those same
circumstances. and guess what? relative urls have no leading scheme and therefore
are lexically indistinguishable from some local host paths.
so in that case, if a processor sees etc/passwd, it should *not* add a leading slash,
since it is relative to either the current working directory or the current url base, whichever
you like; with a base of /usr it should instead look at /usr/etc/passwd or whatever.
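
to make the ambiguity concrete, here is a minimal sketch of how a command-line argument might be classified. all names are hypothetical, and two assumptions are baked in: a single letter followed by a colon is taken to be a windows drive rather than a scheme, and the base is treated as a directory and joined naively (not full relative-url resolution).

```python
import re

_DRIVE = re.compile(r"^[A-Za-z]:([/\\]|$)")          # assumed drive-letter form
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")   # generic scheme prefix

def classify(arg):
    """Label an argument as 'drive-path', 'url', 'absolute', or 'relative'."""
    if _DRIVE.match(arg):
        return "drive-path"
    if _SCHEME.match(arg):
        return "url"
    if arg.startswith("/"):
        return "absolute"
    # no scheme and no leading slash: relative url or relative local path,
    # the two cannot be told apart lexically
    return "relative"

def resolve(arg, base):
    """Join a relative argument onto a base (cwd or url base); else pass through."""
    if classify(arg) != "relative":
        return arg
    return base.rstrip("/") + "/" + arg

print(classify("file:///etc/passwd"), classify("etc/passwd"))
print(resolve("etc/passwd", "/usr"))
```

note that nothing here adds a leading slash to etc/passwd on its own; it only becomes absolute by way of the base.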

so if we'd like to follow the non-rfc convention that a file:foobar url is allowed,
without the net_loc part of the url, then we should say that file:etc/passwd is
a relative url while file:/etc/passwd is absolute.
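
a small sketch of that convention, assuming the non-rfc file: form without a net_loc (the function name and the returned labels are invented for illustration):

```python
def classify_file_url(url):
    """Return (kind, path) for a file: url that has no net_loc part."""
    if not url.startswith("file:"):
        raise ValueError("not a file: url: %r" % url)
    rest = url[len("file:"):]
    if rest.startswith("//"):
        # a net_loc is present; that case is outside this convention
        return ("has-net_loc", rest)
    if rest.startswith("/"):
        return ("absolute", rest)
    return ("relative", rest)

print(classify_file_url("file:etc/passwd"))
print(classify_file_url("file:/etc/passwd"))
```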

regardless, i think the policy should be independent of the operating system of the
server. that is, the url file:///c:/autoexec.bat should look for the file c:/autoexec.bat
on unix systems as well. It should be a purely lexical operation.
this is incidentally one of the annoying features of the rfc for imap urls, where in
their infinite wisdom they did not designate a standardized hierarchy separator, nor
even a url parameter to indicate one -- it is entirely up to the server to interpret.
this means no url processing library can do anything with an imap url by itself.

-mda