[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)

Fri Sep 17 09:54:21 CEST 2004

"Martin v. Löwis" wrote:
> >     If RFC 1808 applies (the current implementation is based largely
> >     on this spec, which did not clearly distinguish between a reference
> >     and a URI), it is what is defined in the grammar as a URL, and
> >     if it is relative (relativeURL in the grammar), it is considered
> >     to be relative to a default base URL.
> 
> This is troublesome. What is a meaningful base URL? This should be 
> mentioned prominently.

In effect, this is what happens in the current implementation, but I don't 
think it was ever anyone's intent to think of it in terms of standards-based 
resolution-to-absolute-form against a base URL, and in any event, it's not 
as well-documented as it should be.

User expectation in most contexts, even when it doesn't apply (as in the most 
prominent use of relative references: HTML/XML document processing), is that 
relative references are relative to a base having something to do with the 
current working directory of the URL processor. Wrong as it often is to make 
such an assumption, in the case of urlopen() we have no context that would 
define a base URL. The documented precedent is that the 'file' scheme is 
assumed, and the implementation, IIRC, is such that the relative path is run 
through url2pathname, which does very little to it, and it is then passed right 
to open(), so in effect the current working directory is assumed.
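
As a rough illustration of that reading, here is a minimal sketch in 
present-day Python 3 (urllib.parse and pathlib postdate this thread). It 
assumes a schemeless argument is treated as a relative reference and resolved 
against a default 'file' base URL built from the current working directory. 
resolve_like_urlopen is a hypothetical name, not something urllib provides, 
and the paths are made up.

```python
from pathlib import Path
from urllib.parse import urljoin


def resolve_like_urlopen(ref, base=None):
    """Resolve ref against a default 'file' base URL (hypothetical helper).

    When no base is given, assume one derived from the current working
    directory, which is the assumption described in the text above.
    """
    if base is None:
        base = Path.cwd().as_uri()
    # A trailing slash makes the directory itself the base, so the last
    # path segment is not dropped during resolution.
    if not base.endswith("/"):
        base += "/"
    return urljoin(base, ref)


# A relative reference picks up the scheme and directory of the base.
print(resolve_like_urlopen("data/notes.txt", "file:///home/user/"))
# An absolute URL comes back unchanged, whatever the base is.
print(resolve_like_urlopen("http://example.org/x", "file:///home/user/"))
```

Stating the base URL explicitly is what makes the behavior describable in 
terms of the standards: the result is the same absolute 'file' URL that 
standard relative resolution would produce.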

For the sake of having a sane policy going forward, I would rather see the 
behavior expressed in terms that would be governed by standards, which is what 
I attempted to do. Luckily, the behavior is such that it is possible.

There is an issue though: if disallowed/non-ASCII characters or bytes are in 
the urlopen() argument, and it's a relative URL, then right now the 
implementation is (I think) such that those characters or bytes pass through 
unchanged to the open() call. So if we do anything to these characters/bytes 
beforehand, such as %-encoding them as I think you were suggesting (see 
previous email), then for compatibility we'd have to specify that we're 
%-decoding them again in a way that results in the original characters/bytes 
being passed to open().
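
The round trip in question can be sketched as follows, again in present-day 
Python 3 and purely as an illustration. The byte string is an invented 
example, and quote_from_bytes / unquote_to_bytes stand in for whatever 
encoding step would actually be chosen.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Bytes as they might arrive in a relative URL argument: a space and
# some non-ASCII (UTF-8) bytes. The value is illustrative only.
raw = b"r\xc3\xa9sum\xc3\xa9 2004.txt"

# %-encode every byte outside the unreserved set, keeping "/" as is.
encoded = quote_from_bytes(raw, safe="/")

# %-decode back to bytes before the name would be handed to open().
decoded = unquote_to_bytes(encoded)

print(encoded)
print(decoded == raw)
```

The round trip is exact here because "%" itself is also encoded. An encoding 
step that left existing "%" characters alone would not be reversible in 
general, so a literal "%20" in a file name would come back as a space.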

> >     (This mostly describes current behavior, assuming we can reach
> >     agreement that the "C:" in the example above should be treated
> >     no differently than "C|").
> 
> I have no problem with that. There are no one-letter URL schemata,
> are there?

There aren't, although in principle I wish the API weren't so lenient.
People would quickly learn that C:\x\y\z is not a URL, that C:/x/y/z
can be interpreted by the standards in only one way (the one they
probably don't want), and that what they really need to do is learn
to use file:///blahblahblah.
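
For illustration, this is how a generic parser with no such leniency sees 
these strings. The sketch uses urlsplit from present-day Python 3's 
urllib.parse, not the urlopen() under discussion.

```python
from urllib.parse import urlsplit

# Parse each string purely by the generic URL grammar: whatever precedes
# the first ":" is taken as the scheme if it is made of scheme characters.
for candidate in (r"C:\x\y\z", "C:/x/y/z", "file:///C:/x/y/z"):
    parts = urlsplit(candidate)
    print(f"{candidate!r}: scheme={parts.scheme!r} path={parts.path!r}")
```

The drive letter comes out as a one-letter scheme in the first two cases. 
Only the file: form yields the 'file' scheme with the drive letter in the 
path.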

In 4Suite's Ft.Lib.Uri we needed to conduct strictly conformant processing of 
URI references in our DOM, XPath, XSLT, and HTTP implementations. I found that 
we could use urllib for hardly anything of this sort without a great deal 
of working around / closing up the holes opened by all these 'conveniences'.

Tightening up the conformance issues meant that we needed to help users 
produce valid URIs from filesystem paths and vice-versa. Once again, the core 
Python libs were of little use -- pathname2url and url2pathname are 
platform-dependent, and are so full of bugs^H^H^H^Hfeatures that I had to 
start from scratch and roll my own functions. I think what I've got at this 
point would make great additions to urllib2, but I'll save them for another 
day...
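
To show what a platform-independent conversion could look like, here is a 
minimal sketch in present-day Python 3. It is not the 4Suite code, which is 
not reproduced in this message. path_to_file_uri and file_uri_to_path are 
hypothetical names, and the sketch only handles absolute local paths. The path 
flavor is chosen by the caller rather than by the platform the code runs on.

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import quote, unquote, urlsplit


def path_to_file_uri(path, windows=False):
    """Convert an absolute local path to a file URI (hypothetical helper)."""
    pure = PureWindowsPath(path) if windows else PurePosixPath(path)
    if not pure.is_absolute():
        raise ValueError("need an absolute path: %r" % (path,))
    posix = pure.as_posix()  # forward slashes in both flavors
    if not posix.startswith("/"):
        posix = "/" + posix  # drive-letter paths: C:/x becomes /C:/x
    # Keep "/" and the drive colon literal; %-encode everything else
    # outside the unreserved set.
    return "file://" + quote(posix, safe="/:")


def file_uri_to_path(uri, windows=False):
    """Convert a local file URI back to a path (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "file" or parts.netloc not in ("", "localhost"):
        raise ValueError("not a local file URI: %r" % (uri,))
    path = unquote(parts.path)
    if windows:
        # Drop the slash before the drive letter, switch to backslashes.
        return str(PureWindowsPath(path.lstrip("/")))
    return path


print(path_to_file_uri(r"C:\My Docs\a.txt", windows=True))
print(file_uri_to_path("file:///C:/My%20Docs/a.txt", windows=True))
print(path_to_file_uri("/tmp/a b.txt"))
```

Because the pure path classes never touch the filesystem, the same call gives 
the same answer on any platform, which is the property that pathname2url and 
url2pathname lack.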

At least with all the "OKs" you've given so far, I can submit a patch or three 
to get some of the documentation updated.

> > I must attend to other things right now; will comment on the other issues 
> > later.
> 
> Take your time. This has been sitting around for many releases - one
> more or less doesn't matter much in the global flow of things :-)

Heh, agreed. I wish rfc2396bis and IRIs would hurry on through the IETF's 
machinery. I've only been actively paying attention to the former, but they
both have a lot going for them.