[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)

Mike Brown mike at skew.org
Fri Sep 17 07:39:11 CEST 2004


"Martin v. Löwis" wrote:
> > Are we in agreement on these points?
> 
> I think I have to answer "no". The % notation is not a quirk of the BNF.

That's not what I said at *all*.  The quirk of the BNF is a completely 
separate issue, and is this: BNF mandates that its terminals are integers, 
e.g. character ":" in a particular BNF-based grammar represents the value 58 
(in decimal). RFC 2396 makes use of the grammar to define the generic syntax, 
but stipulates (well, rfc2396bis clarifies that the intent was to stipulate) 
that the intent is to actually define the syntax in terms of characters, so 
the ":" in the grammar really does mean the colon character, in that spec.

So there is no disagreement there, really.
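To make the distinction concrete, here is a minimal Python illustration (the specific values are just the standard ASCII/Unicode assignments): under a byte-oriented reading of the BNF, the terminal ":" is the integer 58, whereas under RFC 2396's clarified intent it denotes the abstract colon character.

```python
# Byte-oriented reading of the BNF: the terminal ":" is an integer.
assert ord(':') == 58          # the colon's code value, in decimal
assert b':'[0] == 58           # as a byte, it literally *is* 58

# RFC 2396's clarified intent: the grammar is defined over characters,
# so ":" in the spec denotes the abstract character COLON (U+003A).
assert ':' == '\u003a'
```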

> >  -  A URL/URI consists of a finite sequence of Unicode characters;
> 
> No. An URI contains of a finite sequence of characters.

You are correct. This is stated in RFC 2396, and Martin Duerst and I pushed 
for rfc2396bis to settle upon a definition of character just to make it extra 
clear, so I should have known better.

> >  -  If given unicode, each character in the string directly represents
> >     a character in the URL/URI and needs no interpretation;
> 
> No. Only ASCII characters in the string need no interpretation. For
> non-ASCII characters, urllib needs to assume some escaping mechanism.

Err, no. Let me start over. The question is: what do we do with a unicode 
object given as the 'url' argument in urllib.urlopen(), etc.?

Assumption 1:
  Resolution to absolute form and subsequent dereferencing of a
  character sequence that is intended to identify a resource,
  in order to be performed in a manner that is conformant with
  [pick one: RFC 1630, RFC 1738, RFC 1808, RFC 2396, the RFC that
  rfc2396bis will likely become, or the RFC that the IRIs draft will
  likely become], requires that the character sequence actually *be*
  [depending on which spec you chose] a URL, a URI reference, or 
  an IRI reference. Those standards do not define how to resolve &
  dereference other types of resource identifiers, be they character
  sequences or otherwise.

Assumption 2:
  The aforementioned standards unambiguously define the syntax to which a
  resource-identifying character sequence must conform in order to be
  considered a URL, a URI reference, or an IRI reference. The standards
  do not define how character sequences that do not conform to the syntax
  can be processed (but they do not forbid such processing; they just say
  that they aren't applicable to those situations).

Assumption 3:
  When an argument is given to an RFC 1808-era URL resolution function
  that is documented as requiring that the argument be [an object that
  represents] a 'URL', then the caller implicitly asserts that whatever
  object passed indeed represents a URL.

Assumption 4:
  The object passed into the function, of course, is going to manifest
  relatively concretely, as, say, a Python str or unicode object, so
  the function, if it intends to perform standards-conformant resolution,
  must behave as if it has interpreted the object as a resource-identifying
  sequence of abstract characters, and must verify somehow that the sequence
  adheres to the syntax requirements of a URL / URI ref / IRI ref. This
  verification can either be an explicit syntax check, or can be a feature
  of the interpretation of the object as resource-identifying characters.

In either case, we need to define the mechanics of that conversion. This 
is what I am attempting to unambiguously do for str and unicode arguments
by saying how each item in a str or unicode object maps to the characters
that are going to be treated as a URL/URI ref.
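As a sketch of the one-to-one interpretation being described here (the function name and the character class are illustrative, not taken from any stdlib API): each item in the argument is taken directly as one character of the candidate URI reference, and the resulting sequence is then checked against the RFC 2396 character repertoire.

```python
import re

# Characters permitted in a URI reference per RFC 2396 (plus '[' and ']'
# from RFC 2732, and '#' as the fragment delimiter). This coarse class
# is an illustrative approximation, not a full check of the generic
# URI grammar.
_URIC = re.compile(r"\A[A-Za-z0-9\-_.!~*'();/?:@&=+$,%\[\]#]*\Z")

def as_uri_characters(arg):
    """Interpret each item of `arg`, one-to-one, as a character of a
    candidate URI reference, and verify the character repertoire."""
    chars = str(arg)            # the identity mapping: item -> character
    if not _URIC.match(chars):
        raise ValueError('not a URI reference: %r' % (arg,))
    return chars
```

Under this interpretation, `as_uri_characters('http://m.v.l\xd6wis/')` raises ValueError rather than silently producing something URI-shaped.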

It is true that we are under no obligation in our API to assume a one-to-one 
mapping between the characters in a unicode argument and the characters in the 
resource-identifying string that, in turn, may or may not be a URL, but to do 
otherwise seems a bit unintuitive, to me. You seem to be suggesting that a 
one-to-one mapping be assumed until a syntax error is found. Then, if the 
syntax error is of a certain type (like a character above U+007F), then you 
seem to be saying that you want some kind of cleanup to be performed in order 
to ensure that the resulting string conforms to the URL syntax.

I feel that since urllib is under no obligation to assume anything about what 
the syntax-violating characters are intended to mean, it would be within its 
rights to reject the argument altogether, and I would rather see it do that 
than try to guess what the user intended -- especially in this domain, where 
such guesses, if wrong, only lead developers to be even more confused about 
topics that are already barely understood as it is.
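The reject-rather-than-guess policy argued for here could look something like this hypothetical wrapper (the name, the error text, and the placeholder return are all illustrative):

```python
def strict_urlopen(url):
    """Refuse to guess: reject arguments containing characters that
    cannot appear in a URI, instead of silently escaping them."""
    try:
        url.encode('ascii')     # a URI is drawn from a subset of ASCII
    except UnicodeEncodeError as e:
        raise ValueError(
            'argument is not a URI reference (non-ASCII character %r at '
            'position %d); percent-encode it yourself' % (url[e.start], e.start))
    # ... hand the now-plausible string to the real opener here ...
    return url   # placeholder standing in for urllib's actual dereferencing
```

The point of the explicit error message is pedagogical: the caller learns that the fix is theirs to make, rather than having a guess made on their behalf.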

For example, some specs (HTML, XHTML, XSLT) suggest that processors of those 
types of documents perform UTF-8 based percent-encoding of any non-ASCII 
characters that mistakenly appear in attribute values that are normally 
supposed to contain URI references (hrefs and the like). Users who rely on 
this then wonder why many widely-deployed HTTP servers/CGI/PHP apps, etc. -- 
the ones that assume %-encoded octets in the Request-URI are iso-8859-1 based 
-- misinterpret the characters. To me, the convenience afforded by the automatic
percent-encoding is outweighed by the harm introduced by the wrong guesses
and the reinforcement of the belief in the document author or developer that
a URI reference is whatever string of characters they want it to be.
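The ambiguity can be seen directly with Python's urllib.parse API (the `encoding` keyword shown here is a Python 3 addition, used purely to make the two interpretations explicit):

```python
from urllib.parse import quote, unquote

# The same character 'ö', percent-encoded under two different assumptions:
assert quote('\xf6', encoding='utf-8') == '%C3%B6'    # what HTML/XSLT suggest
assert quote('\xf6', encoding='latin-1') == '%F6'     # what many servers assume

# A server that assumes iso-8859-1 octets misreads the UTF-8 form:
assert unquote('%C3%B6', encoding='latin-1') == '\xc3\xb6'   # mojibake, not 'ö'
```

Neither encoding is wrong in isolation; the harm comes from the sender and receiver guessing differently, which is exactly why no guess is safe.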

I have a feeling this is a matter of personal philosophy. I've never been a 
huge fan of the "be lenient in what you accept, strict in what you produce" 
mantra. URLs/URIs have a strict syntax, and IMHO we should enforce it so that 
developers can learn about and code to standards, rather than becoming reliant 
upon the crutch of lenient-yet-convenient APIs.

But if we are going to accept arbitrary strings and then attempt to make 'em 
fit the URL syntax, then we should, IMHO, acknowledge (in API documentation) 
that this is behavior provided for the sake of having a convenient API, and is 
not within the scope of the standards. Hopefully the marginal percentage of 
developers who actually read the API docs can then learn that 
u'http://m.v.l\xd6wis/' is not a URL, even if urllib happens to convert it to 
one, and in my perfect fantasy-world, they'd be less inclined to give us any
reason to make lenient APIs. Actually, in a perfect world I probably would
not be inclined to obsess over such things :)

-Mike
