[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)

Fri Sep 17 11:02:39 CEST 2004

"Martin v. Löwis" wrote:
> > It is true that we are under no obligation in our API to assume a one-to-one 
> > mapping between the characters in a unicode argument and the characters in the 
> > resource-identifying string that, in turn, may or may not be a URL, but to do 
> > otherwise seems a bit unintuitive, to me.
> 
> Not at all. If the URI contains the sequence '%A0',
> does that constitute one or three characters?

Yes, it does. :)

I think I've got this right:

%A0 in a URI is three characters in the URI. Together they represent
one octet (byte A0) in much the same way that the 6 characters &#232;
represent a single small-e-with-grave character in ISO/IEC 10646-based markup
languages.

If the sequence were in the range %00-%7F, then the octet it represents
would in turn represent a single character in the ASCII range, and you would
be allowed to use equivalence rules and knowledge of the syntax in order to
ascertain whether the sequence is interchangeable with the raw character at
that position in the URI.

But since in this example it is in the range %80-%FF, the octet represented
by the sequence does not automatically represent a character; it represents,
at best, a scheme- or implementation-defined code unit which may or may not
be an encoded character or portion thereof.
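A rough illustration of that distinction, assuming Python 3 and its standard urllib.parse module (neither is part of the discussion above). The helper name as_character is made up for this sketch, and the choice of encodings to try is only an example:

```python
from urllib.parse import unquote_to_bytes

escaped = '%A0'
octets = unquote_to_bytes(escaped)   # three URI characters -> one octet

def as_character(data, encoding):
    """Hypothetical helper: decode octets, or return None if they
    are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# An escape in the %00-%7F range maps to exactly one ASCII character.
ascii_char = as_character(unquote_to_bytes('%7E'), 'ascii')

# An escape in the %80-%FF range has no meaning without an agreed encoding:
# a lone 0xA0 is not valid UTF-8, but it is NO-BREAK SPACE in ISO-8859-1.
utf8_attempt = as_character(octets, 'utf-8')
latin1_attempt = as_character(octets, 'iso-8859-1')
```

Here len(escaped) is 3 while len(octets) is 1, and whether that one octet is a character at all depends entirely on which encoding the scheme or implementation assumes.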

> You suggested earlier that the host part of an
> URI could be UTF-8 encoded. In that case, a single character translates
> into, say, 2 octets, which then get %-escaped, translating into 6 ASCII
> characters. So a single Unicode character may end up in multiple ASCII
> characters during processing.

That sounds right, but I think I need an example to understand where the
disagreement is. It's not a URI at the point where it contains a non-ASCII
character.

Theoretical resolution procedure of argument u'http://m.v.l\xf6wis/':

  arg         u'http://m.v.l\xf6wis/'
  => IRI ref  u'http://m.v.l\xf6wis/'
  => URI ref  u'http://m.v.l%C3%B6wis/'

and likewise, just for example,

  arg         u'http://m.v.l%C3%B6wis/'
  => IRI ref  u'http://m.v.l%C3%B6wis/'
  => URI ref  u'http://m.v.l%C3%B6wis/'
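The two conversions above can be sketched as follows, assuming Python 3. The function name iri_to_uri is hypothetical, and this version only handles the non-ASCII case shown here; it leaves every ASCII character, including existing percent-escapes, untouched:

```python
def iri_to_uri(iri):
    """Hypothetical helper: replace each non-ASCII character with the
    percent-encoded octets of its UTF-8 representation."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)                      # ASCII passes through as-is
        else:
            out.extend('%%%02X' % octet for octet in ch.encode('utf-8'))
    return ''.join(out)

from_iri = iri_to_uri(u'http://m.v.l\xf6wis/')
from_uri = iri_to_uri(u'http://m.v.l%C3%B6wis/')
```

Both calls end up with the same URI reference, which is the point of the second example: an argument that is already all-ASCII goes through unchanged.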

In any event, the argument has become the URI reference
u'http://m.v.l%C3%B6wis/' (which we don't necessarily need to store
as unicode, but I prefer to write it as such for clarity):

  1. Resolve to absolute form (necessary even with absolute refs
     in order to eliminate dot segments in the path; the rfc2396bis
     algorithm is preferable to the buggy ones in older specs for this).

     The base URI will be based on os.getcwd(). We'll say cwd is
     '/home/mike/test' to keep it simple. Base URI then is
     u'file:///home/mike/test'. Resolution to absolute form results
     in, in this case, no change: the URI represented by the URI ref
     is the same as the ref itself: u'http://m.v.l%C3%B6wis/'.

  2. URI is decomposed into its components:
       scheme: u'http'
       authority: u'm.v.l%C3%B6wis'
       path: u'/'
       query: undefined
       fragment: undefined

  3. Fragment, if any, is stripped prior to dereference, per specs.

  4. For http scheme, authority is split into:
       user: undefined
       pass: undefined
       host: u'm.v.l%C3%B6wis'
       port: u'80' (default)

  5. host is percent-decoded with a UTF-8 basis:
       host: u'm.v.l\xf6wis'

  6. socket object is obtained for host
     u'm.v.l\xf6wis' and port 80 (int);
     socket module applies IDNA encoding and does DNS lookup of
     'm.v.xn--lwis-5qa', connects to corresponding IP address on port 80

  7. properly formatted HTTP request message (a byte string)
     is sent for Request-URI '/' with Host header 'Host: m.v.xn--lwis-5qa'
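
The steps above can be sketched roughly as follows, assuming Python 3's urllib.parse and the built-in 'idna' codec. The function name plan_request, the fixed cwd default (standing in for os.getcwd()), and the GET / HTTP/1.1 request line are all assumptions made for illustration. Nothing is actually connected or sent; the IDNA step is done explicitly here rather than inside the socket module so the result is visible:

```python
from urllib.parse import urljoin, urlsplit, unquote

def plan_request(uri_ref, cwd='/home/mike/test'):
    """Hypothetical walk-through of steps 1-7; returns what would be
    looked up and sent instead of touching the network."""
    # 1. Resolve against a base URI derived from the working directory.
    uri = urljoin('file://' + cwd, uri_ref)
    # 2. Decompose into components.
    parts = urlsplit(uri)
    # 3. The fragment is simply never used below, i.e. it is stripped.
    # 4. Split the authority; fall back to the http default port.
    port = parts.port or 80
    # 5. Percent-decode the host on a UTF-8 basis.
    host = unquote(parts.hostname, encoding='utf-8', errors='strict')
    # 6. IDNA-encode the host to get the name used for the DNS lookup.
    dns_name = host.encode('idna').decode('ascii')
    # 7. Build the request message as a byte string.
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    request = ('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n'
               % (target, dns_name)).encode('ascii')
    return host, port, dns_name, request

host, port, dns_name, request = plan_request('http://m.v.l%C3%B6wis/')
```

For the example URI this gives host u'm.v.l\xf6wis', port 80, and the DNS name 'm.v.xn--lwis-5qa' used in the Host header.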

If the initial argument were a byte string, I agree that any non-ASCII
bytes should be percent-encoded directly. Processing would then be conducted
exactly as above.
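
A minimal sketch of the byte-string case, again assuming Python 3; the name bytes_to_uri is made up for illustration. Each byte above 0x7F is escaped as-is, with no guess about what encoding the bytes are in:

```python
def bytes_to_uri(raw):
    """Hypothetical helper: percent-encode every non-ASCII byte directly."""
    return ''.join(chr(b) if b < 128 else '%%%02X' % b for b in raw)

from_utf8 = bytes_to_uri(b'http://m.v.l\xc3\xb6wis/')
from_latin1 = bytes_to_uri(b'http://m.v.l\xf6wis/')
```

If the bytes happened to be UTF-8, the result is the same URI reference as before. If they were ISO-8859-1, the host would contain %F6 instead, and the UTF-8-based decoding of the host would presumably have to reject it or handle it some other way.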

-Mike