[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'

Thu Sep 16 02:10:17 CEST 2004

"Martin v. Löwis" wrote:
> Mike Brown wrote:
> > 1. urlopen() cannot reliably process unicode unless there are no
> >    percent-encoded octets above %7F and no characters above \u007f
> >    (I think that's the gist of it, at least).
> 
> And that feature is by design. URLs are conceptually byte strings,
> not character strings, so passing Unicode strings is mostly a
> meaningless operation.

No. The intent is actually that a URI is (not conceptually, just *is*) a 
string of characters; the syntax is only defined in terms of bytes due to 
peculiarities of the grammar. A percent-encoded sequence conceptually 
represents an encoded character, or part of one in the case of multibyte 
encodings, that may or may not be allowed by the syntax to appear as a literal 
character in that part of the URI.

This was actually clear in RFC 2396 sections 1.5 and 2, but has been explained 
somewhat better in the rephrased section 2 of rfc2396bis, which is in Last 
Call.

As for what was by design: a unicode url arg fails relatively deep in the 
processing (generally when it gets handed to urllib.unquote or to a resolver), 
it isn't ASCII-fied first, none of this is documented, and urlopen() seems to 
be designed as a URL-or-filepath opener. All of that indicates to me that this 
'design' isn't very deliberate.

> Mostly - because if the Unicode string is
> pure ASCII, it probably matches most implementations and user
> expectations to convert it to pure ASCII first, and then treat it
> as a URL.

Well, we can take it for granted that an object that purports to be a URI must 
consist only of characters from a limited subset of ASCII. If the object is 
unicode, then there is no ambiguity about what each item in the sequence 
means: it's just a character, and it must be in the allowed set in order to be 
interpreted unambiguously. So unicode is actually the ideal type of argument 
to urlopen(). If the object is a byte str, then we can pretty much assume that 
each byte represents its ASCII equivalent and is subject to the same 
restrictions, although this should be documented, lest someone pass in a UCS-2 
or UTF-16 string expecting its characters to be magically decoded.
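
A minimal sketch of that ASCII-only rule, written for present-day Python 3
(where str is the character type and bytes is the byte type) rather than for
the urllib under discussion. The helper name as_ascii_uri is hypothetical and
is not part of urllib:

```python
def as_ascii_uri(url):
    """Return url as a character string, insisting that it is pure ASCII.

    Hypothetical helper: bytes are assumed to be ASCII-encoded characters,
    and anything outside ASCII is rejected rather than guessed at.
    """
    if isinstance(url, bytes):
        # Raises UnicodeDecodeError for e.g. UTF-16 data with a BOM.
        return url.decode('ascii')
    # Raises UnicodeEncodeError if any character is above U+007F.
    url.encode('ascii')
    return url
```

Both failures are subclasses of UnicodeError, so a caller can catch that one
exception type to reject an argument early instead of deep in the processing.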

The question is, does the url argument to urlopen() purport to be or is it 
assumed to be a URL? The function is quite lenient about what it accepts as a 
URL -- it accepts pretty much anything you give it, be it unicode or str, with 
or without a scheme component, relative to some unknown base, and loaded with 
illegal characters, and it tries to deal with it as best it can -- yet it 
still rejects or inconsistently handles some valid URIs, and this is what I 
want to see changed.

Perhaps I should rephrase part of the issue this way: If the argument to 
urlopen() is assumed to be a URI, then %FF in the argument should not be 
interpreted any differently when the argument is a str vs when it is unicode. 
RFC 2396 left it ambiguous as to what characters are represented by %80-%FF, 
so an implementation may interpret them as it pleases, but the current 
implementation doesn't do so in a consistent manner.
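
To illustrate the ambiguity, here is a small sketch using present-day
Python 3's urllib.parse (not the 2004 urllib this thread is about). The same
%FF yields one octet, and which character that octet stands for depends
entirely on the encoding the caller picks:

```python
from urllib.parse import unquote, unquote_to_bytes

# The percent-encoded sequence names exactly one octet.
raw = unquote_to_bytes('%FF')

# Read as ISO-8859-1, that octet is U+00FF.
as_latin1 = unquote('%FF', encoding='latin-1')

# Read as UTF-8, a lone 0xFF is invalid and becomes U+FFFD.
as_utf8 = unquote('%FF', encoding='utf-8', errors='replace')
```

Whichever interpretation is chosen, the point is that it should not vary with
the type of the argument.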

> IETF is working on resolving the issue, by introducing IRIs. It
> appears that draft-duerst-iri-09.txt is what will become the relevant
> RFC. Once the RFC is published, urllib and urllib2 should be updated
> to support IRIs; contributions are welcome.
>
> > I don't think this is necessarily a bug, as a proper URI will never contain 
> > non-ASCII characters. However since urlopen()'s API is unfortunately such that 
> > it accepts OS-specific filesystem paths, which nowadays may be unicode, it may 
> > be time to tighten up the API and say that the url argument *must* be a URI, 
> > and that if unicode is given, it will be converted to str and thus must not 
> > contain non-ASCII characters.
> 
> No. I'd rather prefer to specify that if it is a Unicode string, it
> must be an IRI, and is converted to an URI according to the IRI spec.

OK, that's probably a good way to go about it.

You should note however that percent-encoded sequences are legal in IRIs and 
pass through unchanged in the conversion to URI, so this does not solve the 
problem of how they are interpreted (i.e. the %80-%FF pass-through in certain 
situations). In an IRI that you construct yourself, you are much less likely 
to ever see a percent-encoded octet, but nevertheless, being a superset of 
URI, any IRI may contain them.
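
A rough sketch of the pass-through, in present-day Python 3. The function
name iri_to_uri is hypothetical, and the sketch assumes the simplest form of
the mapping: every non-ASCII character is percent-encoded as UTF-8 and
everything else is left alone. It ignores any special treatment of the host
component and any normalization step:

```python
def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8; leave ASCII untouched.

    Hypothetical, simplified mapping: existing percent-encoded sequences
    such as %FF are plain ASCII and therefore pass through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend('%%%02X' % b for b in ch.encode('utf-8'))
    return ''.join(out)
```

With this sketch, a %FF that was already in the IRI survives into the URI
untouched, so the question of what it means is still open afterwards.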

> > 2. urlopen() (the URI scheme-specific openers it uses, actually) does not
> >    percent-decode the host portion of a URL before doing a DNS lookup.
> > 
> > This wasn't really a problem until IDNs came along; no one was using non-ASCII 
> > in their hostnames. But now we have to deal with URLs where the host component
> > is a string of percent-encoded UTF-8 octets.
> 
> Hmm. I think there is no backup in any standard for doing that.

OK, you're right; it was in an IETF draft of its own (draft-uri-idn-something) 
and in February of this year was folded into rfc2396bis. The representation of 
IDNs in URIs is indeed currently restricted to IDNA (RFC 3490), because 
RFC 2396 forbids percent-encoding in hostnames.

I sometimes forget which aspects of rfc2396bis are changes from RFC 2396 and 
its predecessors, and which are clarifications / bugfixes.
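
For concreteness, this is roughly what I have in mind for the host component,
sketched in present-day Python 3. The function name host_for_dns is
hypothetical; it assumes the percent-encoded octets are UTF-8 and relies on
Python's built-in 'idna' codec for the IDNA conversion:

```python
from urllib.parse import unquote

def host_for_dns(host):
    """Percent-decode a host as UTF-8, then convert it to its IDNA form.

    Hypothetical helper: invalid UTF-8 raises UnicodeDecodeError rather
    than being passed on to the resolver.
    """
    decoded = unquote(host, encoding='utf-8', errors='strict')
    return decoded.encode('idna').decode('ascii')
```

A host that is already plain ASCII comes back unchanged, so the extra step
should be harmless for existing URLs.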

> Applications that put URL-escaped UTF-8 bytes into host names deserve to
> lose.

Come February or whenever rfc2396bis and the IRI draft become RFCs, that
will no longer be a position you can maintain.

> There are two valid ways for putting non-ASCII characters into the
> hostname part of an URL: use Unicode strings, or use IDNA. It may be
> that IRIs add another way (I haven't checked this aspect specifically)

They do, by virtue of a reference to "RFCYYYY", which is a placeholder for
the RFC that the rfc2396bis draft will become, pending approval.

> but unless there is some RFC supporting such a protocol, any response
> by urllib is fine, exceptions preferred.

Consider it a feature request then.

> > urllib's urlopeners were *not* updated accordingly. This should be changed. 
> 
> The change was deliberately deferred until the IRI RFC is published.

OK.

> > 3. On Windows, urlopen() only recognizes '|' as a Windows drivespec character, 
> >    whereas ':' is just as, if not more, common in 'file' URIs.
> 
> I have long ago given up trying to understand this issue. I'm happy to
> change this forth and back about once or twice a year, until somebody
> comes up with a clear and definitive story, backed up by standards and
> product documentation, so that we might get a stable implementation some
> day. Feel free to write patches.

OK, a few points to understand:

- There is no canonical form of 'file' URI for any OS path.
  All conventions are established by implementations.

- 'file' as a URL scheme is very vaguely specified.
  It is being revised now but the revision may not be any better,
  from what I've seen so far on the mailing list for it.

- No RFC disallows ":" in the path component of any URL,
  except when it needs to appear in the first segment of the path
  component of what is now called a relative URI reference, when
  that path component is hierarchical (as determined by the scheme).
  In that situation, the segment must be prefixed with './' in order
  to ensure that it is interpreted correctly.

Thus 'C:/autoexec.bat' as a URI reference (like in an href in an HTML doc) 
must be interpreted as scheme 'C' (not 'file'), and (by RFC 2396) 
non-hierarchical path '/autoexec.bat' or (by rfc2396bis) authority/hostname 
autoexec.bat, path ''. In either case it shouldn't be resolvable.

Meanwhile, './C:/autoexec.bat' is scheme <inherited from base URI>,
authority <inherited from base URI>, path './C:/autoexec.bat',
which is much less ambiguous.
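
The difference is easy to see with a generic URI splitter. This sketch uses
urlsplit from present-day Python 3's urllib.parse; it shows only how that one
parser divides the string, and its notion of the path may not match either
RFC reading above exactly:

```python
from urllib.parse import urlsplit

# The drive letter is taken as the scheme (urlsplit lowercases it).
bare = urlsplit('C:/autoexec.bat')

# With a leading './', the first segment cannot be mistaken for a scheme.
dotted = urlsplit('./C:/autoexec.bat')

# With '|' there is no colon at all, so no scheme is found either.
piped = urlsplit('C|/autoexec.bat')

for parts in (bare, dotted, piped):
    print(repr(parts.scheme), repr(parts.path))
```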

Using '|' allows one to write 'C|/autoexec.bat' as a relative URI reference,
but that is, as far as I can tell, the only advantage to using it.

Let me be clear though - I am not suggesting getting rid of support for '|'.
I am merely saying that there is no reason ':' should, on Windows, fail to
be treated the same as '|' for the purpose of representing the ':' in a
drivespec.
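
A sketch of the behaviour I am asking for, in present-day Python 3. The
function name windows_path_from_url_path and the regular expression are
hypothetical and are not taken from urllib's own conversion code; the only
point is that ':' and '|' are accepted interchangeably after the drive
letter:

```python
import re
from urllib.parse import unquote

# Optional leading slashes, a drive letter, then either ':' or '|'.
_DRIVESPEC = re.compile(r'^/*([A-Za-z])[:|](.*)$')

def windows_path_from_url_path(url_path):
    """Turn the path of a 'file' URL into a Windows path, or return None.

    Hypothetical helper: '/C:/x' and '/C|/x' are treated identically.
    """
    match = _DRIVESPEC.match(url_path)
    if match is None:
        return None
    drive, rest = match.groups()
    return drive.upper() + ':' + unquote(rest).replace('/', '\\')
```

Both '/C|/autoexec.bat' and '/C:/autoexec.bat' map to the same path here,
while a path with no drivespec is left for the caller to handle.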