[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'

Wed Sep 15 23:40:01 CEST 2004

Mike Brown wrote:
> 1. urlopen() cannot reliably process unicode unless there are no
>    percent-encoded octets above %7F and no characters above \u007f
>    (I think that's the gist of it, at least).

And that feature is by design. URLs are conceptually byte strings,
not character strings, so passing Unicode strings is mostly a
meaningless operation. Mostly - because if the Unicode string is
pure ASCII, it probably matches most implementations and user
expectations to convert it to pure ASCII first, and then treat it
as a URL.
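
A minimal sketch of that rule, assuming modern Python 3 (where str is
the Unicode type and bytes is the byte string). The helper name
url_from_text is hypothetical and not part of urllib:

```python
def url_from_text(url):
    """Return url as an ASCII byte string, refusing non-ASCII input.

    Hypothetical helper: pure-ASCII text is converted to bytes and
    treated as a URL; anything else is rejected rather than guessed at.
    """
    if isinstance(url, bytes):
        # Already a byte string, nothing to convert.
        return url
    try:
        return url.encode("ascii")
    except UnicodeEncodeError as exc:
        raise ValueError("non-ASCII character in URL: %r" % url) from exc
```

With this, url_from_text("http://www.python.org/") yields the same
characters as bytes, while a string containing, say, an accented
letter raises ValueError.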

The IETF is working on resolving the issue by introducing IRIs. It
appears that draft-duerst-iri-09.txt is what will become the relevant
RFC. Once the RFC is published, urllib and urllib2 should be updated
to support IRIs; contributions are welcome.

> I don't think this is necessarily a bug, as a proper URI will never contain 
> non-ASCII characters. However since urlopen()'s API is unfortunately such that 
> it accepts OS-specific filesystem paths, which nowadays may be unicode, it may 
> be time to tighten up the API and say that the url argument *must* be a URI, 
> and that if unicode is given, it will be converted to str and thus must not 
> contain non-ASCII characters.

No. I'd rather specify that if it is a Unicode string, it
must be an IRI, and is converted to a URI according to the IRI spec.
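
A rough sketch of what such a conversion could look like in Python 3.
It assumes one plausible mapping (IDNA for the host, percent-encoded
UTF-8 for everything else); the exact rules would have to be taken
from the IRI spec once it is final. The function name iri_to_uri is
hypothetical, and userinfo and IPv6 literals are deliberately left out:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched outside the host: URI delimiters plus "%",
# so that escapes already present are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Convert an IRI (str) to an ASCII-only URI (str). Sketch only."""
    parts = urlsplit(iri)
    if "@" in parts.netloc or "[" in parts.netloc:
        raise ValueError("userinfo and IPv6 literals are not handled here")
    host = parts.hostname or ""
    # Assumption: the host is mapped with IDNA rather than percent-encoded.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += ":%d" % parts.port
    # Everything else: UTF-8 octets, percent-encoded where not allowed.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

A plain ASCII URL passes through unchanged, so existing callers would
not be affected by such a rule.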

> 2. urlopen() (the URI scheme-specific openers it uses, actually) does not
>    percent-decode the host portion of a URL before doing a DNS lookup.
> 
> This wasn't really a problem until IDNs came along; no one was using non-ASCII 
> in their hostnames. But now we have to deal with URLs where the host component
> is a string of percent-encoded UTF-8 octets.

Hmm. I think there is no backing in any standard for doing that.
Applications that put URL-escaped UTF-8 bytes into host names deserve to
lose. There are two valid ways for putting non-ASCII characters into the
hostname part of a URL: use Unicode strings, or use IDNA. It may be
that IRIs add another way (I haven't checked this aspect specifically),
but unless there is some RFC supporting such a protocol, any response
by urllib is fine, exceptions preferred.
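
To illustrate the two accepted forms and the "exceptions preferred"
behaviour, here is a small Python 3 sketch. The function ascii_host is
hypothetical, not an existing urllib API, and it ignores the use of
"%" in IPv6 zone identifiers:

```python
from urllib.parse import urlsplit


def ascii_host(url):
    """Return the host of url in ASCII (IDNA) form. Sketch only.

    Accepts a host given as Unicode text or already in IDNA form;
    raises ValueError for percent-encoded octets in the host.
    """
    host = urlsplit(url).hostname or ""
    if "%" in host:
        raise ValueError("percent-encoded octets in host: %r" % host)
    if not host:
        return ""
    # The built-in "idna" codec leaves all-ASCII labels as they are.
    return host.encode("idna").decode("ascii")
```

A Unicode host and its IDNA spelling both come out as the same ASCII
name, whereas a host such as 'b%C3%BCcher.example' is rejected.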

> Even though IDNs are the main application for percent-encoded octets in the
> host component, it is necessary in simpler cases as well, like
> 
>     'http://www.w%33.org'
> 
> which would need to be interpreted as
> 
>     'http://www.w3.org'

We would have to check: this might be valid usage, but I somewhat doubt
it.
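
For reference, the interpretation in question amounts mechanically to
the following (Python 3, using urllib.parse.unquote); whether a URL
opener is entitled to apply it to the host component is exactly what
would need checking:

```python
from urllib.parse import unquote

host = "www.w%33.org"
# Percent-decoding turns the escape %33 into the character "3".
decoded = unquote(host)
```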

> urllib's urlopeners were *not* updated accordingly. This should be changed. 

The change was deliberately deferred until the IRI RFC is published.

> 3. On Windows, urlopen() only recognizes '|' as a Windows drivespec character, 
>    whereas ':' is just as, if not more, common in 'file' URIs.

I gave up trying to understand this issue long ago. I'm happy to
change this back and forth about once or twice a year, until somebody
comes up with a clear and definitive story, backed up by standards and
product documentation, so that we might get a stable implementation some
day. Feel free to write patches.

Regards,
Martin