[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'

Wed Sep 15 23:04:16 CEST 2004

Over the last couple of years, while implementing an RFC 2396 and RFC 2396bis 
compliant URI library for 4Suite, I've amassed a sizable list of, um, 
complaints about urllib.

Many of the issues I have run into are attributable to the age of urllib (I am 
pretty sure it predates the unicode type) and the obsolescence of the specs on 
which parts of it are based (it's essentially in RFC 1808 land, with a 
smattering of patches to bring aspects of it closer to RFC 2396). Other issues 
are matters of API entrenchment, either for the convenience of users (e.g. 
treating '/' and '\' as equivalent on Windows) or for compatibility with the 
APIs of other libraries & applications.

When I'm comfortable enough with 4Suite's Ft.Lib.Uri APIs I intend to formally 
propose incorporating updated implementations into Python core, perhaps 
distributed among urllib, urllib2, and urlparse or maybe in a new module, as 
appropriate. I'm not really ready to make such a proposal, though, as I still 
have some philosophical questions about str/unicode transparency in APIs (e.g. 
urllib.unquote, when given unicode, does not percent-decode characters above 
\u007f, and I'm wondering if that's ideal), and I am also unclear on what the 
policy is regarding using regular expressions in core Python modules -- it 
seems to be a no-no, but I don't know for sure... any comments on that 
particular matter would be appreciated.

Anyway, there's at least one part of Ft.Lib.Uri that I think could stand to be 
addressed more immediately: there is a bit of transformation that one must 
perform on a spec-conformant URI in order to get urllib.urlopen() to process 
it correctly. This should not be necessary, IMHO.

The main issues are:

1. urlopen() cannot reliably process unicode unless there are no
   percent-encoded octets above %7F and no characters above \u007f
   (I think that's the gist of it, at least).

I don't think this is necessarily a bug, as a proper URI will never contain 
non-ASCII characters. However, since urlopen()'s API is unfortunately such that 
it accepts OS-specific filesystem paths, which nowadays may be unicode, it may 
be time to tighten up the API and say that the url argument *must* be a URI, 
and that if unicode is given, it will be converted to str and thus must not 
contain non-ASCII characters.
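
To make that tightened contract concrete, here is a minimal sketch of the
check I have in mind. It is written against present-day Python (where text is
the default string type) rather than the urllib of this post, and the name
as_ascii_uri is made up for illustration; it is not an existing urllib function.

```python
def as_ascii_uri(url):
    """Return url as an ASCII-only text string, or raise ValueError.

    Hypothetical helper: accepts either bytes or text and rejects
    anything that contains non-ASCII characters.
    """
    try:
        if isinstance(url, bytes):
            return url.decode('ascii')
        url.encode('ascii')  # validation only; the result is discarded
        return url
    except UnicodeError:
        raise ValueError('not a URI: non-ASCII characters in %r' % (url,))
```

Under this rule, a caller holding an IRI or a filesystem path would have to
convert it to a proper URI before handing it over.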

2. urlopen() (the URI scheme-specific openers it uses, actually) does not
   percent-decode the host portion of a URL before doing a DNS lookup.

This wasn't really a problem until IDNs came along; no one was using non-ASCII 
in their hostnames. But now we have to deal with URLs where the host component
is a string of percent-encoded UTF-8 octets, like

    'http://www.%E3%81%BB%E3%82%93%E3%81%A8%E3%81%86%E3%81%AB%E3%81%AA%E3%81%8C%E3%81%84%E3%82%8F%E3%81%91%E3%81%AE%E3%82%8F%E3%81%8B%E3%82%89%E3%81%AA%E3%81%84%E3%81%A9%E3%82%81%E3%81%84%E3%82%93%E3%82%81%E3%81%84%E3%81%AE%E3%82%89%E3%81%B9%E3%82%8B%E3%81%BE%E3%81%A0%E3%81%AA%E3%81%8C%E3%81%8F%E3%81%97%E3%81%AA%E3%81%84%E3%81%A8%E3%81%9F%E3%82%8A%E3%81%AA%E3%81%84.w3.mag.keio.ac.jp/'

which are supposed to be decoded back to Unicode (in this case, it's a string of
Japanese characters) and then IDNA-encoded for the DNS lookup, so that it will
be interpreted as if it were the equally-unintelligible-but-DNS-friendly

    'http://www.xn--n8jaaaaai5bhf7as8fsfk3jnknefdde3fg11amb5gzdb4wi9bya3kc6lra.w3.mag.keio.ac.jp/'

Even though IDNs are the main application for percent-encoded octets in the
host component, percent-decoding is necessary in simpler cases as well, like

    'http://www.w%33.org'

which would need to be interpreted as

    'http://www.w3.org'

Python 2.3 introduced an IDNA codec, and both the socket and httplib modules 
were updated to accept unicode hostnames (e.g. the Japanese characters 
represented by, but not shown in, the examples above), automatically applying 
IDNA encoding prior to doing the DNS lookup.

urllib's urlopeners were *not* updated accordingly. This should be changed. 
The way I do it in Ft.Lib.Uri is to rewrite the hostname, regardless of its 
URI scheme (since once I pass it to urlopen it's out of my hands), to a
percent-decoded, IDNA-encoded version before passing it to urlopen. Ideally
it should be handled by each opener as necessary, I think.
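
The host rewrite described above can be sketched roughly as follows. This is
not the Ft.Lib.Uri code; it is a minimal illustration written against
present-day Python, where the splitting and unquoting functions live in
urllib.parse. The names idna_host and rewrite_host are hypothetical, and the
sketch assumes the percent-encoded octets are UTF-8.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def idna_host(host):
    """Percent-decode a host (octets taken as UTF-8), then IDNA-encode it."""
    decoded = unquote(host, encoding='utf-8', errors='strict')
    # The built-in 'idna' codec leaves all-ASCII labels untouched.
    return decoded.encode('idna').decode('ascii')

def rewrite_host(uri):
    """Return uri with its host percent-decoded and IDNA-encoded.

    Userinfo and port are passed through unchanged. IPv6 literals get
    no special handling in this sketch.
    """
    parts = urlsplit(uri)
    userinfo, at, hostport = parts.netloc.rpartition('@')
    host, colon, port = hostport.partition(':')
    if not host:
        return uri
    netloc = userinfo + at + idna_host(host) + colon + port
    return urlunsplit(parts._replace(netloc=netloc))
```

With this, rewrite_host('http://www.w%33.org/') gives 'http://www.w3.org/',
and the result is plain ASCII that can be handed to any opener regardless of
scheme.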

3. On Windows, urlopen() only recognizes '|' as a Windows drivespec character, 
   whereas ':' is just as, if not more, common in 'file' URIs.

file:///C:/Windows/notepad.exe is a perfectly valid 'file' URI and should not 
fail to be interpreted on Windows as C:\Windows\notepad.exe. Currently the 
only way to get it to work is to replace the ':' with '|', a convention that 
was established in the days of the Mosaic web browser, I believe, and that 
has remained widely supported, even though it is arbitrary & unnecessary.

I would prefer that all the APIs that expect '|' instead of ':' be updated to 
not consider '|' to be canon, but the simplest workaround for the sake of 
using ':'-containing URIs with urllib.urlopen() is just to do a simple string 
replacement in the path, e.g.

    import os
    # Swap only the first ':' (the drivespec) for the '|' urllib expects.
    if os.name == 'nt' and scheme == 'file':
        path = path.replace(':', '|', 1)

(assuming you've already got the path and scheme components of the given URI 
split out).
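
For completeness, here is a sketch that includes the splitting step as well.
It uses urlsplit/urlunsplit from present-day Python's urllib.parse. The
function name pipe_drivespec and its os_name parameter are made up for
illustration (the parameter only exists so the behavior can be tried on any
platform).

```python
import os
from urllib.parse import urlsplit, urlunsplit

def pipe_drivespec(uri, os_name=os.name):
    """On Windows, turn the first ':' in a 'file' URI's path into '|'.

    Hypothetical helper; any other URI is returned unchanged.
    """
    parts = urlsplit(uri)
    if os_name == 'nt' and parts.scheme == 'file':
        # Only the first ':' is the drivespec; later ones are left alone.
        parts = parts._replace(path=parts.path.replace(':', '|', 1))
    return urlunsplit(parts)
```

So 'file:///C:/Windows/notepad.exe' becomes 'file:///C|/Windows/notepad.exe'
when os_name is 'nt', and is returned as-is elsewhere.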

I would appreciate any comments that anyone has on the feasibility of
these suggestions.

Thanks,

Mike

P.S. If you're curious, the current version of Ft.Lib.Uri is at 

  http://cvs.4suite.org/cgi-bin/viewcvs.cgi/4Suite/Ft/Lib/Uri.py

and a test suite for it (which relies on a custom framework, not unittest,
but that should be fairly understandable anyway) is at

  http://cvs.4suite.org/cgi-bin/viewcvs.cgi/4Suite/test/Lib/test_uri.py

The function that I am currently using to massage a URI to make it safe for 
urllib.urlopen() is named MakeUrllibSafe. I wouldn't recommend it as-is, 
though, since it relies on other functions that deal with more convoluted 
unicode issues that I'm trying to avoid asking about in this post.