Internationalized domain names not working with URLopen

John Nagle nagle at animats.com
Thu Jun 14 00:53:03 EDT 2012


On 6/12/2012 11:42 PM, Andrew Berg wrote:
> On 6/13/2012 1:17 AM, John Nagle wrote:
>> What does "urllib2" want?  Percent escapes?  Punycode?
> Looks like Punycode is the correct answer:
> https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode
>
> I haven't tried it, though.

    This is Python bug #9679:

     http://bugs.python.org/issue9679

It's been open for years, and the maintainers offer elaborate
excuses for not fixing the problem.

The socket module accepts Unicode domains, as does httplib.
But urllib2, which is a front end to both, is still broken.
It's failing when it constructs the HTTP headers.  Domains
in HTTP headers have to be in punycode.
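
For illustration, here is a minimal sketch of what "in punycode" means for a host name. It is written for Python 3, which is an assumption on my part since the code in this thread is Python 2, and the function name host_to_ascii is made up for the example. It relies on the standard "idna" codec, which implements the older IDNA 2003 rules.

```python
# Sketch only: Python 3 syntax, standard-library "idna" codec (IDNA 2003).
# host_to_ascii is a hypothetical helper name, not part of any library.

def host_to_ascii(host):
    """Return the ASCII (punycode) form of a host name as a str."""
    # The codec splits on dots and converts each non-ASCII label
    # to its "xn--" form; plain ASCII labels pass through unchanged.
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(host_to_ascii("bücher.example"))   # xn--bcher-kva.example
    print(host_to_ascii("www.python.org"))   # www.python.org
```

The resulting string is what would have to appear in the Host header.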

The code on Stack Overflow doesn't really work right.  Only
the domain part of a URL should be converted to punycode.
Path and query parameters need to be converted to
percent-encoding; the port is plain ASCII digits and can be
left alone.  (Unclear if urllib2 or httplib does this
already.  The documentation doesn't say.)
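
The split described above could be sketched roughly as follows. This is Python 3 (an assumption, the thread itself is about Python 2's urllib2), the name iri_to_ascii_url is hypothetical, and the choice of "safe" characters is mine rather than anything the urllib documentation prescribes. It ignores user:password@ prefixes and IPv6 literals, and it lowercases the host.

```python
# Sketch only: host goes to punycode, path/query/fragment go to
# percent-encoded UTF-8.  iri_to_ascii_url is a hypothetical name.
from urllib.parse import urlsplit, urlunsplit, quote


def iri_to_ascii_url(url):
    """Return an all-ASCII version of a URL with a Unicode host or path."""
    parts = urlsplit(url)

    # Host: IDNA ("punycode") encoding, applied to the host name only.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""

    # Port: plain ASCII digits, appended unchanged.
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Everything after the host: percent-encode non-ASCII as UTF-8.
    # "%" is kept safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_ascii_url("http://bücher.example:8080/straße?q=ü"))
```

A URL that is already pure ASCII comes back unchanged, apart from the host being lowercased.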

While HTTP content can be in various character sets, the
headers are currently required to be ASCII only, since the
headers have to be parsed before the character set of the
content is known.
(http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html)

Here's a workaround, for the domain part only.


#
#   idnaurlworkaround  --  workaround for Python defect 9679
#
import encodings.idna   # ToASCII, the per-label IDNA conversion
import urlparse         # Python 2 URL parsing

PYTHONDEFECT9679FIXED = False # Python defect #9679 - change when fixed

def idnaurlworkaround(url) :
     """
     Convert a URL to a form the currently broken urllib2 will accept.
     Converts the domain to "punycode" if necessary.
     This is a workaround for Python defect #9679.
     """
     if PYTHONDEFECT9679FIXED :  # if defect fixed
         return(url)   # use unmodified URL
     url = unicode(url)  # force to Unicode
     (scheme, accesshost, path, params,
     query, fragment) = urlparse.urlparse(url)    # parse URL
     if scheme == '' and accesshost == '' and path != '' : # bare domain
         accesshost = path # use path as access host
         path = '' # no path
     labels = accesshost.split('.') # split domain into sections ("labels")
     # convert each label to punycode if necessary; ToASCII rejects empty
     # labels (trailing dot, or no host at all), so pass those through
     labels = [encodings.idna.ToASCII(w) if w else w for w in labels]
     accesshost = '.'.join(labels) # reassemble domain
     url = urlparse.urlunparse((scheme, accesshost, path, params, query,
         fragment))  # reassemble url
     return(url) # return complete URL with punycode domain

                 John Nagle


