urllib.unquote and unicode

Tue Dec 19 04:08:59 EST 2006

"Leo Kislov" <Leo.Kislov at gmail.com> wrote:

> George Sakkis wrote:
>> The following snippet results in different outcome for (at least) the
>> last three major releases:
>>
>> >>> import urllib
>> >>> urllib.unquote(u'%94')
>>
>> # Python 2.3.4
>> u'%94'
>>
>> # Python 2.4.2
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position
>> 0: ordinal not in range(128)
>>
>> # Python 2.5
>> u'\x94'
>>
>> Is the current version the "right" one or is this function supposed
>> to change every other week ?
> 
> IMHO, none of the results is right. Either unicode string should be
> rejected by raising ValueError or it should be encoded with ascii
> encoding and result should be the same as
> urllib.unquote(u'%94'.encode('ascii')) that is '\x94'. You can
> consider current behaviour as undefined just like if you pass a random
> object into some function you can get different outcome in different
> python versions.

I agree with you that none of the results is right, but not that the 
behaviour should be undefined.

URI encoding is supposed to work in two steps: first the Unicode input
string is encoded to UTF-8, and then each byte that is not in the
permitted range of characters is written as % followed by two hex
digits.
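
As a minimal sketch of those two steps, written for Python 3 rather than
the Python 2 urllib discussed in this thread: percent_encode and
UNRESERVED are names made up for illustration, and the "permitted range"
is assumed to be the unreserved set from RFC 3986 section 2.3.

```python
# Unreserved characters per RFC 3986 section 2.3 (assumed to be the
# "permitted range"). frozenset() over bytes yields integer byte values.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(text):
    """Hypothetical helper: UTF-8 encode, then escape non-unreserved bytes."""
    out = []
    for byte in text.encode("utf-8"):      # step 1: Unicode -> UTF-8 octets
        if byte in UNRESERVED:
            out.append(chr(byte))          # permitted octet, kept as-is
        else:
            out.append("%%%02X" % byte)    # step 2: % plus two hex digits
    return "".join(out)


print(percent_encode("\x94"))    # the character from the original snippet
print(percent_encode("\u00c0"))  # LATIN CAPITAL LETTER A WITH GRAVE
```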

That means the string u'\x94' should be encoded as %C2%94. The string
%94 should generate a Unicode decode error, because a lone 0x94 byte is
not valid UTF-8, but it should be the utf-8 codec raising the error, not
the ascii codec.
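
For illustration only, this is roughly the behaviour I mean, sketched
with Python 3's urllib.parse (not the Python 2 urllib in question).
strict_unquote is a hypothetical wrapper; note that Python 3's unquote()
defaults to errors='replace', so strictness has to be requested.

```python
from urllib.parse import unquote, unquote_to_bytes


def strict_unquote(s):
    """Hypothetical wrapper: percent-decode, then decode as strict UTF-8."""
    return unquote(s, encoding="utf-8", errors="strict")


# U+0094 occupies two octets in UTF-8, so %C2%94 decodes cleanly.
decoded = strict_unquote("%C2%94")

# %94 alone is a single octet 0x94, a continuation byte with no lead byte.
raw = unquote_to_bytes("%94")

codec_name = None
try:
    strict_unquote("%94")
except UnicodeDecodeError as exc:
    codec_name = exc.encoding   # name of the codec that rejected the input

print(repr(decoded), repr(raw), codec_name)
```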

Unfortunately RFC 3986 isn't entirely clear-cut on this issue:

>    When a new URI scheme defines a component that represents textual
>    data consisting of characters from the Universal Character Set [UCS],
>    the data should first be encoded as octets according to the UTF-8
>    character encoding [STD63]; then only those octets that do not
>    correspond to characters in the unreserved set should be percent-
>    encoded.  For example, the character A would be represented as "A",
>    the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>    as "%C3%80", and the character KATAKANA LETTER A would be represented
>    as "%E3%82%A2".

I think it leaves open the possibility that existing URI schemes that do
not support Unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string, I think UTF-8
should definitely be assumed in this case.

Also, urllib.quote() should encode into utf-8 instead of throwing KeyError 
for a unicode string.
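
As a point of comparison (an assumption about a different API, not a
claim about Python 2): Python 3's urllib.parse.quote() encodes a text
string as UTF-8 by default, and a quick check against the RFC's examples
and the character from the original snippet looks like this:

```python
from urllib.parse import quote

# "A", LATIN CAPITAL LETTER A WITH GRAVE, KATAKANA LETTER A, and U+0094.
samples = ["A", "\u00c0", "\u30a2", "\x94"]

# quote() encodes str input as UTF-8 unless another encoding is given.
encoded = [quote(s) for s in samples]

for s, e in zip(samples, encoded):
    print(ascii(s), "->", e)
```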