urllib.unquote and unicode

Duncan Booth duncan.booth at invalid.invalid
Fri Dec 22 04:13:29 EST 2006


"Martin v. Löwis" <martin at v.loewis.de> wrote:

>>>> The way that uri encoding is supposed to work is that first the
>>>> input string in unicode is encoded to UTF-8 and then each byte
>>>> which is not in the permitted range for characters is encoded as %
>>>> followed by two hex characters. 
>>> Can you back up this claim ("is supposed to work") by reference to
>>> a specification (ideally, chapter and verse)?
>> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
> 
> Thanks.
And thanks from me too.

> Unfortunately, this isn't normative, but "we recommend". In
> addition, it talks about URIs found in HTML only. If somebody writes
> a user agent in Python, they are certainly free to follow
> this recommendation - but I think this is a case where Python should
> refuse the temptation to guess.

So you believe that because something is only recommended by a standard,
Python should refuse to implement it? This is the kind of thinking that
in the 1980s gave us a version of gcc where any attempt to use #pragma
(which according to the standard invokes implementation-defined
behaviour) would spawn a copy of nethack or rogue.

You don't seem to have realised yet, but my objection to the behaviour
of urllib.unquote is precisely that it does guess, and it guesses
wrongly: it guesses latin1 instead of utf8. If it threw an exception for
non-ascii values, then it would still match the standard (in the sense
of not following a recommendation, because it doesn't have to) and it
would be purely a quality-of-implementation issue.

If you don't believe me that it guesses latin1, try it. For all valid
URIs (i.e. ignoring those with non-ascii characters already in them),
the following holds in the current implementation, where u is a unicode
object:

    unquote(u)==unquote(u.encode('ascii')).decode('latin1')
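
For instance, in an interactive session (this is against Python 2.5's
urllib; %C3%A9 is the utf8 percent-encoding of e-acute, u'\xe9'):

    >>> from urllib import unquote
    >>> unquote(u'%C3%A9')
    u'\xc3\xa9'
    >>> unquote(u'%C3%A9'.encode('ascii')).decode('latin1')
    u'\xc3\xa9'

Both calls give the same unicode string: each escaped octet has become
the character with the same ordinal, which is precisely a latin1 decode.
A utf8 decode would have given u'\xe9'.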

I generally agree that Python should avoid guessing, so I wouldn't
really object if it threw an exception or always returned a byte string,
even though the html standard recommends using utf8 and the uri rfc
(RFC 3986) requires it for all new uri schemes. However, in this case I
think decoding as utf8 would be useful behaviour: e.g. a decent xml
parser is going to give me back the attributes, including any encoded
uris, as unicode strings. To handle those correctly with the current
implementation you must encode to ascii before unquoting. This is an
avoidable pitfall in the standard library.
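
For example (the wiki path here is invented for illustration):

    >>> from urllib import unquote
    >>> href = u'/wiki/%C3%9Cbersicht'   # as an xml parser hands it back
    >>> unquote(href.encode('ascii')).decode('utf8')
    u'/wiki/\xdcbersicht'
    >>> unquote(href)   # the naive call silently guesses latin1
    u'/wiki/\xc3\x9cbersicht'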

On second thoughts, perhaps the current behaviour is actually closer to:

    unquote(u)==unquote(u.encode('latin1')).decode('latin1')

as that also matches the current behaviour for uris which already
contain non-ascii characters, provided those characters have a latin1
encoding. To conform fully with the html standard's recommendation it
would instead have to be equivalent to:

    unquote(u)==unquote(u.encode('utf8')).decode('utf8')
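
A sketch of what that conforming behaviour could look like (unquote_utf8
is my own name for it, not anything in the stdlib):

    from urllib import unquote

    def unquote_utf8(u):
        # Encode the unicode uri to utf8 bytes, unquote the escapes at
        # the byte level, then decode the whole result as utf8.
        return unquote(u.encode('utf8')).decode('utf8')

    assert unquote_utf8(u'%C3%A9') == u'\xe9'
    assert unquote_utf8(u'caf\xe9') == u'caf\xe9'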

The catch with the current behaviour is that it doesn't exactly mimic
any sensible behaviour at all. It decodes the escaped octets as though
they were latin1 encoded, but it mixes them into a unicode string
alongside any characters that were never escaped, so there is no way to
correct its bad guess. In other words, the current behaviour is actively
harmful.
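
To see why, take a uri that mixes a raw non-ascii character with utf8
escapes (again a made-up example):

    >>> from urllib import unquote
    >>> result = unquote(u'/caf\xe9/%C3%A9')
    >>> result
    u'/caf\xe9/\xc3\xa9'
    >>> result.encode('latin1').decode('utf8')   # raises UnicodeDecodeError

The raw \xe9 should have been left alone while the escaped octets needed
a utf8 decode, but once both are joined into one unicode string nothing
distinguishes them, so no re-encode/decode pass can repair the result.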


