urllib.unquote and unicode

Thu Dec 21 15:29:49 EST 2006

>>> The way that uri encoding is supposed to work is that first the input
>>> string in unicode is encoded to UTF-8 and then each byte which is not in
>>> the permitted range for characters is encoded as % followed by two hex
>>> characters. 
>> Can you back up this claim ("is supposed to work") by reference to
>> a specification (ideally, chapter and verse)?
> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Thanks. Unfortunately, this isn't normative, but "we recommend". In
addition, it talks about URIs found HTML only. If somebody writes
a user agent written in Python, they are certainly free to follow
this recommendation - but I think this is a case where Python should
refuse the temptation to guess.

If somebody implemented IRIs, that would be an entirely different
matter.

Regards,
Martin