[issue3300] urllib.quote and unquote - Unicode issues

Tue Aug 12 19:38:40 CEST 2008

Bill Janssen <bill.janssen at gmail.com> added the comment:

Larry Masinter is off on vacation, but I did get a brief message saying
that, when he gets back, he will dig up similar discussions that he was
involved in.

Out of curiosity, I sent a note off to the www-international mailing
list, and received this:

``For the authority (server name) portion of a URI, RFC 3986 is pretty
clear that UTF-8 must be used for non-ASCII values (assuming, for a
moment, that IDNA addresses are not Punycode encoded already). For the
path portion of URIs, a large-ish proportion of them are, indeed, UTF-8
encoded because that has been the de facto standard in Web browsers for
a number of years now. For the query and fragment parts, however, the
encoding is determined by context and often depends on the encoding of
some page that contains the form from which the data is taken. Thus, a
large number of URIs contain non-UTF-8 percent-encoded octets.''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0041.html
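
To make the point about context-dependent encodings concrete, here is a
minimal sketch. It assumes Python 3's urllib.parse, whose quote() and
unquote() accept an `encoding` argument; that is not necessarily the API
under discussion in this issue. The sample string and variable names are
made up for illustration.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Illustrative sample value: "cafe" with an e-acute (U+00E9).
text = "caf\u00e9"

# The same character percent-encodes differently depending on the
# character encoding chosen before escaping.
utf8_form = quote(text, encoding="utf-8")      # two octets: %C3%A9
latin1_form = quote(text, encoding="latin-1")  # one octet:  %E9

# Decoding only round-trips when the decoder uses the same encoding.
# With the default (UTF-8, errors="replace"), the Latin-1 octet is
# invalid and becomes U+FFFD.
decoded_default = unquote(latin1_form)
decoded_latin1 = unquote(latin1_form, encoding="latin-1")

# When the encoding is unknown, the raw octets can be recovered instead.
raw_octets = unquote_to_bytes(latin1_form)

print(utf8_form, latin1_form)
print(repr(decoded_default), repr(decoded_latin1), raw_octets)
```

A caller that does not know which encoding produced a query string can
fall back to the bytes form and decide how to decode later.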

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3300>
_______________________________________