[issue3300] urllib.quote and unquote - Unicode issues
Jim Jewett
report at bugs.python.org
Sun Aug 10 02:35:12 CEST 2008
Jim Jewett <jimjjewett at users.sourceforge.net> added the comment:
Matt,
Bill's main concern is with a policy decision; I doubt he would object to
using your code once that is resolved.
The purpose of the quoting functions is to turn a string (representing the
human-readable version) into bytes (that go over the wire). If everything
is ASCII, there isn't any disagreement -- but it also isn't obvious that
the quoted result is bytes rather than characters. So people started (well,
continued, since the habit dates to pre-Unicode C) treating quoted URLs as
though they were strings.
The fact that ASCII (and therefore most wire protocols) looks the same as
bytes or as characters was one of the strongest arguments against splitting
the bytes and string types. Now that this has been done, Bill feels we
should be consistent. (You feel wire-protocol bytes should be treated as
strings, if only as bytestrings, because the libraries use them that way --
but this is a policy decision.)
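To illustrate why the distinction is easy to miss, here is a minimal sketch
using the Python 3 urllib.parse.quote, which accepts either str or bytes.
The sample value is made up for illustration; for pure ASCII input the
quoted result is the same either way.

```python
from urllib.parse import quote

# Hypothetical sample value: pure ASCII, with one character that needs quoting.
ascii_text = "a b/c"
ascii_bytes = ascii_text.encode("ascii")

# quote() accepts both str and bytes; for ASCII input the results agree,
# so nothing forces the caller to decide which of the two they have.
quoted_from_str = quote(ascii_text)
quoted_from_bytes = quote(ascii_bytes)

print(quoted_from_str)    # a%20b/c
print(quoted_from_bytes)  # a%20b/c
```

The two only diverge once non-ASCII characters appear, which is where the
policy question starts to matter.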
To quote the final paragraph of RFC 3986, section 1.2.1:
"""
In local or regional contexts and with improving technology, users
might benefit from being able to use a wider range of characters;
such use is not defined by this specification. Percent-encoded
octets (Section 2.1) may be used within a URI to represent characters
outside the range of the US-ASCII coded character set if this
representation is allowed by the scheme or by the protocol element in
which the URI is referenced. Such a definition should specify the
character encoding used to map those characters to octets prior to
being percent-encoded for the URI.
"""
So the mapping to bytes (or "octets") for non-ASCII characters isn't defined
(here), and if you want to use them, you need to specify a charset. But in
practice, people do use them without specifying a charset. Which charset
should be assumed? The old code (and test cases) assumed Latin-1. You want
to assume UTF-8 (though you took the document charset when available --
which might also make sense).
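A rough sketch of what is at stake, assuming the encoding parameter that
urllib.parse.quote and unquote take in current Python 3 (the old
urllib.quote has no such parameter). The sample string is made up; the
point is only that the same character yields different octets under each
charset, and that guessing wrong on the way back corrupts the text.

```python
from urllib.parse import quote, unquote

# Hypothetical sample: one non-ASCII character (e with acute accent, U+00E9).
text = "caf\u00e9"

# The same character maps to different octets depending on the charset.
as_latin1 = quote(text, encoding="latin-1")
as_utf8 = quote(text, encoding="utf-8")
print(as_latin1)  # caf%E9
print(as_utf8)    # caf%C3%A9

# Decoding with the charset that was used for quoting round-trips cleanly.
assert unquote(as_latin1, encoding="latin-1") == text
assert unquote(as_utf8, encoding="utf-8") == text

# Decoding with the other charset does not raise; it silently yields
# different text (UTF-8 octets read as Latin-1 become two characters).
misread = unquote(as_utf8, encoding="latin-1")
print(misread == text)  # False
```

Neither default is wrong by the RFC, since the RFC leaves the choice to the
scheme or protocol; the default only decides which callers get surprised.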
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3300>
_______________________________________