[Python-Dev] urllib.quote and unquote - Unicode issues

Thu Jul 31 06:01:56 CEST 2008

On Wed, Jul 30, 2008 at 8:49 PM, Matt Giuca <matt.giuca at gmail.com> wrote:
>
>> Con: URI encoding does not encode characters.
>
> OK, for all the people who say URI encoding does not encode characters: yes
> it does. This is not an encoding for binary data, it's an encoding for
> character data, but it's unspecified how the strings map to octets before
> being percent-encoded. From RFC 3986, section 1.2.1:
>
>> Percent-encoded octets (Section 2.1) may be used within a URI to represent
>> characters outside the range of the US-ASCII coded character set if this
>> representation is allowed by the scheme or by the protocol element in which
>> the URI is referenced.  Such a definition should specify the character
>> encoding used to map those characters to octets prior to being
>> percent-encoded for the URI.
>
> So the string->string proposal is actually correct behaviour. I'm all in
> favour of a bytes->string version as well, just not with the names "quote"
> and "unquote".
>
> I'll prepare a new patch shortly which has bytes->string and string->bytes
> versions of the functions as well. (quote will accept either type, while
> unquote will output a str, there will be a new function unquote_to_bytes
> which outputs a bytes - is everyone happy with that?)

I'd rather have two pairs of functions, so that those who want to give
the readers of their code a clue can do so. I'm not opposed to having
redundant functions that accept either string or bytes though, unless
others prefer not to.
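
For illustration, here is a minimal sketch of what two pairs of functions
could look like to a reader of the calling code. It uses the names that
urllib.parse carries in later Python 3 releases; quote_from_bytes in
particular is not named in this thread, so treat it as an assumption
relative to the patch under discussion.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

text = "caf\u00e9"          # a str with one non-ASCII character
raw = text.encode("utf-8")  # the same data as UTF-8 octets

# Quoting pair: the function name tells the reader which input type is meant.
quoted_from_str = quote(text)              # str -> str, UTF-8 by default
quoted_from_bytes = quote_from_bytes(raw)  # bytes -> str

# Unquoting pair: the function name tells the reader which output type to expect.
unquoted_str = unquote(quoted_from_str)             # str -> str
unquoted_bytes = unquote_to_bytes(quoted_from_str)  # str -> bytes
```

Both quoting calls yield the same percent-encoded text here because the
bytes were produced with UTF-8, which is also the default for quote.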

> Guido says:
>>
>> Actually, we'd need to look at the various other APIs in Py3k before we
>> can decide whether these should be considered taking or returning bytes or
>> text. It looks like all other APIs in the Py3k version of urllib treat URLs
>> as text.
>
> Yes, as I said in the bug tracker, I've groveled over the entire stdlib to
> see how my patch affects the behaviour of dependent code. Aside from a few
> minor bits which assumed octets and did their own encoding/decoding (I
> fixed those), all the code assumes strings and is very happy to go on
> assuming this, as long as the URIs are encoded with UTF-8, which they
> almost certainly are.

Sorry, I have yet to look at the tracker (only so many minutes in a day...).

> Guido says:
>>
>> I think the only change is to remove the encoding arguments and ...
>
> You really want me to remove the encoding= named argument? And hard-code
> UTF-8 into these functions? It seems like we may as well have the optional
> encoding argument, as it does no harm and could be of significant benefit.
> I'll post a patch with the unquote_to_bytes function, but leave the encoding
> arguments in until this point is clarified.

I don't mind an encoding argument, as long as it isn't used to change
the return type (as Bill was proposing).
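
A minimal sketch of that constraint, assuming the encoding= keyword as it
exists in later Python 3 releases of urllib.parse (the patch discussed
here may have differed in detail): the argument only selects how
characters map to octets, and the result is a str either way.

```python
from urllib.parse import quote, unquote

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_quoted = quote(text)                        # default encoding is UTF-8
latin1_quoted = quote(text, encoding="latin-1")  # same character, one octet

# Decoding with the matching encoding recovers the original character.
roundtrip = unquote(latin1_quoted, encoding="latin-1")

# The encoding argument changes the octets, never the return type.
return_types = {type(utf8_quoted), type(latin1_quoted), type(roundtrip)}
```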

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)