[Python-bugs-list] [Bug #116716] urllib.quote and Unicode

Thu, 12 Oct 2000 09:14:21 -0700

Bug #116716, was updated on 2000-Oct-12 08:58
Here is a current snapshot of the bug.

Project: Python
Category: Modules
Status: Closed
Resolution: None
Bug Group: Feature Request
Priority: 5
Summary: urllib.quote and Unicode

Details: Currently urllib.quote does not handle
Unicode strings. urllib should be able to handle those.
According to http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
what is required is:

1. Represent each character in UTF-8 (see [RFC2279]) as 
   one or more bytes. 
2. Escape these bytes with the URI escaping mechanism
   (i.e., by converting each byte to %HH, where HH is the
   hexadecimal notation of the byte value). 

urllib.quote already does step 2. For Unicode strings it
should do step 1 as well.
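
As a rough illustration, the two steps could be sketched in present-day Python 3 as below. This is not the change made to urllib: the name quote_unicode and the exact set of characters left unescaped are assumptions chosen for the example.

```python
def quote_unicode(text, safe="/"):
    """Percent-escape text using the two steps quoted above (hypothetical helper)."""
    # Assumed set of characters that never need escaping.
    always_safe = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "abcdefghijklmnopqrstuvwxyz"
                   "0123456789_.-")
    keep = set((always_safe + safe).encode("ascii"))
    out = []
    # Step 1: represent each character as one or more UTF-8 bytes.
    for byte in text.encode("utf-8"):
        if byte in keep:
            out.append(chr(byte))
        else:
            # Step 2: escape the byte as %HH (two hexadecimal digits).
            out.append("%%%02X" % byte)
    return "".join(out)
```

With this sketch, quote_unicode("\u20ac") yields "%E2%82%AC", since the euro sign occupies three bytes in UTF-8.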

This changes the meaning of urllib.quote slightly: an 8-bit
string would now be interpreted as being UTF-8 encoded. To
fix this, an 8-bit string should first be transcoded from the
default encoding to UTF-8, i.e. what should be inserted at the
beginning of quote is:
if type(s) == types.StringType:
    s = unicode(s, sys.getdefaultencoding())
s = s.encode("utf8")
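
To show why the transcoding matters, here is a small sketch in present-day Python 3, where 8-bit strings are bytes. The helper name to_utf8 and the source_encoding parameter (standing in for the default encoding, assumed here to be Latin-1) are hypothetical and only serve the example.

```python
def to_utf8(s, source_encoding="latin-1"):
    """Return s as UTF-8 bytes, transcoding 8-bit input first (hypothetical helper)."""
    if isinstance(s, bytes):
        # An 8-bit string is decoded from its assumed source encoding.
        s = s.decode(source_encoding)
    # Text is then encoded as UTF-8, ready for %HH escaping.
    return s.encode("utf-8")
```

Under that assumption, the single Latin-1 byte 0xE9 becomes the two UTF-8 bytes 0xC3 0xA9, so it would be escaped as %C3%A9 rather than %E9.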

Follow-Ups:

Date: 2000-Oct-12 09:13
By: gvanrossum

Comment:
Sure. Added to PEP-42.

I have a feeling that there are probably a lot of places in the standard library where decisions like this may have to be made...!

(Exercise for the reader: code this so that it works with JPython too...)
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=116716&group_id=5470