[Baypiggies] urllib.urlencode and encoding

Keith Dart ♂ keith at dartworks.biz
Thu Apr 19 02:08:13 CEST 2007


Tung Wai Yip wrote the following on 2007-04-18 at 16:51 PDT:
===
> urllib.urlencode() cannot encode a unicode string by itself. RFC 2396 does
> not take Unicode into consideration, so there is no rule on what to do with
> Unicode in a URI. It is up to the application to decide on the encoding,
> e.g. UTF-8 first, URL encoding next. Others might very well choose to use
> UTF-16 instead.

===

Nope, see RFC 3986, which obsoletes RFC 2396:


Network Working Group                                     T. Berners-Lee
Request for Comments: 3986                                       W3C/MIT
STD: 66                                                      R. Fielding
Updates: 1738                                               Day Software
Obsoletes: 2732, *2396*, 1808                                


Section 2.5:

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.  For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".
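
A minimal sketch of that two-step rule (UTF-8 first, then percent-encode
whatever is not unreserved). It assumes Python 3's urllib.parse, not the
Python 2 urllib.urlencode this thread is about, and the helper name
percent_encode_utf8 is made up for illustration:

```python
from urllib.parse import quote, urlencode


def percent_encode_utf8(text):
    """Hypothetical helper: apply the RFC 3986 section 2.5 recipe to text."""
    # Step 1: encode the characters as UTF-8 octets.
    octets = text.encode("utf-8")
    # Step 2: percent-encode the octets. safe="" means "/" is escaped too;
    # ASCII letters, digits and a few punctuation marks are left alone.
    return quote(octets, safe="")


# The three examples from the RFC text quoted above.
print(percent_encode_utf8("A"))        # A
print(percent_encode_utf8("\u00c0"))   # %C3%80  (LATIN CAPITAL LETTER A WITH GRAVE)
print(percent_encode_utf8("\u30a2"))   # %E3%82%A2  (KATAKANA LETTER A)

# In Python 3, urlencode() encodes str values as UTF-8 by default.
print(urlencode({"q": "\u00c0"}))      # q=%C3%80
```

With a Python 2 urllib, the equivalent would presumably be to call
.encode("utf-8") on each unicode value yourself before handing it to
urlencode().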



-- 
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Keith Dart <keith at dartworks.biz>
   public key: ID: 19017044
   <http://www.dartworks.biz/>
   =====================================================================


