[Baypiggies] urllib.urlencode and encoding

David Reid dreid at dreid.org
Thu Apr 19 06:15:34 CEST 2007


Hi folks,


On Apr 18, 2007, at 5:08 PM, Keith Dart ♂ wrote:
> >    When a new URI scheme defines a component that represents textual
> >    data consisting of characters from the Universal Character Set [UCS],
> >    the data should first be encoded as octets according to the UTF-8
> >    character encoding [STD63]; then only those octets that do not
> >    correspond to characters in the unreserved set should be percent-
> >    encoded.  For example, the character A would be represented as "A",
> >    the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
> >    as "%C3%80", and the character KATAKANA LETTER A would be represented
> >    as "%E3%82%A2".
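
For reference, the two steps the RFC describes (encode to UTF-8 octets,
then percent-encode anything outside the unreserved set) can be sketched
roughly as follows. This assumes Python 3, and the helper name
rfc3986_encode is made up for illustration; it is not part of urllib.

```python
# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def rfc3986_encode(text):
    """Hypothetical helper: UTF-8 encode, then percent-encode octets."""
    octets = text.encode("utf-8")  # step 1: characters -> UTF-8 octets
    # step 2: keep unreserved octets as-is, percent-encode the rest
    return "".join(
        chr(b) if b in UNRESERVED else "%%%02X" % b for b in octets
    )


print(rfc3986_encode("A"))       # A
print(rfc3986_encode("\u00C0"))  # %C3%80  (LATIN CAPITAL LETTER A WITH GRAVE)
print(rfc3986_encode("\u30A2"))  # %E3%82%A2  (KATAKANA LETTER A)
```

The three printed values match the examples in the quoted RFC text.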

The key phrase here is "When a new URI scheme": the RFC (AFAICT) makes
no mention of what to do about old schemes, such as HTTP.  In fact the
HTML4 spec makes its own claims about how form data is %-encoded on
submission:

http://www.w3.org/TR/html4/interact/forms.html

     accept-charset = charset list [CI]
         This attribute specifies the list of character encodings for
         input data that is accepted by the server processing this
         form. The value is a space- and/or comma-delimited list of
         charset values. The client must interpret this list as an
         exclusive-or list, i.e., the server is able to accept any
         single character encoding per entity received.

         The default value for this attribute is the reserved string
         "UNKNOWN". User agents may interpret this value as the
         character encoding that was used to transmit the document
         containing this FORM element.

So I think it's still incorrect for urllib to assume that the data is
UTF-8. (Though I hope it won't be in the future.)
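
To illustrate why the charset matters, here is a rough sketch using the
Python 3 urllib.parse API (which postdates this thread and takes an
explicit encoding argument). The field name "name" and the choice of
ISO-8859-1 as the page's charset are assumptions for the example only.

```python
from urllib.parse import unquote, urlencode

# One form field holding LATIN CAPITAL LETTER A WITH GRAVE.
fields = {"name": "\u00C0"}

# The same character yields different octets, and so different
# %-escapes, depending on which charset the form was submitted in.
as_utf8 = urlencode(fields, encoding="utf-8")
as_latin1 = urlencode(fields, encoding="iso-8859-1")
print(as_utf8)    # name=%C3%80
print(as_latin1)  # name=%C0

# A receiver that guesses the wrong charset gets garbage back.
latin1_read_as_utf8 = unquote("%C0", encoding="utf-8")
utf8_read_as_latin1 = unquote("%C3%80", encoding="iso-8859-1")
print(repr(latin1_read_as_utf8))  # '\ufffd' (replacement character)
print(repr(utf8_read_as_latin1))  # '\xc3\x80' (mojibake, not U+00C0)
```

Neither escaped form says which charset produced it, so the encoding has
to come from somewhere outside the URL itself.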

-David
http://dreid.org

"Usually the protocol is this: I appoint someone for a task,
which they are not qualified to do.  Then, they have to fight
a bear if they don't want to do it." -- Glyph Lefkowitz



