[Web-SIG] WSGI 2

Henry Precheur henry at precheur.org
Fri Aug 14 07:36:28 CEST 2009


On Wed, Aug 12, 2009 at 12:05:40AM -0500, Ian Bicking wrote:
> Correct -- you can write any set of % encodings, and I don't think it even
> has to be able to validly url-decode (e.g., /foo%zzz will work).  It
> definitely doesn't have to be a valid encoding.  However, if you actually
> include unicode characters, they will always be encoded as UTF-8 (as goes
> with the IRI standard).  This is in a case like <a href="/some page">, the
> browser will request /some%20page, because it escapes unsafe characters.
>  Similarly if you request <a href="/français"> it will encode that ç in
> UTF-8, then url-encode it, even if the page itself is ISO-8859-1.  Well, at
> least on Firefox.  I used this to test:
> http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py
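
The escaping described above can be sketched in Python 3. This is only
an approximation: it assumes urllib.parse.quote() behaves like the
browser for these two paths, which is my assumption and not something
the test script verifies. The variable names are made up for the
example.

```python
from urllib.parse import quote

# quote() encodes non-ASCII characters as UTF-8 by default,
# then percent-escapes the resulting bytes ('/' is left alone).
space_path = quote("/some page")
utf8_path = quote("/français")

# Forcing latin-1 yields a different escape for the same character.
latin1_path = quote("/français", encoding="latin-1")

print(space_path)   # /some%20page
print(utf8_path)    # /fran%C3%A7ais
print(latin1_path)  # /fran%E7ais
```

The last line is the form a server would see if a client escaped the
path from a latin-1 page without converting to UTF-8 first.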

I have run some tests regarding the encoding issue:

curl doesn't 'url-encode' its URLs:

  curl 'http://hostname/français'
                            ^
                         <e7> latin-1 character

The latin-1 character is sent to the server. Lighttpd accepts the URL
and even returns a file if it exists. Of course, if I try with the same
characters in UTF-8, it doesn't work.
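
To make the difference concrete, here is a small Python 3 sketch of the
raw bytes involved in the test above. is_valid_utf8() is a hypothetical
helper written for this example; it is not part of any server or of
WSGI.

```python
# The same path as raw bytes in each encoding.
raw_latin1 = "/français".encode("latin-1")  # single byte 0xE7 for the cedilla
raw_utf8 = "/français".encode("utf-8")      # two bytes 0xC3 0xA7

def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(raw_latin1)                 # b'/fran\xe7ais'
print(raw_utf8)                   # b'/fran\xc3\xa7ais'
print(is_valid_utf8(raw_latin1))  # False
print(is_valid_utf8(raw_utf8))    # True
```

So a server that receives the unescaped latin-1 byte cannot treat the
path as UTF-8 without hitting a decoding error.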

AFAIK RFC 2396 forbids non-ASCII characters in URLs. The problem is that
libcurl is quite popular (it used to be the transport library of
Webkit/GTK+, for example). It's hard to dismiss it as an utterly broken
& obscure tool. Many 'simplistic' HTTP clients may have the same problem.


Now let's talk a little bit about cookies...

Cookies can contain whatever 'binary junk' the server sends. RFC 2965
says (http://tools.ietf.org/html/rfc2965#page-5):

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Also, cookies can contain 'comments', which contain UTF-8 strings
(http://tools.ietf.org/html/rfc2965#page-6):

> Characters in value MUST be in UTF-8 encoding.

Firefox has no problem with cookies containing non-ASCII characters. It
looks like it assumes cookies are encoded using latin-1, since latin-1
characters are displayed correctly in Firebug, but not UTF-8 ones.
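
A short Python 3 sketch of what that would look like, assuming the
display side really does decode cookie bytes as latin-1 (that is my
guess from the observation above, not something I have confirmed). The
variable names are only for illustration.

```python
value = "ç"

utf8_bytes = value.encode("utf-8")      # b'\xc3\xa7'
latin1_bytes = value.encode("latin-1")  # b'\xe7'

# Decoding both as latin-1, as the display is assumed to do:
shown_latin1 = latin1_bytes.decode("latin-1")  # 'ç', displayed correctly
shown_utf8 = utf8_bytes.decode("latin-1")      # 'Ã§', two wrong characters
```

Decoding as latin-1 never raises an error, because every byte maps to
some character, so the UTF-8 case is garbled silently.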


Cheers,

-- 
  Henry Prêcheur

