[Web-SIG] WSGI 2

Tue Aug 4 06:00:44 CEST 2009

At 10:38 AM 8/4/2009 +1000, Graham Dumpleton wrote:
>1. When running under Python 3, applications SHOULD produce bytes
>output, status line and headers.
>
>This is effectively what we had before. The only difference is that
>it clarifies that the 'status line' value should also be bytes. This
>wasn't noted before. I had already updated the proposed WSGI 1.0
>amendments page to mention this.

+1

>2. When running under Python 3, servers and gateways MUST accept
>strings for output, status line and headers. Such strings must be
>converted to bytes output using 'latin-1'. If a string cannot be
>converted, this is treated as an error.
>
>This is again what we had before, except that it mentions the 'status line' value.
>
>3. When running under Python 3, servers MUST provide wsgi.input as a
>binary (byte) input stream.
>
>No change here.
>
>4. When running under Python 3, servers MUST provide a text stream for
>wsgi.errors. In converting this to a byte stream for writing to a
>file, the default encoding would be applied.
>
>No real change here except to clarify that the default encoding would
>apply. Use of the default encoding could be problematic, though, when
>combining different WSGI components. This is because each WSGI
>component may have been developed on a system with a different default
>encoding, and so one may expect to log characters that can't be written
>on a different setup. Not sure how you could solve that except to say
>people should set the default encoding to UTF-8 for portability.

Also +1.
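
For concreteness, item 2 could be sketched roughly as below. The helper name to_wire_bytes is hypothetical and not part of any proposal; it only illustrates "accept bytes as-is, encode strings as latin-1, and fail otherwise" under Python 3.

```python
def to_wire_bytes(value):
    """Return value as bytes (hypothetical helper, not from the spec)."""
    if isinstance(value, bytes):
        # Item 1: bytes are what applications SHOULD produce; pass through.
        return value
    if isinstance(value, str):
        # Item 2: strings are converted using latin-1. Any code point
        # above U+00FF raises UnicodeEncodeError, i.e. "treated as an error".
        return value.encode('latin-1')
    raise TypeError('expected bytes or str, got %s' % type(value).__name__)


status = to_wire_bytes('200 OK')
header_value = to_wire_bytes('caf\u00e9')
```

A string such as '\u20ac' (the euro sign) has no latin-1 encoding, so the same call raises rather than silently producing output.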

>5. When running under Python 3, servers MUST provide CGI HTTP and
>server variables as strings. Where such values are sourced from a byte
>string, be that a Python byte string or C string, they should be
>converted as 'UTF-8'. If a specific web server infrastructure is able
>to support different encodings, then the WSGI adapter MAY provide a
>way for a user of the WSGI adapter to customise on a global basis, or
>on a per value basis what encoding is used, but this is entirely
>optional. Note that there is no requirement to deal with RFC 2047.
>
>This is where I am going to diverge from what has been discussed before.
>
>The reason I am going to pass as UTF-8 and not latin-1 is that it
>looks like Apache effectively only supports use of UTF-8. Since this
>means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
>even CGI likely cannot handle anything besides UTF-8, then I really
>can't see the point of trying to cater for a theoretical possibility
>that some HTTP client could use something besides UTF-8. In other
>words, the predominant case will be UTF-8, so let us target that.
>
>So, rather than burden every WSGI application with the need to convert
>from latin-1 back to bytes and then to UTF-8, let the server deal with
>it, with the server using a sensible default. Where the server
>infrastructure can handle a different encoding, it can provide an
>option to use that encoding, and the WSGI application doesn't need to
>change.
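
To make the "burden" concrete: under the earlier latin-1 approach, an application wanting UTF-8 text would have to do something like the following for each value. This is only a sketch; the function name latin1_to_utf8 is made up for illustration.

```python
def latin1_to_utf8(environ_value):
    """Undo a server-side latin-1 decode, then decode as UTF-8.

    Hypothetical application-side helper. Raises UnicodeDecodeError
    if the underlying bytes are not valid UTF-8.
    """
    return environ_value.encode('latin-1').decode('utf-8')


# Bytes as a client might send them (UTF-8 encoded "cafe" with an accent).
raw = 'caf\u00e9'.encode('utf-8')
# What the application would see if the server decoded as latin-1.
environ_value = raw.decode('latin-1')
# What the application actually wants.
recovered = latin1_to_utf8(environ_value)
```

The encode step is lossless because latin-1 maps every byte to exactly one code point, so the original bytes can always be recovered.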

Maybe I'm missing something here, but what if Apache receives 
something encoded in Latin-1?  AFAIR, form POST encoding is 
determined by the encoding of the page containing the form; that's of 
course something that only happens in the input body, but what about URLs?

Mainly I'm wondering, what should the server do in the event it 
receives a byte string which is not valid UTF-8?  (Latin-1 doesn't 
have this problem, since there's no such thing as an invalid Latin-1 
string, at least not at the encoding level.)
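
A quick Python 3 illustration of the difference, using a made-up byte string and a throwaway helper (try_decode is just a name I picked):

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE1 is "a with acute accent" in latin-1, but in UTF-8 it is the lead
# byte of a three-byte sequence, so it is invalid when followed by "z".
raw = b'b\xe1z'
as_latin1 = try_decode(raw, 'latin-1')
as_utf8 = try_decode(raw, 'utf-8')

# Every possible byte value decodes under latin-1.
every_byte = try_decode(bytes(range(256)), 'latin-1')
```

Here as_latin1 is a three-character string, while as_utf8 is None because the strict UTF-8 decode fails.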

>Also shown, though, is that the SCRIPT_NAME part has to be UTF-8
>and we would really be entering fantasy land if you were somehow going
>to cope with some different encoding for PATH_INFO and QUERY_STRING.
>Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>particular area means you are effectively bound to use UTF-8
>everywhere else.

I'm not clear on your logic here.  If I request foo/bar/baz (where 
baz actually has an accent over the 'a') in latin-1 encoding, and 
foo/bar is the script, then the (accented) baz is legitimate for 
pass-through to the application, no?

I just tried testing this with Firefox and Apache, and found that you 
can in fact pass such Latin-1 strings through to PATH_INFO, but at 
least in the case of Firefox, you have to %-escape them.  However, 
they are seen by Python (via os.environ) as latin-1 encoded byte strings.
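
The same thing can be reproduced without a browser or Apache. The sketch below uses the standard library's urllib.parse (Python 3) and the example path from above; it is an approximation of what I observed, not a description of what Apache does internally.

```python
from urllib.parse import quote, unquote_to_bytes

path = '/foo/bar/b\u00e1z'  # "baz" with an accent over the "a"

# The same path %-escaped under two different encodings.
escaped_latin1 = quote(path, encoding='latin-1')
escaped_utf8 = quote(path, encoding='utf-8')

# Unescaping the latin-1 form yields raw bytes that are not valid UTF-8.
path_info_bytes = unquote_to_bytes(escaped_latin1)
```

The latin-1 form contains a single %E1 escape, the UTF-8 form contains %C3%A1, and nothing in the URL itself says which one the client meant.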

>A further example of why UTF-8 reaches into everything is the mod_rewrite
>module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>PATH_INFO and QUERY_STRING parts of a URL. Already shown is that the
>Apache configuration file has to be UTF-8. If the URL isn't, then it
>wouldn't be possible to perform matches against non latin-1 characters
>in a rewrite condition or rule. This is because your match string would
>be in a different encoded form to that in the URL and so wouldn't match.

Note that this still doesn't have any impact on the bytes that 
actually reach the application, which can be non-UTF-8.  At minimum, 
the proposal is underspecified as to how to handle this case, which 
is as trivial to generate as sticking a %-escape in the PATH_INFO or 
QUERY_STRING portion(s) of a URL.