[Web-SIG] WSGI 2

Tue Aug 4 02:38:54 CEST 2009

2009/8/4 Ian Bicking <ianb at colorstudy.com>:
> So... what about WSGI 2?  Let's not completely drop the ball on this.
> I *think* we were largely in agreement; debate got distracted by some
> async stuff, but I don't think we particularly have to deal with that
> for WSGI 2.  I think we do more than enough if we figure out: WSGI in
> Python 3, i.e., with unicode; some basic errata kind of stuff, like
> readline signature; change the callable signature to remove
> start_response.
>
> Would this be a new PEP or a revision?  I think it should be a new
> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
> 333 describes that.  Is there anyone willing to make the revisions?

But is the intention to skip straight to WSGI 2.0 for Python 3.0, with
start_response() being eliminated, or are we going to provide amended
WSGI 1.0 for Python 3.0? I can't see how we can avoid the latter and
so we should focus on that first rather than more fundamental changes
in WSGI 2.0.

In respect of WSGI 1.0 for Python 3.0, I have pretty well come to the
conclusion that where we were heading before on that in one area is
wrong. I was about to make changes to mod_wsgi in line with what I
believe should be done and just release it without consultation given
that I couldn't see any discussion reaching any conclusion about it
soon. Since you have sent this email I will try one last time to get a
resolution on WSGI 1.0 for Python 3.0. If I can't get one, I guess the
choices are to release the change anyway and provide an implementation
incompatible with what others are guessing should be done, or just rip
all the code out and not support Python 3.0 at all. Either seems
entirely reasonable since there is no WSGI 1.0 specification for
Python 3.0 and the issue again looks to be getting avoided by skipping
to a discussion on WSGI 2.0 instead.

So, for WSGI 1.0 style of interface and Python 3.0, the following is
what I was going to implement.

1. When running under Python 3, applications SHOULD produce bytes
output, status line and headers.

This is effectively what we had before. The only difference is that it
clarifies that the 'status line' value should also be bytes. This
wasn't noted before. I had already updated the proposed WSGI 1.0
amendments page to mention this.

2. When running under Python 3, servers and gateways MUST accept
strings for output, status line and headers. Such strings must be
converted to bytes using 'latin-1'. If a string cannot be converted,
this is treated as an error.

This is again what we had before, except that it mentions the 'status
line' value.
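
As a rough sketch of what points 1 and 2 would ask of a server, assuming
hypothetical helper names (to_wire_bytes and normalise_response are made
up for illustration and are not taken from mod_wsgi or any
specification), and assuming that "treated as an error" is surfaced as
an exception:

```python
def to_wire_bytes(value, what="value"):
    """Return value as bytes, encoding str with latin-1 (point 2)."""
    if isinstance(value, bytes):
        # Point 1: bytes are what applications should produce anyway.
        return value
    if isinstance(value, str):
        try:
            return value.encode("latin-1")
        except UnicodeEncodeError as exc:
            # Assumption: the error is reported by raising ValueError.
            raise ValueError(
                f"{what} cannot be encoded as latin-1: {value!r}"
            ) from exc
    raise TypeError(f"{what} must be bytes or str, not {type(value).__name__}")


def normalise_response(status, headers):
    """Convert a status line and a list of header pairs to bytes."""
    return (
        to_wire_bytes(status, "status line"),
        [
            (to_wire_bytes(name, "header name"),
             to_wire_bytes(val, "header value"))
            for name, val in headers
        ],
    )
```

Under this sketch, a status of "200 OK" and a header value of
"text/plain" pass straight through as bytes, while a string containing a
character outside latin-1 is rejected.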

3. When running under Python 3, servers MUST provide wsgi.input as a
binary (byte) input stream.

No change here.

4. When running under Python 3, servers MUST provide a text stream for
wsgi.errors. In converting this to a byte stream for writing to a
file, the default encoding would be applied.

No real change here except to clarify that the default encoding would
apply. Use of the default encoding could be problematic, though, when
combining different WSGI components. This is because each WSGI
component may have been developed on a system with a different default
encoding, and so one may expect to log characters that can't be written
on a different setup. Not sure how you could solve that except to
recommend that people set the default encoding to UTF-8 for
portability.

5. When running under Python 3, servers MUST provide CGI HTTP and
server variables as strings. Where such values are sourced from a byte
string, be that a Python byte string or C string, they should be
decoded as 'UTF-8'. If a specific web server infrastructure is able
to support different encodings, then the WSGI adapter MAY provide a
way for a user of the WSGI adapter to customise on a global basis, or
on a per value basis what encoding is used, but this is entirely
optional. Note that there is no requirement to deal with RFC 2047.
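
A minimal sketch of how a WSGI adapter might do this, assuming a
hypothetical function decode_cgi_vars with made-up parameter names for
the global default and the per value overrides. What to do when a value
is not valid in the chosen encoding is not settled above; this sketch
simply lets the decode error propagate:

```python
def decode_cgi_vars(raw_vars, default_encoding="utf-8", overrides=None):
    """Decode byte-string CGI/HTTP variables into str values.

    raw_vars maps variable names (str) to raw values (bytes).
    overrides optionally maps a variable name to a different encoding.
    """
    overrides = overrides or {}
    environ = {}
    for name, raw in raw_vars.items():
        encoding = overrides.get(name, default_encoding)
        # Assumption: strict decoding, so invalid input raises
        # UnicodeDecodeError rather than being silently replaced.
        environ[name] = raw.decode(encoding)
    return environ


raw = {
    "SCRIPT_NAME": b"/app",
    "PATH_INFO": "/caf\u00e9".encode("utf-8"),
    "HTTP_X_LEGACY": "caf\u00e9".encode("latin-1"),
}
environ = decode_cgi_vars(raw, overrides={"HTTP_X_LEGACY": "latin-1"})
```

Here HTTP_X_LEGACY is an invented header name used only to show a per
value override.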

This is where I am going to diverge from what has been discussed before.

The reason I am going to pass as UTF-8 and not latin-1 is that it
looks like Apache effectively only supports use of UTF-8. Since this
means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
even CGI likely cannot handle anything besides UTF-8 then I really
can't see the point of trying to cater for a theoretical possibility
that some HTTP client could use something besides UTF-8. In other
words, the predominant case will be UTF-8, so let us target that.

So, rather than burden every WSGI application with the need to convert
from latin-1 back to bytes and then to UTF-8, let the server deal with
it, with the server using a sensible default. Where the server
infrastructure can handle a different encoding, it can provide an
option to use that encoding and the WSGI application doesn't need to
change.
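
For comparison, this is roughly the round trip each application would
need under the earlier latin-1 approach. The helper name recover_utf8
is hypothetical:

```python
def recover_utf8(latin1_decoded):
    """Undo a latin-1 decode and reinterpret the bytes as UTF-8."""
    return latin1_decoded.encode("latin-1").decode("utf-8")


# Bytes as they arrive from the client, UTF-8 encoded.
raw_path = "/caf\u00e9".encode("utf-8")

# latin-1 approach: the server hands over mojibake and the
# application has to repair it on every request.
from_latin1_server = raw_path.decode("latin-1")
repaired = recover_utf8(from_latin1_server)

# UTF-8 approach: the server hands over the intended text directly.
from_utf8_server = raw_path.decode("utf-8")
```

Both routes end at the same string, but only the first needs extra code
in the application.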

Now, the reason why Apache can't really handle anything besides UTF-8
relates to how filenames are encoded in the file system.

Taking Windows first, as it is the more obvious case. What Apache does
there is take whatever path it has mapping to a script file, be it
constructed partially from what is in the Apache configuration and
partially from what was supplied in the URL from the client, and
convert it to UCS2 for passing to Windows file system routines. In
converting to UCS2, Apache assumes that the path is UTF-8. This means
that the Apache configuration file has to be UTF-8 and that the URL as
supplied by the client has to be UTF-8 as well, after any URL (percent)
encoding is decoded. End result: Apache can only handle UTF-8.
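
The effect of that assumption can be imitated in a few lines of Python.
This is only an illustration of the idea and not Apache's actual code.
The function name to_wide_path is hypothetical, and 'utf-16-le' is used
here as a stand-in for the wide character form passed to Windows:

```python
from urllib.parse import unquote_to_bytes


def to_wide_path(url_path):
    """Percent-decode a URL path, assume UTF-8, and re-encode it wide."""
    raw = unquote_to_bytes(url_path)
    # The UTF-8 assumption: anything else fails at this step.
    text = raw.decode("utf-8")
    return text.encode("utf-16-le")


wide = to_wide_path("/caf%C3%A9")  # UTF-8 encoded URL converts cleanly

try:
    to_wide_path("/caf%E9")  # latin-1 encoded URL does not
    latin1_url_failed = False
except UnicodeDecodeError:
    latin1_url_failed = True
```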

For UNIX systems, Apache doesn't do any conversion of the path, but
passes it directly to file system routines. On a Linux system
supporting UTF-8 file system paths, that path also needs to be UTF-8,
which again implies that the Apache configuration is UTF-8 and that the
client's decoded URL used in matching the resource is also UTF-8.
Again, by association of all the moving parts, they must all be UTF-8.

Now, what I am talking about here is the file system path constructed
from a file system location and some leading prefix of the URL, which
is used to match the script file. So for the URL, this is the
SCRIPT_NAME part, where it matches a file system resource such as a
script. Obviously there is going to be some amount of URL left over,
i.e., PATH_INFO and QUERY_STRING. I have already shown, though, that
the SCRIPT_NAME part has to be UTF-8, and we would really be entering
fantasy land if you were somehow going to cope with some different
encoding for PATH_INFO and QUERY_STRING. Instead it is like the GPL,
viral in nature. Use of UTF-8 in one particular area means you are
effectively bound to use UTF-8 everywhere else.

A further example of why UTF-8 reaches into everything is the
mod_rewrite module for Apache. This allows you to do stuff related to
the SCRIPT_NAME, PATH_INFO and QUERY_STRING parts of a URL. I have
already shown that the Apache configuration file has to be UTF-8. If
the URL isn't, then it wouldn't be possible to perform matches against
non latin-1 characters in a rewrite condition or rule. This is because
your match string would be in a different encoded form to that in the
URL and so wouldn't match.
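
The mismatch is easy to reproduce with Python's re module standing in
for a rewrite rule. This is an analogy only, not how mod_rewrite is
implemented. The Cyrillic letter and the KOI8-R encoding are arbitrary
choices for the example:

```python
import re
from urllib.parse import quote, unquote_to_bytes

# The match string, as it would sit in a UTF-8 configuration file.
rule = re.compile(re.escape("/\u0434".encode("utf-8")))

# The same path sent by two clients using different encodings.
utf8_url = quote("/\u0434/menu".encode("utf-8"))
koi8_url = quote("/\u0434/menu".encode("koi8_r"))

# The rule is compared against the percent-decoded bytes.
utf8_matches = rule.match(unquote_to_bytes(utf8_url)) is not None
koi8_matches = rule.match(unquote_to_bytes(koi8_url)) is not None
```

Only the UTF-8 request matches, even though both clients asked for the
same path as far as a human reader is concerned.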

Now this is all for Apache. Unless they do strange stuff, I would
expect that other web servers such as lighttpd, nginx and Cherokee
would also have this UTF-8 dependence all through them. This would
potentially leave only pure Python web servers that might be able to
handle doing stuff in some other encoding. But although that may
technically be possible, should it dictate what is done for everyone,
given that the number of people wanting to use a different encoding is
likely to be small or non existent? Especially when servers wanting to
handle different encodings could provide a configuration option to
allow it anyway and thus not burden the WSGI application.

In summary, it just seems more sane to have stuff in the WSGI
environment be dealt with as UTF-8.

So, can we please address this rather than being distracted by WSGI
2.0? The same issue is going to have to be dealt with for WSGI 2.0
anyway, but working it out now means that we can at least deliver a
WSGI 1.0 update for Python 3.0.

Graham