[Web-SIG] Getting back to WSGI grass roots.

Graham Dumpleton graham.dumpleton at gmail.com
Wed Sep 23 06:43:27 CEST 2009


Sorry, after having had a bit of a think while eating lunch, I am going
to throw up another point of view on this whole issue. So, sit back
and be just a little bit concerned.

WSGI stands for Web Server GATEWAY Interface.

My understanding is that right back at the beginning WSGI was purely
intended to only be used at the direct interface with the underlying
web server. This is why I understand, in part at least, the term
'gateway' is used in the acronym.

The problem was that people discovered one could apply the same
interface for use as middleware. As we all know, that has been used
quite successfully, but has also been equally abused.

With that in mind, maybe we should start instead to look more at WSGI
being a series of layers.

Yes people have talked about standardised request/response objects,
but I am not thinking at that high of a level.

What I am going to suggest is that there perhaps should still be a
clear line between bytes and unicode.

So, rather than throw away completely the idea of bytes everywhere,
and rewrite the WSGI specification, we could instead say that the
existing conceptual idea of WSGI 1.0 is still valid, and just build on
top of it a translation interface to present that as unicode.

We might still want to respecify WSGI as it is now as per the
bytes/unicode/native definitions I explained in my blog post at:

  http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

I'd suggest this would possibly be the same or quite similar to my
original definition #2 in the blog post. To save you having to go back
to the blog post, I include it here again.

1. The application is passed an instance of a Python dictionary
containing what is referred to as the WSGI environment. All keys in
this dictionary are native strings. For CGI variables, all names are
going to be ISO-8859-1 and so where native strings are unicode
strings, that encoding is used for the names of CGI variables.

2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
environment, the value of the variable should be a native string.

3. For the CGI variables contained in the WSGI environment, the values
of the variables are byte strings.

4. The WSGI input stream 'wsgi.input' contained in the WSGI
environment and from which request content is read, should yield byte
strings.

5. The status line specified by the WSGI application must be a byte string.

6. The list of response headers specified by the WSGI application must
contain tuples consisting of two values, where each value is a byte
string.

7. The iterable returned by the application and from which response
content is derived, must yield byte strings.
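
As a rough illustration of the seven points above, here is a minimal
sketch of an application written directly against such a bytes-level
gateway interface on Python 3.X. It follows only the definition given
in this post; the names bytes_app and call_app are made up for the
example, and call_app is merely a stand-in for a real gateway.

```python
import io


def bytes_app(environ, start_response):
    # Keys are native strings; CGI variable values are byte strings (point 3).
    path = environ.get('PATH_INFO', b'/')
    body = b'You requested ' + path + b'\n'
    # Status line and header tuples are byte strings (points 5 and 6).
    start_response(b'200 OK', [
        (b'Content-Type', b'text/plain'),
        (b'Content-Length', str(len(body)).encode('ascii')),
    ])
    # The returned iterable yields byte strings (point 7).
    return [body]


def call_app(app, environ):
    """Hypothetical driver standing in for the gateway, for experiments only."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(app(environ, start_response))
    return captured['status'], captured['headers'], body


if __name__ == '__main__':
    environ = {
        'wsgi.url_scheme': 'http',        # native string (point 2)
        'wsgi.input': io.BytesIO(b''),    # yields byte strings (point 4)
        'REQUEST_METHOD': b'GET',
        'PATH_INFO': b'/hello',
    }
    print(call_app(bytes_app, environ))
```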

By seeing WSGI as being layers instead, first thing is that web
frameworks such as web2py and CherryPy which merely use WSGI as the
gateway interface would continue to work directly on this layer,
regardless of whether they use Python 2.X or 3.X. Those frameworks are
already going to translate whatever this interface defines into their
own internal interface and effectively relegate WSGI from any higher
levels of the application.

We now get back to the unicode vs bytes argument we have been having.
This argument will not vanish by virtue of doing this, but instead of
pushing the unicode translation down into the gateway interface layer,
we just apply it on top.

There is possibly not even a need for the gateway interface layer to
implement the unicode translation layer; this may instead be a
documented standard convention that any web application which mounts
directly on the gateway interface layer should implement.
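
To make the idea of a translation layer sitting on top of the bytes
interface a little more concrete, here is one possible sketch. It
assumes ISO-8859-1 as the decoding for CGI values and assumes the
response body stays as bytes; neither choice is settled by this post,
and the names unicode_layer and hello are invented for illustration.

```python
def unicode_layer(unicode_app, encoding='iso-8859-1'):
    """Wrap a unicode-level application so it can mount on the bytes layer.

    The encoding default is an assumption made for this sketch.
    """

    def bytes_level_app(environ, start_response):
        # Decode every byte-string value (the CGI variables); leave other
        # entries such as 'wsgi.input' and 'wsgi.url_scheme' untouched.
        text_environ = {
            key: value.decode(encoding) if isinstance(value, bytes) else value
            for key, value in environ.items()
        }

        def text_start_response(status, headers, exc_info=None):
            # Translate the unicode status and headers back down to bytes.
            byte_headers = [
                (name.encode(encoding), value.encode(encoding))
                for name, value in headers
            ]
            return start_response(status.encode(encoding), byte_headers, exc_info)

        # Assumption: response content is already bytes and passes through.
        return unicode_app(text_environ, text_start_response)

    return bytes_level_app


def hello(environ, start_response):
    # A unicode-level application: it sees and produces text for metadata.
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [('Hello ' + environ['PATH_INFO']).encode('utf-8')]


application = unicode_layer(hello)
```

The gateway only ever sees application, which speaks bytes, so the
gateway itself needs no knowledge of the unicode convention.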

The danger in taking this approach is that you now risk having two
types of so called middleware. These are bytes middleware and unicode
middleware. Confusion obviously could come about if you accidentally
mix the two, although some middleware may actually be able to operate
on either bytes or unicode and so not care.

To avoid conflict, one could as a minimal measure just add an
additional 'wsgi.' variable which indicates whether the interface is
'bytes' or 'unicode' and hope middleware validate they have been
plugged in at the correct level. Alternatively, you could change the
interface in some way so that they couldn't be plugged together in the
first place.
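
The minimal measure could look something like the following sketch.
The key name 'wsgi.layer', the LayerMismatch exception and the
require_layer helper are all hypothetical; nothing here is part of any
WSGI specification.

```python
LAYER_KEY = 'wsgi.layer'  # hypothetical variable name, not standardised


class LayerMismatch(Exception):
    """Raised when a component is plugged in at the wrong layer."""


def require_layer(expected):
    """Decorator: refuse to run unless the environ declares the expected layer."""

    def decorator(app):
        def checked(environ, start_response):
            actual = environ.get(LAYER_KEY)
            if actual != expected:
                raise LayerMismatch(
                    'expected %r interface, found %r' % (expected, actual))
            return app(environ, start_response)
        return checked

    return decorator


@require_layer('bytes')
def bytes_only_app(environ, start_response):
    # Middleware or an application that only understands the bytes layer.
    start_response(b'200 OK', [(b'Content-Type', b'text/plain')])
    return [b'ok\n']
```

A check like this only helps if every component bothers to perform it,
which is why changing the interface so the two kinds cannot be combined
is the stronger option.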

Some may see this though as the opportunity to introduce a full
request/response object. There is some merit to that as these may
actually want to access the original bytes rather than deal with the
result of the unicode translation layer.

Anyway, that is the thought. Should we be looking at WSGI as a set of
layers instead of assuming we have to push unicode into the gateway
interface layer?

I don't believe this is the same as the prior question of whether WSGI
should be bytes or unicode as we are saying it encompasses both, but
as separate layers. Previously in asking whether it should be bytes or
unicode, if the answer was yes to bytes, then the intention was that
unicode would be out of scope and every man and his dog could do it
differently. Here we would still define the unicode layer that would
sit on top of the bytes layer.

If we were to say it is layered, and the gateway interface should
always be bytes to the extent of definition #2 above, it would
potentially pave the way for mod_wsgi and CherryPy WSGI servers to be
released in quick order.

Doing this does though in part take the Java approach of punting the
problem up to the next layer. The difference would be that whereas
Java doesn't really define that next translation layer as I understand
what people are saying, we could define it and so at least improve on
things.

FWIW, I thought of this because I was going to suggest that overall
we have a break from the discussion at this point. The
discussion has been robust and useful in helping us uncover the
issues, but I think we are all perhaps starting to get overwhelmed. I
was thus also going to suggest that we set up an area on bitbucket and
start documenting each of the main proposals, along with supplying
reference code which provides a Python 2.X WSGI 1.0 to WSGI Proposal
X.Y adapter that people could actually experiment with. Since wsgiref
and mod_wsgi in Python 3.X also basically use the same interface, we
could also supply reference code for Python 3.X. The point of doing this would
be such that the various proposals were documented concisely and
people could quickly come to understand what they are and compare
rather than have to wade through the mountains of email messages.

It was at this point it occurred to me that, since this layering is
possible on top of the original bytes interface for the purposes of
some reference code to demonstrate a new interface, maybe we should
continue to treat it as a series of translation layers that build on
top of the base raw bytes interface, rather than try and make it
monolithic.

Comments?

Graham

