[Web-SIG] My experiences implementing WSGI on java/j2ee/jython.

Mon Aug 30 02:32:23 CEST 2004

Dear Web-Sig,

Firstly, I must say, I am totally impressed with the WSGI initiative. At
first it wasn't clear how such low level structures could improve the
fragmented situation with python-web frameworks. But now that I've spent
some time implementing a framework that complies with the spec, I
understand it a *lot* better, and can see a lot of its benefits.

Secondly, I must apologise in advance for the length of this post :-)

I decided to write a java/j2ee/jython framework which layers WSGI on top 
of java servlets. I decided this for a number of reasons:

  - Because I want WSGI to succeed, and in open-source chances of 
success are greatly enhanced by running code.
  - Because jython needs to be included in WSGI from the ground up.
  - Because cpython and jython should be able to share web components.
  - Because WSGI needs testing against as many server architectures as 
possible.
  - Because the best way to test the quality and usability of a spec is 
to write software that implements it.
  - Because I pray for the day when we can pick and mix capabilities 
from the huge wealth of python web frameworks out there.
  - Because J2EE (i.e. traditional servlets) are sometimes far too 
restrictive, in terms of the way they handle cookies, authorisation, 
etc, and require configuring lots of XML files, which can be a pain: I 
don't like coding in XML, I like coding in python, where I can keep my 
configuration all in an appropriate format.
  - Because I want cpythonistas to keep jython in mind.
  - Because someone had to do it :-), and I do J2EE and jython stuff all 
the time in my work
  - Because WSGI was small enough to implement in a day or two.
  - A load of other good reasons.

My code is not ready for release. I only spent yesterday writing it: 
it's not big, approx 500 lines of java. But I haven't even compiled it 
yet, so it's got loads of syntax errors, no comments, no documentation, 
etc. I expect the compilation and debugging to take a day or two. 
However, I'm ridiculously busy at the moment, and really can't spare 
much time. The fact that I sacrificed my weekend to get jython WSGI 
up-and-running quickly may give you an idea of how important I consider 
the WSGI initiative. I promise I'll release my code by next weekend, 
whatever state it's in. If it's not 100% running, it'll be 90+% running, 
at least.

My design for the moment is really just to show a proof of concept, and 
a bare-bones framework. The framework will simply allow, through 
configuration, the user to map a URL to a python file, and to specify the
name of a callable object within that file, which will obviously be the
application. Application objects will be cached, based on the filename 
they came from. The request will be dispatched to the application in a 
WSGI compliant way. Simple. For the moment, I'm taking the easy way out, 
in relation to things like threading guarantees. Anything that asks to 
be single-threaded will still use a single instance, but calls from 
multiple threads will be synchronized on that single object, which 
wouldn't really work in a production framework. As WSGI evolves, I'll
make these kinds of facilities more robust and scalable.
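
A minimal sketch of the caching and locking described above, written in
python rather than java for brevity. The names ApplicationCache, load
and synchronized are made up for illustration and are not taken from the
actual framework code. Keying the cache on the callable name as well as
the filename is an extra assumption.

```python
import threading


class ApplicationCache:
    """Cache application callables, keyed by the file they came from."""

    def __init__(self):
        self._apps = {}
        self._lock = threading.Lock()

    def load(self, filename, callable_name):
        # Assumption: (filename, callable_name) is the cache key.
        key = (filename, callable_name)
        with self._lock:
            if key not in self._apps:
                namespace = {}
                with open(filename) as source:
                    code = compile(source.read(), filename, "exec")
                exec(code, namespace)
                self._apps[key] = namespace[callable_name]
            return self._apps[key]


def synchronized(application):
    """Serialise all calls to one application instance (the easy way out)."""
    lock = threading.Lock()

    def wrapper(environ, start_response):
        with lock:
            result = application(environ, start_response)
            try:
                # Consume the whole response while holding the lock, so
                # no two threads are ever inside the application at once.
                return list(result)
            finally:
                close = getattr(result, "close", None)
                if close is not None:
                    close()

    return wrapper
```

Every request for a single-threaded application queues behind the same
lock here, which is exactly why this would not scale in production.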

I don't see the point yet in trying to build any more facilities into my 
framework, e.g. url->object mapping, session management, page-template 
management components, authorization, etc. Hopefully, all of these 
facilities will become available as WSGI middleware components, written 
in nice python: not java, or nasty apache conf files, or servlet 
container XML files, blah, blah, blah.

Anyway, while I was writing my thing (with printed WSGI spec in hand,
covered in annotations, tick marks and red ink :-), I came across a few 
points in the spec that I'd like to raise about things that are either 
observations, or things that are incompletely specified, or that induce 
me to misunderstand, or seem just right or wrong.

Also, I've spent today catching up with the web-sig archives, to review 
everyone's comments (now that I'm in a position to understand them), and 
to make sure that I'm not trolling over old ground. So I've added one or 
two points of my own, based on reading those archives. Hopefully some of 
them will be useful.

Lastly, does anyone have any name suggestions for a
java/j2ee/jython WSGI-compliant framework? I've been thinking along the
lines of "modjy", but I'm open to better ones :-)

So on to the points/questions.

0. On choice of CGI as a basis.
===============================

My experience with J2EE has clearly demonstrated to me that CGI is the 
right choice to base WSGI upon. The J2EE servlet spec has a specific 
method to return every single CGI variable: the specs even mention "this
method returns the same as the CGI variable SCRIPT_NAME", etc. My job
as "translator" couldn't have been easier. I expect that many other 
containers/frameworks will also support the CGI spec in this way.

1. Default values of environment variables when not present.
============================================================

The spec says that compulsory environment variables, for example 
"CONTENT_LENGTH" or "CONTENT_TYPE", must have a value, i.e. "must be 
present, but may be an empty string, if there is no more appropriate 
value for them". I read "empty string" to mean "".

There are obviously two different choices for how to represent values 
for headers/env-vars that are not present in the request, i.e. 1. an 
empty string as described above or 2. as a python None value. It seems 
more correct to me to use the latter option, None, for when the 
header/env-var is not available, i.e. the client did not send it. This 
allows the use of the "" value to indicate (the admittedly rare and 
malformed case) that the client sent the header name, but did not 
specify a header value. If WSGI uses the empty string for both cases, 
then we lose the ability to distinguish between when the header was sent 
with no value, and when it wasn't sent at all.

I don't think it's a big deal losing that ability, but I could imagine 
that there might be, for example, some security application that might 
like to have access to that information.

For simplicity of the spec, and robustness of servers/apps running on 
WSGI, I understand why it is a good thing to make the default values as 
robust as possible, i.e. in case some app author tries to use a header 
value without checking if it is None first.

I suppose I'm really pointing out a possible wording difficulty in the 
spec, which says "may be an empty string, if there is no more 
appropriate value". To me None is "a more appropriate value" sometimes, 
so I suppose I could legitimately interpret that to mean that I can use 
None values in my WSGI-compliant framework, because my server 
infrastructure allows me to detect their absence or lack of value.

So perhaps either the wording of the spec needs to be tightened up to 
exclude this? Or the default environment values need to be more clearly 
specified? Or perhaps a discussion of None vs. empty string needs to be
added to the Q&A at the end?
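
To make the two readings concrete, here is a small sketch of how an
application might look at such a variable. The function names
header_state and content_length are invented for this example; only the
CONTENT_LENGTH key comes from the spec.

```python
def header_state(environ, key):
    """Distinguish the cases that None vs. "" would let us tell apart."""
    value = environ.get(key)
    if value is None:
        return "absent"      # client did not send the header at all
    if value == "":
        return "empty"       # header name sent, but no value
    return "present"


def content_length(environ):
    """Robust reader: treats a missing key, None and "" all alike."""
    raw = environ.get("CONTENT_LENGTH")
    if not raw:
        return 0
    return int(raw)
```

An application written like content_length() works under either
interpretation; one written like header_state() only gets useful answers
if the server is allowed to pass None.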

2. The SCRIPT_NAME variable.
============================

At first I was a little wary of the SCRIPT_NAME variable, and how I 
would construct it, until I realised that the beginning of the 
URL->Callable mapping is outside the scope of WSGI: it is in the control 
of whichever program/process/container is receiving HTTP requests 
through sockets from the client, and resolving/dispatching them 
according to its configuration files: in my case that was a J2EE 
container, e.g. Tomcat.

The J2EE call that returns a value equivalent to the CGI SCRIPT_NAME
variable is the HttpServletRequest.getServletPath method. There is an
interesting note on it which says that "This method will return an empty
string ("") if the servlet used to process this request was matched 
using the "/*" pattern." Which seems a little odd, until you realise 
that the SCRIPT_NAME = "" case is when the application object is 
responsible for dealing with the entire URL space. Maybe it's worth 
adding a note to this effect in the WSGI spec as well? It helped me 
understand things better.

An idea occurs to me for a nice little reusable WSGI middleware 
component which is a URI mapper, with functionality akin to apache 
mod_rewrite, resolving URIs to python callables. A lot of frameworks
like to do things with URL rewriting and mapping, in order to present a 
nice clean URL interface to a tree of objects. Quixote is one such 
framework that likes to have crisp URLs. But much of the time installing 
such frameworks requires configuring apache and invoking mod_rewrite and 
its "cool voodoo" to get the job done. Which can be difficult to debug 
and get working, and scares newbies. (On re-reading the spec, and the 
mailing list, I see I'm not the only one to have thought of such a uri 
mapping component :-)

If I wrote such a reusable mapping component, I could then simply 
configure my entire "container", e.g. Apache, Tomcat, etc, etc, to 
simply resolve all requests for a URL hierarchy to my python component, 
and nice-n-easy python code takes care of it from there, no mod_rewrite 
rules, no complex java servlets mapping algorithm: just python. A big 
win in terms of both installation simplicity and portability, since that 
standard component could then be used across all WSGI frameworks and the 
containers in which they live. I like this WSGI idea :-)
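
A rough sketch of what such a mapping component could look like. This is
only an illustration under my own assumptions: the class name URIMapper
and the prefix-to-callable dictionary are invented here, and it does
simple longest-prefix matching rather than anything as powerful as
mod_rewrite. Response bodies are written as byte strings.

```python
class URIMapper:
    """Dispatch to the application registered for the longest prefix."""

    def __init__(self, routes):
        # routes: hypothetical dict mapping URL prefix -> WSGI callable.
        # Try longer prefixes first so "/blog/admin" beats "/blog".
        self.routes = sorted(
            routes.items(), key=lambda item: len(item[0]), reverse=True
        )

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        for prefix, application in self.routes:
            if path == prefix or path.startswith(prefix + "/"):
                # Move the matched prefix from PATH_INFO to SCRIPT_NAME,
                # so the target sees itself as the root of its own space.
                script_name = environ.get("SCRIPT_NAME", "")
                environ["SCRIPT_NAME"] = script_name + prefix
                environ["PATH_INFO"] = path[len(prefix):]
                return application(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
```

The container then only needs one rule, sending the whole URL hierarchy
to a URIMapper instance. A prefix of "" acts as a catch-all, which is
the same SCRIPT_NAME = "" case discussed above.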

3. Status code and message.
===========================

The WSGI spec states that the status value passed to start_response 
should be of the form "999 Message here". That's fine, I can parse up 
the string easily enough to get the java data types I need to send to 
the container. However, J2EE does not allow me to set the message 
string: I can only set the status code, and that must have an integer 
value.

So, in terms of compliance with WSGI, am I in violation of the WSGI spec 
by not transmitting the actual textual status message specified by the 
application? If that's a problem, there's nothing I can do about it.

I wonder how often this will be the case with other server/container 
frameworks?
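
For reference, the parsing step amounts to something like the sketch
below (shown in python; parse_status is a name invented for this
example). The integer is what can be handed to the container, for
example through HttpServletResponse.setStatus(int); the message is what
gets lost.

```python
def parse_status(status):
    """Split a WSGI status string into (integer code, message text)."""
    code_text, _, message = status.partition(" ")
    return int(code_text), message
```

So parse_status("404 Not Found") gives the pair (404, "Not Found"), and
only the 404 survives the trip into the servlet API.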

4. Binary vs. textual writing.
==============================

Normally, when python opens a file in text mode, line-ending translation
takes place on all python strings written to the file, changing '\n' to
whatever is the appropriate local line-ending. This is not noticeable on 
*nix, since *nix uses the same line-ending character as python, '\n', so 
no translation is necessary. This means that people running python on 
*nix can write binary data through channels opened in text mode. On 
other platforms though, namely Windows and MacOS, different line-endings 
are used, and python's '\n' gets translated to '\r\n' and '\r' 
respectively. Which corrupts binary files, e.g. .jpg, .gif, if they 
contain '\n'. So Windows and MacOS python users must open files 
explicitly in binary mode if they want to avoid this translation.

It is a fundamental requirement (to me at least) that WSGI be able to
handle writing of binary data. And I'm fairly sure the intention for the
write() callable in WSGI is that it take python "strings", which
includes strings of binary data. But perhaps it needs to be made
clear in the WSGI spec that the write() callable explicitly writes in
binary mode, i.e. that no translation is taking place on byte strings
passed to it, and the application/user is responsible for all encoding
concerns relating to byte strings passed to the write() callable.
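
The corruption is easy to reproduce on any platform. The sketch below
assumes a modern python with the io module, and simulates a text-mode
channel by asking io.TextIOWrapper for a specific line ending. The
helper names are invented for the example.

```python
import io


def write_text_mode(data, newline):
    """Write bytes through a text channel using the given line ending."""
    raw = io.BytesIO()
    channel = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    # latin-1 maps each byte to one character, so only the newline
    # translation can change the data.
    channel.write(data.decode("latin-1"))
    channel.flush()
    return raw.getvalue()


def write_binary_mode(data):
    """Write bytes through a binary channel: nothing is translated."""
    raw = io.BytesIO()
    raw.write(data)
    return raw.getvalue()
```

With a Windows-style line ending, write_text_mode(b"ab\ncd", "\r\n")
yields b"ab\r\ncd", one byte longer than what the application sent,
while write_binary_mode() returns the data untouched.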

5A. Python 2.1 vs. python 2.2: iterators and generators.
========================================================

The WSGI spec says that python 2.2 features are required to be 
compliant. However, it appears to me that the only python 2.2 features 
in use are iterators and generators, used when the application object 
returns an iterator. In fact, it's just that the example in the WSGI 
spec uses a generator (and its corresponding 'yield' keyword): actual 
applications are not required to use a generator: they can also return 
an object that implements the iterator protocol. Which means returning 
an object with a .next() method when the .__iter__() method is called. 
The iterator.next() method keeps returning values, until the iterator 
runs out, in which case it raises StopIteration. Like generators, the 
iterator protocol was also introduced in python 2.2, but they are two 
separate things.

However, even though jython is based on python 2.1, and thus doesn't 
have built-in support for either iterators or generators, I have still 
implemented the iterator protocol in my java/jython framework, by simply 
invoking the .__iter__() and .next() methods on application objects, and 
catching StopIteration exceptions. So I can support components and 
applications returning iterators, and I'm thus compliant with the spec, 
even though I'm running on 2.1. (This is only possible because I'm 
embedding: it is still not possible to support the iterator protocol in, 
say, jython for-loops)

Does the spec need to be changed to reflect this iterators/versioning 
issue? Or to more clearly define the difference between iterators and 
generators?

It's conceivable that even a python 1.5 framework could be programmed to 
support the iterator protocol: it's *very* easy to implement.
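
To show how little is involved, here is a sketch of driving the iterator
protocol by hand, without a for-loop or generators. It is written to run
on a current python, so it falls back to the newer __next__ spelling
when .next() is missing; Countdown and drain are names invented for the
example.

```python
class Countdown:
    """An iterable application result that needs no generator support."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def next(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return "chunk %d\n" % (self.current + 1)

    __next__ = next  # later pythons renamed the method


def drain(result, write):
    """What the server does: call __iter__, then next() until it stops."""
    iterator = result.__iter__()
    advance = getattr(iterator, "next", None) or iterator.__next__
    while True:
        try:
            chunk = advance()
        except StopIteration:
            break
        write(chunk)
```

The embedding server only ever needs the drain() half of this; the
application side can be any object with the two methods.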

5B. A "python.version" WSGI variable?
=====================================

Of course, it will be the case that some middleware and applications will
need to use more advanced and recent (2.2, 2.3, 2.4) language
features, such as generators, generator expressions, decorators, etc.
But such components and applications will not be usable under jython,
which is 2.1. It would be nice for components and applications to have a
way of knowing what version of python they are running under. Similarly,
there will be jython components and applications that require java
libraries, and thus won't be usable on cpython of any version.

Would it be useful to define a WSGI variable "python.version", similar 
to "wsgi.version", which gives the python version in effect? In most 
cases under jython, it wouldn't help, because its 2.1 compiler would 
choke when loading python files with newer python syntax anyway, giving 
syntax errors. But it might be useful in some circumstances, perhaps for 
sophisticated dispatchers with the requisite meta-data available to 
them? I'm not sure on this one. Maybe the values of sys.platform and 
os.name give enough information to deal with this problem?
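
If such a variable were adopted, a server or a thin middleware could
supply it along these lines. Note that "python.version" is only the
proposal made above and is not part of the WSGI spec; the tuple format
and the name add_python_version are likewise assumptions.

```python
import sys


def add_python_version(application):
    """Wrap an application so it can read environ["python.version"]."""

    def wrapper(environ, start_response):
        # Hypothetical key, modelled on the tuple style of wsgi.version.
        environ.setdefault("python.version", tuple(sys.version_info[:3]))
        return application(environ, start_response)

    return wrapper
```

A dispatcher could then compare environ["python.version"] >= (2, 2)
before choosing a component that relies on generators.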

6. Streaming and flushing.
==========================

I see there has been discussion on the list about streaming output and 
flushing. In one message, Philip said "I'm suggesting that write() 
should be guaranteed to either:

    1) Flush all output before returning, or
    2) Put data in a buffer that will be emptied by another thread or by
       the operating system

To be a conforming implementation, a server/gateway must do one or the
other."

In the J2EE case (and I'm sure with Apache CGI), that's very simple to 
deal with, since the container will do its own buffering completely
outside your control, and send the pieces with chunked-transfer encoding 
if necessary. So even if I put a flush on the output channel in my 
framework, I'm only flushing it to the container's buffer: it's still 
not guaranteed to send output back down the return socket to the client.

Just a datapoint.

7. Redirects.
=============

I read some discussion in the lists on how to handle container specific 
facilities, e.g. Apache/mod_python's ability to internally redirect a 
request.

J2EE offers the same capabilities, to internally redirect a request, 
without sending a response back to the client. It happens in a slightly 
different way, because you first ask your container for a dispatcher, 
based on a url, and then call that dispatcher to redirect to the URL. 
And the client may not see any redirect HTTP responses: it's all 
internal to the container.

I see the solution to this redirect platform-dependence problem in the 
implementation of a platform-independent WSGI middleware component that 
takes all responsibility for redirects. This component examines the
wsgi.environment present, seeking hints for the optimal way to redirect
the request: if mod_python is available, use the mod_python API call:
if modjy is available, use the getDispatcher(uri).redirect() dance, etc.
If none of these platform specific techniques are available, it can fall
back to sending a 302 or 307 response back to the client, and let the
client re-request the new URL.

If the platform specific techniques are available, their availability 
will be signalled in wsgi.envvars by the presence of variables such as
"mod_python.request" or "modjy.servlet_context", etc. So one
ultraportable component could do it all (albeit chock full of special 
cases).
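
The selection logic could be as small as the sketch below. Everything
here is an assumption for illustration: the redirect function, the
internal_handlers table, and the idea that each platform supplies a
handler keyed by the environ variable that announces it. The real
mod_python or servlet dispatcher calls are deliberately left out.

```python
def redirect(environ, start_response, location, internal_handlers=None):
    """Redirect internally if the platform can, else via the client."""
    # internal_handlers: hypothetical dict mapping an environ key such
    # as "mod_python.request" or "modjy.servlet_context" to a callable
    # that performs the container's own internal redirect.
    for key, handler in (internal_handlers or {}).items():
        if key in environ:
            return handler(environ, start_response, location)
    # Portable fallback: let the client re-request the new URL.
    headers = [("Location", location), ("Content-Length", "0")]
    start_response("302 Found", headers)
    return []
```

Applications would call redirect() without caring which branch is taken,
which keeps the special cases in one place.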

Problem solved?

8. Write callable and fileno()
==============================

It is a good idea to check for the fileno() attribute on the write 
callable, since many platforms/frameworks have high-performance ways of 
transferring file contents to sockets, for example. Java 1.4 nio has 
this capability, through the use of directBuffers, memory-mapped files, 
and special natively implemented methods to transfer between the two. 
I'd be surprised if containers like Apache don't support something
similar. This can drastically improve throughput on static files.

Java objects have "channel"s, or "outputStream"s not "fileno"s. But 
that's an easy problem to fix.

9. Server-detected headers.
===========================

I can see the reason for servers/containers intercepting client headers 
and translating/augmenting/deleting them. However, do we need a 
specification of what to do with certain specified headers? As with
CGI, should I recognise the "Status: " header or the "Location: " 
header, and translate it to the relevant status code, or do a redirect, 
respectively? If I don't do those translations, won't I be breaking 
reams of python CGI code out there that relies on Apache doing this?
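
If servers were expected to do this, the translation itself is short. A
sketch follows, assuming the simplest reading of the CGI behaviour: a
"Status" header sets the response status, and a "Location" header
without a status becomes a 302. It does not handle the CGI local
redirect case, and translate_cgi_headers is an invented name.

```python
def translate_cgi_headers(headers, default_status="200 OK"):
    """Turn CGI-style response headers into (status, remaining headers)."""
    status = None
    location_seen = False
    remaining = []
    for name, value in headers:
        lowered = name.lower()
        if lowered == "status":
            # "Status" is a message to the server, not a real header.
            status = value.strip()
            continue
        if lowered == "location":
            location_seen = True
        remaining.append((name, value))
    if status is None:
        if location_seen:
            status = "302 Found"
        else:
            status = default_status
    return status, remaining
```

Whether this belongs in every server or in one shared middleware
component is the question being asked above.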

10. The "wsgi.errors" environment variable.
==========================================

Under J2EE, setting the "wsgi.input" variable is easy, I just wrap the 
HttpServletRequest.getInputStream() with an org.python.core.PyFile, and 
bingo.

However, the J2EE HttpServletRequest has no corresponding error stream, 
nor does the corresponding HttpServletResponse paired with each request. 
The only mechanism I can use to send error output is the "sendError(int, 
message)" method of HttpServletResponse. Which allows me to send both an 
integer status code and a textual message, which the J2EE docs say "The 
server defaults to creating the response to look like an HTML-formatted 
server error page containing the specified message, setting the content 
type to "text/html", leaving cookies and other headers unmodified".

So I can't send error output this way without also knowing a status code 
for it as well.

Which makes me wonder what the "wsgi.errors" variable is for? Yes, it's
for writing error data. But what do we expect to happen to data that
gets written to it? Will it be wrapped or translated in some way, and
used to construct an error response to the user? Or should it be
locally logged by the server?

I know that this is all J2EE specific stuff, as is confirmed by the rest 
of the documentation sentence I quoted above: "If an error-page 
declaration has been made for the web application corresponding to the 
status code passed in [to the sendError method], it will be served back 
in preference to the suggested msg parameter." WSGI (rightly) has no 
concept of "configured error page declarations", so it would seem the 
"sendError" method is not the right method to use to implement 
"wsgi.errors".

So I'm going to have to treat the error output in some other way, which 
means I need to know more about what it is. Before I can implement a 
jython framework that is fully compliant with the WSGI spec, I need to 
know what will happen to any output sent to "wsgi.errors", so that I can
code for whatever eventualities arise.

Or if it's always to be a framework specific thing, maybe I'll just 
redirect all "wsgi.errors" output to /dev/null, for example? The J2EE 
ServletContext for each servlet has a "log(message)" method. Maybe I 
should just send error output there, in which case it will end up in the
server logs?
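
If logging turns out to be the answer, the adapter is small. Here is a
sketch in python of a file-like object that forwards complete lines to a
log function, which in this case would be a wrapper around the
ServletContext "log(message)" method. The class name LogErrorStream and
the line-buffering behaviour are assumptions for the example.

```python
class LogErrorStream:
    """File-like object that sends each complete line to a logger."""

    def __init__(self, log):
        self._log = log        # any callable taking one message string
        self._pending = ""

    def write(self, text):
        self._pending += text
        # Emit one log entry per finished line, keep any partial line.
        while "\n" in self._pending:
            line, self._pending = self._pending.split("\n", 1)
            self._log(line)

    def writelines(self, lines):
        for line in lines:
            self.write(line)

    def flush(self):
        if self._pending:
            self._log(self._pending)
            self._pending = ""
```

The server would then place an instance in the environment under
"wsgi.errors", and nothing written there would ever reach the client.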

That's all for now.

onwards-and-upwards-ly y'rs,

Alan.