[Web-SIG] Backup plan: WSGI 1 Addenda and wsgiref update for Py3

Tue Sep 21 18:09:44 CEST 2010

While the Web-SIG is trying to hash out PEP 444, I thought it would 
be a good idea to have a backup plan that would allow the Python 3 
stdlib to move forward, without needing a major new spec to settle 
out implementation questions.

After all, even if PEP 333 is ultimately replaced by PEP 444, it's 
probably a good idea to have *some* sort of WSGI 1-ish thing 
available on Python 3, with bytes/unicode and other matters settled.

In the past, I was waiting for some consensuses (consensi?) on 
Web-SIG about different approaches to Python 3, looking for some sort 
of definite, "yes, we all like this" response.  However, I can see 
now that this just means it's my fault we don't have a spec yet.  :-(

So, unless any last-minute showstopper rebuttals show up this week, 
I've decided to go ahead and officially bless nearly all of what Graham 
Dumpleton (who's not only the mod_wsgi author, but has put huge 
amounts of work into shepherding WSGI-on-Python3 proposals, WSGI 
amendments, etc.) has proposed, with a few minor exceptions.

In other words: almost none of the following is my own original work; 
it's like 90% Graham's.  Any praise for this belongs to him; the only 
thing that belongs to me is the blame for not doing this 
sooner!  (Sorry Graham.  You asked me to do this ages ago, and you were right.)

Anyway, I'm posting this for comment to both Python-Dev and the 
Web-SIG.  If you are commenting on the technical details of the 
amendments, please reply to the Web-SIG only.  If you are commenting 
on the development agenda for wsgiref or other Python 3 library 
issues, please reply to Python-Dev only.  That way, neither list will 
see off-topic discussions.  Thanks!

The Plan
========

I plan to update the proposal below per comments and feedback during 
this week, then update PEP 333 itself over the weekend or early next 
week, followed by a code review of Python 3's wsgiref, and 
implementation of needed changes (such as recoding os.environ to 
latin1-captured bytes in the CGI handler).
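
As a rough sketch of what "latin1-captured bytes" could mean in practice: 
on POSIX systems, Python 3 decodes environment variables with the 
filesystem encoding and the ``surrogateescape`` error handler, so the 
original bytes can be recovered and then re-decoded as latin1.  The 
helper name and the explicit ``fs_encoding`` parameter below are 
illustrative assumptions, not part of wsgiref.

```python
def recode_environ_value(value, fs_encoding="utf-8"):
    """Return `value` as a str whose code points are the original raw bytes.

    Hypothetical helper: `fs_encoding` stands in for whatever encoding
    Python used when it decoded the process environment.
    """
    # Undo Python's decoding to recover the bytes the OS actually supplied.
    raw = value.encode(fs_encoding, "surrogateescape")
    # Map each byte to the code point with the same number (latin1).
    return raw.decode("iso-8859-1")
```

A CGI handler could apply something like this to each value it copies 
from ``os.environ`` into the WSGI environ, so that applications can get 
the raw bytes back with ``value.encode("iso-8859-1")``.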

To complete the changes, it is possible that I may need assistance 
from one or more developers who have more Python 3 experience.  If 
after reading the proposed changes to the spec, you would like to 
volunteer to help with updating wsgiref to match, please let me know!

The Proposal
============

Overview
--------

1. The primary purpose of this update is to provide a uniform porting 
pattern for moving Python 2 WSGI code to Python 3, meaning a pattern 
of changes that can be mechanically applied to as little code as 
practical, while still keeping the WSGI spec easy to programmatically 
validate (e.g. via ``wsgiref.validate``).

The Python 3 specific changes are to use:

* ``bytes`` for I/O streams in both directions
* ``str`` for environ keys and values
* ``bytes`` for arguments to start_response() and write()
* text stream for wsgi.errors

In other words, "strings in, bytes out" for headers, bytes for bodies.
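
For illustration, a minimal Python 3 application written against the 
rules as drafted in this post might look like the sketch below.  The 
application and its response text are made up for the example; the 
point is only which values are ``str`` and which are ``bytes``.

```python
def application(environ, start_response):
    # Strings in: environ keys and values are native ``str`` objects.
    path = environ.get("PATH_INFO", "/")

    # Bodies are bytes, so text has to be encoded explicitly.
    body = ("Hello from %s\n" % path).encode("utf-8")

    # Bytes out: under this draft, the status line and every header
    # name and value passed to start_response() are bytes as well.
    headers = [
        (b"Content-Type", b"text/plain; charset=utf-8"),
        (b"Content-Length", str(len(body)).encode("iso-8859-1")),
    ]
    start_response(b"200 OK", headers)
    return [body]
```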

In general, only changes that don't break Python 2 WSGI 
implementations are allowed.  The changes should also not break 
mod_wsgi on Python 3, but may make some Python 3 WSGI applications 
non-compliant, despite continuing to function on mod_wsgi.

This is because mod_wsgi allows applications to output string headers 
and bodies, but I am ruling that option out because it forces every 
piece of middleware to have to be tested with arbitrary combinations 
of strings and bytes in order to test compliance.  If you want your 
application to output strings rather than bytes, you can always use a 
decorator to do that.  (And a sample one could be provided in wsgiref.)
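
One possible shape for such a decorator is sketched below, for Python 3 
only.  The name ``accepts_text`` and its ``charset`` parameter are 
hypothetical, and the choice of latin1 for the status and headers is an 
assumption made for the example; nothing like this exists in wsgiref 
today.

```python
import functools


def _to_bytes(value, encoding):
    # Leave bytes alone; encode native strings.
    return value.encode(encoding) if isinstance(value, str) else value


def accepts_text(charset="utf-8"):
    """Wrap an app that emits str so that the server only sees bytes."""
    def decorator(app):
        @functools.wraps(app)
        def wrapper(environ, start_response):
            def bytes_start_response(status, headers, exc_info=None):
                status_b = _to_bytes(status, "iso-8859-1")
                headers_b = [
                    (_to_bytes(name, "iso-8859-1"), _to_bytes(val, "iso-8859-1"))
                    for name, val in headers
                ]
                if exc_info is None:
                    write = start_response(status_b, headers_b)
                else:
                    write = start_response(status_b, headers_b, exc_info)
                # Encode anything passed to the legacy write() callable too.
                return lambda data: write(_to_bytes(data, charset))

            result = app(environ, bytes_start_response)
            try:
                for chunk in result:
                    yield _to_bytes(chunk, charset)
            finally:
                # Always honor the wrapped iterable's close(), if any.
                close = getattr(result, "close", None)
                if close is not None:
                    close()
        return wrapper
    return decorator
```

An application would then be declared with ``@accepts_text()`` (or 
``@accepts_text("iso-8859-1")``, etc.) and could keep returning text, 
while middleware and servers only ever have to deal with bytes.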

2. The secondary purpose of the update is to address some 
long-standing open issues documented here:

    http://www.wsgi.org/wsgi/Amendments_1.0

As with the Python 3 changes, only changes that don't retroactively 
invalidate existing implementations are allowed.

3. There is no tertiary purpose.  ;-)  (By which I mean, all other 
kinds of changes are out-of-scope for this update.)

4. The section below labeled "A Note On String Types" is proposed for 
verbatim addition to the "Specification Overview" section in the PEP; 
the other sections below describe changes to be made inline at the 
appropriate part of the spec, and changes that were proposed but are 
rejected for inclusion in this amendment.

A Note On String Types
----------------------

In general, HTTP deals with bytes, which means that this 
specification is mostly about handling bytes.

However, the content of those bytes often has some kind of textual 
interpretation, and in Python, strings are the most convenient way to 
handle text.

But in many Python versions and implementations, strings are Unicode, 
rather than bytes.  This requires a careful balance between a usable 
API and correct translations between bytes and text in the context of 
HTTP...  especially to support porting code between Python 
implementations with different ``str`` types.

WSGI therefore defines two kinds of "string":

* "Native" strings (which are always implemented using the type named ``str``)

* "Bytestrings" (which are implemented using the ``bytes`` type in 
Python 3, and ``str`` elsewhere)

So, even though HTTP is in some sense "really just bytes", there are 
many API conveniences to be had by using whatever Python's default 
``str`` type is.

Do not be confused, however: even if Python's ``str`` is actually 
Unicode under the hood, the *content* of a native string is still 
restricted to bytes!  See the section on `Unicode Issues`_ later in 
this document.

In short: where you see the word "string" in this document, it refers 
to a "native" string, i.e., an object of type ``str``, whether it is 
internally implemented as bytes or unicode.  Where you see references 
to "bytestring", this should be read as "an object of type ``bytes`` 
under Python 3, or type ``str`` under Python 2".
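
To make the distinction concrete, here is a small Python 3 sketch of 
moving between the two kinds of string.  The helper names are invented 
for the example, and it assumes latin1 is the encoding used to carry 
bytes inside a native string, as in the CGI handler change mentioned 
above.

```python
def native_to_bytes(s):
    # Fails with UnicodeEncodeError if `s` holds anything outside
    # code points 0-255, i.e. anything that is not "really" a byte.
    return s.encode("iso-8859-1")


def bytes_to_native(b):
    # Every byte maps to the code point with the same number.
    return b.decode("iso-8859-1")


def decode_path(environ, charset="utf-8"):
    """Recover the real text of PATH_INFO from its native-string form.

    `charset` is whatever encoding the application believes the
    client used; UTF-8 is only a guess made for this example.
    """
    raw = native_to_bytes(environ["PATH_INFO"])
    return raw.decode(charset)
```

So a request for ``/caf%C3%A9`` would show up in the environ as a 
six-character native string, and only becomes the five-character text 
``/café`` once the application decodes it.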

Clarifications (To be made in-line)
-----------------------------------

The following amendments are clarifications to parts of the existing 
spec that proved over the years to be ambiguous or insufficiently 
specified, as well as some attempts to correct practical errors.

(Note: many of these issues cannot be completely fixed in WSGI 1 
without breaking existing implementations, and so the text below has 
notations such as "(MUST in WSGI 2)" to indicate where any 
replacement spec for WSGI 1 should strengthen them.)

* If an application returns a body iterator, a server (or middleware) 
MAY stop iterating over it and discard the remainder of the output, 
as long as it calls any close() method provided by the 
iterator.  Applications returning a generator or other custom 
iterator SHOULD NOT assume that the entire iterator will be 
consumed.  (This change makes it explicit that caching middleware or 
HEAD-processing servers can throw away the response body.)

* start_response() SHOULD (MUST in WSGI 2) check for errors in the 
status or headers at the time it's called, so that an error can be 
raised as close to the problem as possible.

* If start_response() raises an error when called normally (i.e. 
without exc_info), it SHOULD be an error to call it a second time 
without passing exc_info.

* The SERVER_PORT variable is of type str, just like any other CGI 
environ variable.  (According to the WSGI wiki, "some 
implementations" expect it to be an integer, even though there is 
nothing in the WSGI spec that allows a CGI variable to be anything but a str.)

* A server SHOULD (MUST in WSGI 2) support the size hint argument to 
readline() on its wsgi.input stream.

* A server SHOULD (MUST in WSGI 2) return an empty bytestring from 
read() on wsgi.input to indicate an end-of-file condition.  (In WSGI 
2, language should be clarified to allow the input stream length and 
CONTENT_LENGTH to be out of sync, for reasons explained in Graham's blog post.)

* A server SHOULD (MUST in WSGI 2) allow read() to be called without 
an argument, and return the entire remaining contents of the stream.

* If an application provides a Content-Length header, the server 
SHOULD NOT (MUST NOT in WSGI 2) send more data to the client than was 
specified in that header, whether via write(), yielded body 
bytestrings, or via a wsgi.file_wrapper.  (This rule applies to 
middleware as well.)

* wsgi.errors is a text stream accepting "native strings".
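
As an example of how two of these clarifications interact, the sketch 
below shows a server-side (or middleware) helper that stops iterating 
once the declared Content-Length has been sent, and still calls the 
body's close() method.  The function name is hypothetical, and it 
assumes the caller has already parsed Content-Length into an integer.

```python
def limit_body(body, content_length):
    """Yield at most `content_length` bytes from a WSGI body iterable."""
    remaining = content_length
    try:
        if remaining > 0:
            for chunk in body:
                # Trim the chunk that crosses the declared length.
                chunk = chunk[:remaining]
                remaining -= len(chunk)
                yield chunk
                if remaining <= 0:
                    # Discard the rest of the output without reading it.
                    break
    finally:
        # Early termination is only allowed if close() still gets called.
        close = getattr(body, "close", None)
        if close is not None:
            close()
```

Note that an application whose body is a generator cannot rely on code 
after its last ``yield`` running in this situation, which is why the 
spec text says not to assume the whole iterator will be consumed.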

Rejected Amendments
-------------------

* Manlio Perillo's suggestion to allow header specification to be 
delayed until the response iterator is producing non-empty 
output.  This would've been a possible win for async WSGI, but could 
require substantial changes to existing servers.