[Web-SIG] Re: Bill's comments on WSGI draft 1.4

Thu Sep 2 05:25:56 CEST 2004

At 06:07 PM 9/1/04 -0700, Bill Janssen wrote:

>1.  The "environ" parameter must be a Python dict: I think subclasses
>should be allowed.  A true subclass supports all methods of its
>ancestors, so the rationale presented in the back of the PEP for
>excluding them doesn't hold water.  I think the appropriate check
>would be to see if the returned class is a subclass of the "dict"
>class.  That is, "isinstance(e, dict)" should return True.

Paradoxically, allowing subclasses eliminates the usefulness of allowing 
subclasses.  Presumably, the purpose of using a subclass is to provide some 
extended behavior, e.g. as an attribute/method, or as a byproduct of 
requesting particular keys or values.  In both cases, these extended 
behaviors would be destroyed the minute that a piece of middleware decides 
to use its *own* dictionary subclass.
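To make the middleware point concrete, here is a minimal sketch in present-day Python. 'TaggedEnviron', 'request_id' and 'copying_middleware' are names invented for illustration, not anything from the PEP; the middleware simply does what a lot of middleware is likely to do, namely build a fresh dictionary.

```python
class TaggedEnviron(dict):
    """Hypothetical server-specific environ subclass with an extra method."""

    def request_id(self):
        return self.get("HTTP_X_REQUEST_ID", "unknown")


def app(environ, start_response):
    # The application can only use the extension if it survives the trip.
    has_extension = hasattr(environ, "request_id")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [str(has_extension).encode("ascii")]


def copying_middleware(application):
    """Hypothetical middleware that passes along its own copy of environ."""

    def wrapper(environ, start_response):
        new_environ = dict(environ)  # a plain dict: the subclass is gone
        new_environ["example.flag"] = "1"
        return application(new_environ, start_response)

    return wrapper


def call(application, environ):
    # Toy driver: ignore status and headers, join the body blocks.
    return b"".join(application(environ, lambda status, headers: None))


direct = call(app, TaggedEnviron())
wrapped = call(copying_middleware(app), TaggedEnviron())
```

Called directly, the application sees the extra method; behind the middleware it does not, and nothing in the spec would let it find out why.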

This also ignores the issue that creating a dictionary subclass that 
*consistently* enforces some extended behavior (e.g. lazy evaluation of a 
key) is intrinsically difficult and fragile, because new versions of Python 
often introduce new dictionary methods that are not implemented in terms of 
other existing methods, thus breaking a previously "perfect" subclass when 
a new Python version is released.
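As an illustration of the fragility described above, consider this sketch of a lazily evaluating subclass (the class name and the lazy-callable convention are assumptions made up for the example). It overrides only '__getitem__', which looks sufficient until someone calls 'get()'.

```python
class LazyEnviron(dict):
    """Hypothetical environ subclass: callables stored as values are
    evaluated on first subscript lookup and the result is cached."""

    def __getitem__(self, key):
        value = dict.__getitem__(self, key)
        if callable(value):
            value = value()
            dict.__setitem__(self, key, value)
        return value


environ = LazyEnviron({"REMOTE_USER": lambda: "alice"})

# dict.get() does not go through the overridden __getitem__, so the
# caller receives the unevaluated callable instead of the string.
via_get = environ.get("REMOTE_USER")

# Subscripting does use the override and triggers the lazy evaluation.
via_index = environ["REMOTE_USER"]
```

Every other accessor ('get', 'items', 'values', 'pop', 'setdefault', and whatever gets added later) needs the same treatment to keep the behavior consistent.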

These are "practicality beats purity" arguments, so I need to see some 
*practical* applications of dictionary subclasses that would be useful 
enough to outweigh both of the above issues.

>2.  The "fileno" attribute on the returned iterable.  I'm a bit
>concerned about using operating system file descriptors, due to
>resource constraints; I think a better check would be to see if the
>returned iterable is a subclass of the "file" class.  That is,
>"isinstance(f, file)" should return true.

The purpose of 'fileno' is specifically to allow the use of operating 
system APIs that copy data from one file descriptor to another.  Many 
Python objects have valid 'fileno' attributes besides files, including 
sockets and pipes.  Many non-stdlib objects in common use have 'fileno' 
attributes that serve this purpose.  'select.select' takes objects with 
'fileno', and so on.

Because 'file' has a 'fileno' attribute, 'isinstance(f,file)' implies 
'hasattr(f,"fileno")'.  Therefore, the latter is the preferred check 
here, because it doesn't unnecessarily exclude other valid wrappers of file 
descriptors.
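A rough sketch of the difference between the two checks, written for present-day Python where 'io.IOBase' stands in for the old 'file' type. 'PipeReader' is a made-up wrapper, used only to represent the kind of non-file object that still carries a usable descriptor.

```python
import io
import os


class PipeReader:
    """Hypothetical non-file wrapper around a raw OS descriptor."""

    def __init__(self, fd):
        self._fd = fd

    def fileno(self):
        return self._fd


def supports_fd_copy(result):
    """Attribute-based check: does the object expose a descriptor?"""
    return hasattr(result, "fileno")


def is_file_object(result):
    """Type-based check (io.IOBase is an assumed stand-in for 'file')."""
    return isinstance(result, io.IOBase)


read_fd, write_fd = os.pipe()
try:
    pipe = PipeReader(read_fd)
    with open(os.devnull, "rb") as real_file:
        file_checks = (supports_fd_copy(real_file), is_file_object(real_file))
    pipe_checks = (supports_fd_copy(pipe), is_file_object(pipe))
finally:
    os.close(read_fd)
    os.close(write_fd)

list_checks = (supports_fd_copy([b"hello"]), is_file_object([b"hello"]))
```

The real file passes both checks, the pipe wrapper passes only the attribute check, and an ordinary list of strings passes neither.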

>3.  Comments about "The [status-line] string must be 7-bit
>ASCII...containing no control characters."  That's overly restrictive;
>I think it would be better to simply refer to RFC 2616 and say that it
>should follow the rules defined there for "Reason-Phrase".
>
>4.  Similarly, the rules about header values are more restrictive than
>HTTP; they therefore prevent perfectly valid HTTP header values from
>being returned.  That's bad.  Again, I think the PEP should simply
>refer to RFC 2616 and say, "Use those rules".

These restrictions are intended to simplify servers and middleware; nobody 
has yet presented an example of a scenario where this imposed any practical 
limitation.

The fallback position would be that the status string and headers must not 
be CR or CRLF terminated.  But, I'd prefer to stick with a "no embedded 
control characters" approach, mainly to avoid situations where people embed 
'\n' and think that will be correct.
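For what it's worth, the "no embedded control characters" rule is cheap to enforce. The following is only a sketch of how a server or middleware might check it; the function name is invented, and treating tab and DEL as control characters is my reading of the rule, not wording from the PEP.

```python
def check_header_text(value):
    """Return `value` unchanged if it is 7-bit ASCII with no control
    characters; raise ValueError otherwise.

    Assumption: "control character" means code points 0-31 and 127.
    """
    for position, ch in enumerate(value):
        code = ord(ch)
        if code < 32 or code == 127:
            raise ValueError(
                "control character %r at position %d" % (ch, position))
        if code > 127:
            raise ValueError(
                "non-ASCII character %r at position %d" % (ch, position))
    return value


def is_valid(value):
    # Convenience wrapper so callers can test without try/except.
    try:
        check_header_text(value)
    except ValueError:
        return False
    return True
```

With this, 'is_valid("200 OK")' holds, while a status or header value ending in '\n' or '\r\n' is rejected up front instead of producing a malformed response.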

Here's what RFC 2616 has to say about TEXT, which is the format of the 
status message and of header values:

    The TEXT rule is only used for descriptive field contents and values
    that are not intended to be interpreted by the message parser. Words
    of *TEXT MAY contain characters from character sets other than ISO-
    8859-1 [22] only when encoded according to the rules of RFC 2047
    [14].

        TEXT           = <any OCTET except CTLs,
                         but including LWS>

    A CRLF is allowed in the definition of TEXT only as part of a header
    field continuation. It is expected that the folding LWS will be
    replaced with a single SP before interpretation of the TEXT value.

In other words, no control characters except for folding, and 7-bit ASCII 
with optional ISO-8859-1.  In practice, however, RFC 2047 allows for 
encoding ISO-8859-1 *in* 7-bit ASCII as well.  So, the only actual 
limitation being imposed by the PEP is on folding, and on the necessary 
encoding of non-ASCII characters.
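To show what "encoding ISO-8859-1 in 7-bit ASCII" looks like in practice, here is a small sketch using the standard library's 'email.header' module (present-day Python; the sample text is arbitrary).

```python
from email.header import Header, decode_header

text = "caf\u00e9"  # contains one non-ASCII ISO-8859-1 character

# Produce an RFC 2047 "encoded-word": pure ASCII on the wire.
encoded = Header(text, "iso-8859-1").encode()

# A recipient can recover the original text from the encoded-word.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)
```

The encoded form contains only ASCII characters, so it satisfies the PEP's restriction while still carrying the ISO-8859-1 text.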

Again, this is a practicality v. purity issue.  Are you aware of any 
applications that currently fold their headers, or transmit ISO-8859-1 
characters without using the encoding prescribed by RFC 2047?  Is there a 
practical use case for either one?

I'm willing to listen on this point, but as of the moment I find it hard to 
imagine what the use case for either of these features is.  By contrast, I 
do have very specific use cases in mind where supporting those features 
causes problems:

* Applications creating broken headers (e.g. with '\n' instead of '\r\n') 
or broken folds

* Applications mistakenly transmitting Unicode without considering encoding 
issues

* Middleware and servers forgetting to factor out folds when parsing data 
for interpretation

* In order to ensure safe interpretation, smart middleware and server 
developers will have to write routines to *unfold* potentially-folded 
headers; why not just disallow folding to begin with?
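The unfolding routine mentioned in the last point might look roughly like the sketch below. It follows the RFC 2616 text quoted earlier (a CRLF followed by spaces or tabs is replaced by a single SP); the function name is made up, and a real implementation would have to decide what to do with malformed folds.

```python
import re

# A fold is CRLF followed by at least one space or horizontal tab.
_FOLD = re.compile(r"\r\n[ \t]+")


def unfold(value):
    """Replace each folding LWS in a header value with a single space."""
    return _FOLD.sub(" ", value)


folded = "text/html;\r\n\t charset=utf-8"
unfolded = unfold(folded)

# A "fold" written with a bare '\n' is not a valid fold and is left
# alone, which is exactly the kind of broken input described above.
broken = unfold("text/html;\n charset=utf-8")
```

Every consumer that wants to interpret a header value would need to remember to call something like this first.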

>5.  The phrase about "if a server or gateway discards or overrides any
>application header for any reason, it must record this in a log"; that
>should be "should" instead of "must".  Otherwise you'll have your log
>cluttered with innocuous header re-write messages, and no way to turn
>that off.

How about "must provide the *option*" and "must be enabled by default"? Or, 
leave it as is, but add something like, "may provide the user with the 
option of suppressing this output, so that users who cannot fix a broken 
application are not forced to bear the pain of its error."
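Either wording amounts to something like the following sketch on the server side. Everything here is hypothetical: the function, the 'log_discards' switch, and the set of headers the server chooses to control are illustrations, not proposed spec text.

```python
import logging

log = logging.getLogger("wsgi.example")

# Hypothetical set of headers this particular server refuses to pass on.
SERVER_CONTROLLED = {"connection", "transfer-encoding"}


def filter_headers(headers, log_discards=True):
    """Drop server-controlled headers, logging each one unless the
    user has switched the logging off (it is on by default)."""
    kept = []
    for name, value in headers:
        if name.lower() in SERVER_CONTROLLED:
            if log_discards:
                log.warning("discarding application header %s: %s",
                            name, value)
            continue
        kept.append((name, value))
    return kept


app_headers = [("Content-Type", "text/plain"), ("Connection", "close")]
quiet = filter_headers(app_headers, log_discards=False)
```

The default keeps the "must log" behavior; the switch only exists for the user who is stuck with an application they can't fix.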

>6.  The "write()" callable is important; it should not be deprecated
>or in some other way made a poor stepchild of the iterable.

But it *is* one.  The presence of the 'write()' facility significantly 
increases the implementation complexity for middleware and server 
authors.  If it weren't necessary to support existing streaming APIs, it 
wouldn't exist.

Earlier drafts treated it as a peer, which led to people making bad 
assumptions about its proper use.  Making it a "poor stepchild" encourages 
people to investigate it only if they really need it, and only a very few 
applications actually need it.

>7.  If an application returns an iterable after calling write(), are
>the strings produced by iteration written after those written by calls
>to write?

Yes.  This is implicit in the way 'write()' and the iterable are defined, 
because the server must transmit a block yielded or passed to write() 
before returning control to the application.  The only way to meet this 
constraint is for them to occur in sequence.

However, the language should perhaps be clarified to be explicit about this 
point, and to address what happens if code *within* the iterator calls 
'write()'.  (I don't think it should be allowed to, but I'm open to 
arguments either way.)
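The ordering can be sketched with a toy server loop. This is not a real server (there is no socket, no header handling, and 'run' is an invented name); it only records blocks in the order a server would have to transmit them.

```python
def run(application, environ):
    """Toy server loop: collect every block in the order it is 'sent'."""
    sent = []

    def write(data):
        # A real server must transmit here, before returning control.
        sent.append(data)

    def start_response(status, headers):
        return write

    result = application(environ, start_response)
    try:
        for block in result:
            sent.append(block)
    finally:
        if hasattr(result, "close"):
            result.close()
    return sent


def app(environ, start_response):
    write = start_response("200 OK", [("Content-Type", "text/plain")])
    write(b"first ")
    write(b"second ")
    return [b"third"]


output = run(app, {})
```

Because each 'write()' call completes before the application returns its iterable, the written blocks necessarily precede the iterated ones.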

>8.  The note on Unicode: Unfortunately, Web standards like HTTP rely
>on using proper character sets.  By *not* using Unicode strings, and
>by *not* specifying the character set encoding of the "raw" byte
>strings, we open the door for disastrous misunderstandings.  The
>safest thing to do would be to require the framework to traffic in
>Unicode strings for things like header values, which the WSGI
>middleware would translate to or from the various required encodings
>used by the server and external protocols.  At least with Unicode
>strings you know what encoding is being used.

This seems at odds with your previous desire to use RFC 2616, which is 
pretty clear that it's ISO-8859-1 or RFC 2047.  PEP 333 goes further and 
says, it's ASCII, dammit, and use MIME header encodings (RFC 2047) if you 
need to do something special, because God help you if you're trying to mess 
with non-ASCII in HTTP headers and you don't know how to deal with that stuff.

Granted, that part could be more explicit in the PEP, so I'll work on that.  :)

(Maybe not this week; I expect to spend tomorrow putting hurricane panels 
on my house, just ahead of Frances' arrival...)

>A riskier, more error-prone option would be to require the byte
>strings to be in particular encodings.

That's actually what's required; it's merely implied by the PEP rather than 
explicitly stated.  But it's a fully RFC-compliant way to do it.

>The content strings, those written to the "write()" calls, or returned
>by the iterable, should in fact be byte vectors, exactly as they are
>currently specified.

Glad there was something you liked.  ;)  (j/k)

>9.  There should be a non-optional way of indicating the URL scheme,
>whether it is "http", "https", or "ftp".  I'd suggest "wsgi.scheme" in
>the environ.

I rather like this, although I don't at all see how FTP gets into 
this.  What the heck would CGI variables for FTP look like, I 
wonder?  Anyway, it's handy for "http" and "https" at the very least.  I'd 
prefer "wsgi.url_scheme" for the name, though, as it's otherwise a somewhat 
ambiguous name.
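As an example of why this is handy, an application could rebuild its own base URL along these lines. The sketch assumes the key ends up being called "wsgi.url_scheme" and otherwise uses ordinary CGI variables; 'base_url' is an invented helper and it ignores details such as non-default ports when there is no Host header.

```python
def base_url(environ):
    """Reconstruct scheme://host/script from the environ (simplified)."""
    scheme = environ["wsgi.url_scheme"]  # assumed key name
    host = environ.get("HTTP_HOST") or environ["SERVER_NAME"]
    return scheme + "://" + host + environ.get("SCRIPT_NAME", "")


secure = base_url({
    "wsgi.url_scheme": "https",
    "HTTP_HOST": "example.com",
    "SCRIPT_NAME": "/app",
})

fallback = base_url({
    "wsgi.url_scheme": "http",
    "SERVER_NAME": "localhost",
})
```

Without the scheme in the environ, the application would have to guess from the port number or from server-specific variables.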