[Web-SIG] WSGI 2

Tue Aug 4 17:12:50 CEST 2009

Graham Dumpleton wrote:
> Ian, know you have seen this before, but didn't realise you hadn't
> cc'd the list. I have added a new response to part 4 of what you
> originally sent that wasn't in first reply that went direct to you.
> 
> 2009/8/4 Ian Bicking <ianb at colorstudy.com>:
>> On Mon, Aug 3, 2009 at 7:38 PM, Graham
>> Dumpleton<graham.dumpleton at gmail.com> wrote:
>>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>>> what I was going to implement.
>>>
>>> 1. When running under Python 3, applications SHOULD produce bytes
>>> output, status line and headers.
>> Sure.
>>
>>> This is effectively what we had before. The only difference is to
>>> clarify that the 'status line' value should also be bytes. This
>>> wasn't noted before. I had already updated the proposed WSGI 1.0
>>> amendments page to mention this.
>>>
>>> 2. When running under Python 3, servers and gateways MUST accept
>>> strings for output, status line and headers. Such strings must be
>>> converted to bytes output using 'latin-1'. If a string cannot be
>>> converted then it is treated as an error.
>>>
>>> This is again what we had before, except that it mentions the 'status line' value.
>> Sure.  ASCII for the status would be acceptable, as I believe that is
>> an HTTP constraint.
>>
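For illustration, a minimal sketch of the conversion described in point 2, assuming a server-side helper (the name to_wire_bytes is hypothetical, not part of any WSGI server) that is applied to the status line, header values and body chunks alike:

```python
def to_wire_bytes(value):
    """Return bytes for a status line, header value or body chunk.

    Hypothetical helper: bytes pass through untouched, str is encoded
    as latin-1, anything else is rejected.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # Raises UnicodeEncodeError for code points above U+00FF,
        # which is the "treated as an error" case from point 2.
        return value.encode("latin-1")
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


status = to_wire_bytes("200 OK")
header = to_wire_bytes("caf\u00e9")
```

If the status line were restricted to ASCII as Ian suggests, only the encoding name used for that one value would change.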
>>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>>> binary (byte) input stream.
>>>
>>> No change here.
>> Yep.
>>
>>> 4. When running under Python 3, servers MUST provide a text stream for
>>> wsgi.errors. In converting this to a byte stream for writing to a
>>> file, the default encoding would be applied.
>>>
>>> No real change here except to clarify that default encoding would
>>> apply. Use of default encoding though could be problematic if
>>> combining different WSGI components. This is because each WSGI
>>> component may have been developed on system with different default
>>> encoding and so one may expect to log characters that can't be written
>>> on a different setup. Not sure how you could solve that except to say
>>> people have default encoding be UTF-8 for portability.
>> Sure.  We might specify that the server should never give an encoding
>> error; it should use 'replace' or something to make sure it won't
>> fail.  Maybe it should be specified what should happen when bytes are
>> received.  I generally believe that error handling code should try to
>> be as robust as possible, so it shouldn't fail regardless of what it
>> is given.
> 
> Not that it matters, but looks like that for Apache/mod_wsgi
> wsgi.errors should be an instance of io.TextIOWrapper wrapping
> internal mod_wsgi specific buffer object providing interface
> compatible with io.BufferedIOBase. If someone uses write() on wrapper
> with bytes it will fail:
> 
>   TypeError: write() argument 1 must be str, not bytes
> 
> If someone uses print() to output data, then bytes would be converted
> okay. That is:
> 
>   print(b'1234', file=environ['wsgi.errors'])
> 
> yields:
> 
>   b'1234'.
> 
> If 'replace' is used for errors, you do end up with data loss. Use of
> 'xmlcharrefreplace' at least preserves values as numbers, although for
> Apache at least, if use 'ascii' encoding, you get a bit of a mess as
> the backslashes get escaped again.
> 
> \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10
> 
> instead of original:
> 
> \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10
> 
> That is because Apache logging functions escape anything which isn't
> printable ASCII and in turn escapes backslash denoting escaped
> character.
> 
> If use encoding of utf-8 instead, then byte values get passed and
> Apache logging functions then just escape the non printable bytes
> instead so all up looks nicer.
> 
> \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
> \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90
> 
> So for Apache/mod_wsgi at least, best thing to do seems to use
> 'replace' and 'utf-8' due to way that Apache error logging functions
> work.
> 
> I guess the point from this is that possibly should specify that
> wsgi.errors should be an instance of io.TextIOWrapper. A specific
> implementation should not use 'strict', but use 'replace' or
> 'backslashreplace' as makes sense, dependent on what encoding it needs
> to use and how any underlying logging system it overlays works. The
> intent overall being to preserve as much of raw information as
> possible.
> 
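To make the trade-off between the error handlers concrete, here is a small sketch that wraps an in-memory byte buffer in io.TextIOWrapper, roughly as Graham describes for wsgi.errors. The io.BytesIO buffer stands in for the server's real log buffer, and the helper name log_text is made up for this example:

```python
import io


def log_text(text, encoding, errors):
    """Write text through a TextIOWrapper and return the bytes produced."""
    raw = io.BytesIO()  # stand-in for the server's internal log buffer
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                              newline="\n")
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # release the buffer without closing it
    return data


sample = "caf\u00e9"
replaced = log_text(sample, "ascii", "replace")            # information lost
escaped = log_text(sample, "ascii", "backslashreplace")    # value kept as escape
charref = log_text(sample, "ascii", "xmlcharrefreplace")   # value kept as number
utf8 = log_text(sample, "utf-8", "replace")                # raw UTF-8 bytes
```

Only the 'replace' handler with a narrow encoding throws the character away; the other combinations keep enough to recover it, which is the "preserve as much of raw information as possible" intent above.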
>>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>>> server variables as strings. Where such values are sourced from a byte
>>> string, be that a Python byte string or C string, they should be
>>> converted as 'UTF-8'. If a specific web server infrastructure is able
>>> to support different encodings, then the WSGI adapter MAY provide a
>>> way for a user of the WSGI adapter to customise on a global basis, or
>>> on a per value basis what encoding is used, but this is entirely
>>> optional. Note that there is no requirement to deal with RFC 2047.
>> Ugh.  This is where I'm not happy with how WSGI 1 in Python 3 has been
>> treated.  I think it should be bytes, just like it is in Python 2.
> 
> I still don't understand what is the practical, vs theoretical use
> case for that in Python 3. In Python 2 bytes strings work out okay
> because url routing rules through whatever means is generally also
> going to be defined in terms of byte strings. In Python 3 however,
> routing is going to likely default to being defined with strings and
> as such, any information like SCRIPT_NAME, PATH_INFO and QUERY_STRING
> are going to have to almost immediately be converted to strings from
> bytes to apply routing rules anyway.
> 
> Can you expand on what benefits come from using bytes and what
> practical use case would predominate that would mean that bytes would
> be the better option?
> 
>> But if we have an encoding, I guess UTF8 is okay so long as it uses
>> PEP 383: http://www.python.org/dev/peps/pep-0383/ -- for the most part
>> PEP 383, and putting the encoding that was used into the environment,
>> makes transcoding doable.  PEP 383 doesn't allow for transcoding
>> unless you keep track of the encoding used, so we have to store that
>> in the environment.
> 
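As a sketch of what Ian means by transcoding with PEP 383: the 'surrogateescape' error handler (available from Python 3.1) lets undecodable bytes survive a decode, so the original bytes can be recovered and decoded again, provided the encoding originally used is known. The function names below are illustrative only:

```python
def decode_environ_value(raw, encoding="utf-8"):
    """Decode as the server would, smuggling bad bytes as lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def transcode(value, used="utf-8", wanted="latin-1"):
    """Recover the original bytes and decode them with another encoding.

    'used' must be the encoding the server applied, which is why it
    would have to be recorded somewhere such as the environment.
    """
    return value.encode(used, "surrogateescape").decode(wanted)


raw_path = "/caf\u00e9".encode("latin-1")   # bytes as sent by the client
as_seen = decode_environ_value(raw_path)    # what the application receives
fixed = transcode(as_seen)                  # what the application wanted
```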
> Again, what practical use cases are there where transcoding would be
> necessary, especially if it was a requirement that the WSGI
> adapter/server at lowest level, if it makes sense for that server
> infrastructure, ie., can support something other than UTF-8, to
> provide an option to supply WSGI environ values, all or selected,
> interpreted as a different encoding?
> 
> If the option is at the WSGI adapter/server level and managed at the
> point of original translation from bytes, then a WSGI application or
> middleware doesn't need to worry about it. As such, noting what
> encoding was used in the environment serves no purpose except for
> information purposes. Marking what encoding was used also would not
> necessarily be straightforward if the WSGI adapter/server provided a
> way of overriding encoding used for specific values, because one value
> for encoding indicator would not suffice.
> 
> To allow experimentation with encoding of values, current mod_wsgi
> code allowed overriding of values on global or individual basis. This
> was done via an Apache directive, but as had to pass this information
> from main Apache worker process to mod_wsgi daemon process, did it in
> such a way that also visible to application for information purposes
> at this point. Was using convention as follows.
> 
>  # Override encoding for everything to UTF-8.
>  mod_wsgi.variable_encoding: UTF-8
> 
>  # Override encoding and pass raw bytes for everything.
>  mod_wsgi.variable_encoding: -
> 
>  # Override encoding of specific value to UTF-8.
>  mod_wsgi.variable_encoding.SCRIPT_NAME: UTF-8
> 
>  # Override encoding and pass raw bytes for specific value.
>  mod_wsgi.variable_encoding.SCRIPT_NAME: -
> 
> If default encoding used for everything, then no value passed at all.
> 
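A rough sketch of how the convention above could be interpreted, assuming the settings arrive as a plain dict and that the default is UTF-8 as in point 5. This is not mod_wsgi's actual code; the function name and the dict layout are assumptions made for the example:

```python
PREFIX = "mod_wsgi.variable_encoding"


def decode_variable(name, raw, settings, default="utf-8"):
    """Decode one CGI variable according to the override convention.

    A per-variable key wins over the global key; '-' means the raw
    bytes are passed through unchanged.
    """
    encoding = settings.get(PREFIX + "." + name,
                            settings.get(PREFIX, default))
    if encoding == "-":
        return raw
    return raw.decode(encoding)


settings = {PREFIX: "UTF-8", PREFIX + ".SCRIPT_NAME": "-"}
script_name = decode_variable("SCRIPT_NAME", b"/app", settings)
path_info = decode_variable("PATH_INFO", b"/caf\xc3\xa9", settings)
```

It also shows why a single "encoding used" marker would not suffice: SCRIPT_NAME and PATH_INFO end up with different treatments in the same request.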
> In respect of passing bytes for values, we get back to argument from
> past discussions as to what should be passed as bytes. Do you only do
> SCRIPT_NAME, PATH_INFO and QUERY_STRING? What about server specific
> variables such as REQUEST_URI? What about headers such as Referer?
> What about custom user values set using something like SetEnv
> directive in Apache?
> 
> This is where it started to turn into a can of worms last time. You
> either treat everything as UTF-8 to be consistent, or use bytes for
> everything, in which case a great deal more work is put onto WSGI
> applications even for potentially simple stuff, effectively forcing
> the use of high level request wrappers like WebOb or request object in
> Werkzeug.
> 
> In summary, what are the practical use cases that would make passing
> bytes over UTF-8 or even latin-1 worthwhile?
> 
> If passing bytes, what values should be passed as bytes and what left alone?
> 
> What practical use cases are there that would necessitate transcoding?

It's probably harder for newbies to understand transcoding and
converting bytes to strings and vice versa. I think that counts as a
practical use case, in that high-level frameworks would have to do some
wrapping around it, thus potentially making the WSGI spec significantly
harder to implement in derivative works. Thus, I'd recommend not making
WSGI 2 more obfuscated than necessary, unless supported by real-world
scenarios, as Graham suggested.

Hoping not to have leaked too much fuel on the fire.. ;)

Etienne

-- 
Etienne Robillard <robillard.etienne at gmail.com>
Green Tea Hackers Club <http://gthc.org/>
Blog: <http://gthc.org/blog/>
PGP Fingerprint: AED6 B33B B41D 5F4F A92A  2B71 874C FB27 F3A9 BDCC