[Web-SIG] WSGI 2

Tue Aug 4 14:44:34 CEST 2009

Ian, I know you have seen this before, but I didn't realise you hadn't
cc'd the list. I have added a new response to part 4 of what you
originally sent, which wasn't in my first reply that went directly to you.

2009/8/4 Ian Bicking <ianb at colorstudy.com>:
> On Mon, Aug 3, 2009 at 7:38 PM, Graham
> Dumpleton<graham.dumpleton at gmail.com> wrote:
>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>> what I was going to implement.
>>
>> 1. When running under Python 3, applications SHOULD produce bytes
>> output, status line and headers.
>
> Sure.
>
>> This is effectively what we had before. The only difference is that
>> clarify that the 'status line' values should also be bytes. This
>> wasn't noted before. I had already updated the proposed WSGI 1.0
>> amendments page to mention this.
>>
>> 2. When running under Python 3, servers and gateways MUST accept
>> strings for output, status line and headers. Such strings must be
>> converted to bytes output using 'latin-1'. If string cannot be
>> converted then is treated as an error.
>>
>> This is again what we had before except that mention 'status line' value.
>
> Sure.  ASCII for the status would be acceptable, as I believe that is
> an HTTP constraint.
>
>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>> binary (byte) input stream.
>>
>> No change here.
>
> Yep.
>
>> 4. When running under Python 3, servers MUST provide a text stream for
>> wsgi.errors. In converting this to a byte stream for writing to a
>> file, the default encoding would be applied.
>>
>> No real change here except to clarify that default encoding would
>> apply. Use of default encoding though could be problematic if
>> combining different WSGI components. This is because each WSGI
>> component may have been developed on system with different default
>> encoding and so one may expect to log characters that can't be written
>> on a different setup. Not sure how you could solve that except to say
>> people have default encoding be UTF-8 for portability.
>
> Sure.  We might specify that the server should never give an encoding
> error; it should use 'replace' or something to make sure it won't
> fail.  Maybe it should be specified what should happen when bytes are
> received.  I generally believe that error handling code should try to
> be as robust as possible, so it shouldn't fail regardless of what it
> is given.

Not that it matters, but it looks like for Apache/mod_wsgi
wsgi.errors should be an instance of io.TextIOWrapper wrapping an
internal mod_wsgi-specific buffer object that provides an interface
compatible with io.BufferedIOBase. If someone calls write() on the
wrapper with bytes, it will fail:

  TypeError: write() argument 1 must be str, not bytes

If someone uses print() to output the data, then the bytes are converted
okay, since print() applies str() to them first. That is:

  print(b'1234', file=environ['wsgi.errors'])

yields:

  b'1234'.
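
A minimal sketch of both behaviours, assuming an io.BytesIO as a
stand-in for the internal mod_wsgi buffer (which is not shown here) and
using `log` as a hypothetical name for what an application would see as
environ['wsgi.errors']:

```python
import io

# Stand-in for the server's internal byte buffer (assumption: BytesIO).
raw = io.BytesIO()
log = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

# write() only accepts str, so passing bytes raises TypeError.
try:
    log.write(b"1234")
    write_failed = False
except TypeError:
    write_failed = True

# print() calls str() on its argument, so the repr of the bytes is logged.
print(b"1234", file=log)
log.flush()

logged = raw.getvalue()
```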

If 'replace' is used for errors, you do end up with data loss. Use of
'xmlcharrefreplace' or 'backslashreplace' at least preserves the values
as numbers, although for Apache at least, if you use 'backslashreplace'
with the 'ascii' encoding, you get a bit of a mess as the backslashes
get escaped again.

\\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10

instead of original:

\u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10

That is because the Apache logging functions escape anything which isn't
printable ASCII, and in turn escape the backslash that denotes an escaped
character.

If you use an encoding of utf-8 instead, then the byte values get passed
through and the Apache logging functions just escape the non-printable
bytes instead, so overall it looks nicer.

\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
\xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90

So for Apache/mod_wsgi at least, the best thing to do seems to be to use
'replace' and 'utf-8', due to the way that the Apache error logging
functions work.
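
To make the difference between the error handlers concrete, here is a
small sketch using plain str.encode() on the first three characters of
the Thai sample above. It only shows what Python itself produces; the
extra escaping that Apache applies afterwards is not modelled.

```python
# First three characters of the Thai sample text used above.
text = "\u0e40\u0e2d\u0e01"

# 'replace' loses the data: every character becomes '?'.
as_replace = text.encode("ascii", "replace")

# 'xmlcharrefreplace' keeps the code points as decimal numbers.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# 'backslashreplace' keeps them as \uXXXX escapes, which a logger that
# escapes backslashes would then double up.
as_backslash = text.encode("ascii", "backslashreplace")

# With utf-8 nothing needs replacing; the raw bytes are passed through.
as_utf8 = text.encode("utf-8", "replace")
```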

I guess the point from this is that we possibly should specify that
wsgi.errors should be an instance of io.TextIOWrapper. A specific
implementation should not use 'strict', but use 'replace' or
'backslashreplace' as makes sense, depending on what encoding it needs
to use and how any underlying logging system it overlays works. The
overall intent is to preserve as much of the raw information as
possible.
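
As a rough sketch of what that could look like on the server side, the
following wraps a byte buffer so that writes never fail on unencodable
characters. The helper name make_error_stream is hypothetical, and
io.BytesIO again stands in for whatever buffer a real server would use.

```python
import io

def make_error_stream(buffer, encoding="utf-8"):
    # Hypothetical helper: a non-strict text wrapper for wsgi.errors.
    return io.TextIOWrapper(
        buffer,
        encoding=encoding,
        errors="backslashreplace",  # never raise on unencodable text
        newline="\n",
        line_buffering=True,
    )

buf = io.BytesIO()
stream = make_error_stream(buf, encoding="ascii")

# With 'strict' this write would raise UnicodeEncodeError.
stream.write("caf\u00e9 failed\n")
stream.flush()

result = buf.getvalue()
```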

>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>> server variables as strings. Where such values are sourced from a byte
>> string, be that a Python byte string or C string, they should be
>> converted as 'UTF-8'. If a specific web server infrastructure is able
>> to support different encodings, then the WSGI adapter MAY provide a
>> way for a user of the WSGI adapter to customise on a global basis, or
>> on a per value basis what encoding is used, but this is entirely
>> optional. Note that there is no requirement to deal with RFC 2047.
>
> Ugh.  This is where I'm not happy with how WSGI 1 in Python 3 has been
> treated.  I think it should be bytes, just like it is in Python 2.

I still don't understand what the practical, as opposed to theoretical,
use case for that is in Python 3. In Python 2, byte strings work out okay
because URL routing rules, by whatever means, are generally also
going to be defined in terms of byte strings. In Python 3, however,
routing is likely to default to being defined with strings and,
as such, any information like SCRIPT_NAME, PATH_INFO and QUERY_STRING
is going to have to be converted almost immediately from bytes to
strings to apply routing rules anyway.
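
As an illustration of that point, here is a toy sketch. The ROUTES
table and route() function are made up for this example and are not
taken from any framework; UTF-8 is assumed for the conversion.

```python
# Hypothetical routing table, written with ordinary Python 3 strings.
ROUTES = {"/": "index", "/about": "about"}

def route(path_info):
    # If the server hands over bytes, they must be decoded before the
    # string-keyed routing rules can be applied at all.
    if isinstance(path_info, bytes):
        path_info = path_info.decode("utf-8")
    return ROUTES.get(path_info, "not_found")
```

With a bytes PATH_INFO every application or router ends up repeating
that decode step itself; with a string PATH_INFO it is done once by the
server.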

Can you expand on what benefits come from bytes, and what practical use
case would predominate such that bytes would be the better
option?

> But if we have an encoding, I guess UTF8 is okay so long as it uses
> PEP 383: http://www.python.org/dev/peps/pep-0383/ -- for the most part
> PEP 383, and putting the encoding that was used into the environment,
> makes transcoding doable.  PEP 383 doesn't allow for transcoding
> unless you keep track of the encoding used, so we have to store that
> in the environment.

Again, what practical use cases are there where transcoding would be
necessary? That is especially so if it were a requirement that the WSGI
adapter/server at the lowest level, where it makes sense for that server
infrastructure (i.e., it can support something other than UTF-8),
provide an option to supply WSGI environ values, all or selected,
interpreted as a different encoding.
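
For reference, the kind of transcoding being talked about can be
sketched as follows, using the 'surrogateescape' error handler from
PEP 383. The sample value and the choice of latin-1 as the "real"
encoding are assumptions for illustration only.

```python
# Bytes as received from the client, actually latin-1 rather than UTF-8.
raw = "caf\u00e9".encode("latin-1")

# Server decodes as UTF-8 with PEP 383; the bad byte becomes a lone
# surrogate instead of raising an error.
as_str = raw.decode("utf-8", "surrogateescape")

# An application that knows UTF-8 was used can get the bytes back ...
recovered = as_str.encode("utf-8", "surrogateescape")

# ... and then decode them with the encoding it believes is correct.
transcoded = recovered.decode("latin-1")
```

Note that the round trip only works because the application knows the
server used UTF-8 in the first place.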

If the option is at the WSGI adapter/server level and managed at the
point of original translation from bytes, then a WSGI application or
middleware doesn't need to worry about it. As such, noting in the
environment what encoding was used serves no purpose except for
information. Marking what encoding was used also would not
necessarily be straightforward if the WSGI adapter/server provided a
way of overriding the encoding used for specific values, because a
single encoding indicator would not suffice.

To allow experimentation with the encoding of values, the current mod_wsgi
code allows overriding of values on a global or individual basis. This
is done via an Apache directive, but as I had to pass this information
from the main Apache worker process to the mod_wsgi daemon process, I did
it in such a way that it is also visible to the application for
information purposes at this point. I was using the following convention.

 # Override encoding for everything to UTF-8.
 mod_wsgi.variable_encoding: UTF-8

 # Override encoding and pass raw bytes for everything.
 mod_wsgi.variable_encoding: -

 # Override encoding of specific value to UTF-8.
 mod_wsgi.variable_encoding.SCRIPT_NAME: UTF-8

 # Override encoding and pass raw bytes for specific value.
 mod_wsgi.variable_encoding.SCRIPT_NAME: -

If the default encoding is used for everything, then no value is passed at all.
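
A sketch of how an application might read that convention back out of
the environ, purely for information. The helper name variable_encoding
is hypothetical, the UTF-8 default is taken from point 5 above, and it
is an assumption here that a per-value key takes precedence over the
global one.

```python
DEFAULT_ENCODING = "UTF-8"  # assumed default, as per point 5

def variable_encoding(environ, name):
    """Return the encoding used for a variable, or None for raw bytes."""
    # Assumption: a per-value override wins over the global override.
    value = environ.get("mod_wsgi.variable_encoding." + name)
    if value is None:
        value = environ.get("mod_wsgi.variable_encoding", DEFAULT_ENCODING)
    # '-' marks a value that was passed through as raw bytes.
    return None if value == "-" else value

example_environ = {
    "mod_wsgi.variable_encoding": "-",
    "mod_wsgi.variable_encoding.SCRIPT_NAME": "UTF-8",
}
```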

In respect of passing bytes for values, we get back to the argument from
past discussions as to what should be passed as bytes. Do you only do
SCRIPT_NAME, PATH_INFO and QUERY_STRING? What about server-specific
variables such as REQUEST_URI? What about headers such as Referer?
What about custom user values set using something like the SetEnv
directive in Apache?

This is where it started to turn into a can of worms last time. You
either treat everything as UTF-8 to be consistent, or use bytes for
everything, in which case a great deal more work is put onto WSGI
applications even for potentially simple stuff, effectively forcing
the use of high-level request wrappers like WebOb or the request object
in Werkzeug.

In summary, what are the practical use cases that would make passing
bytes, rather than UTF-8 or even latin-1 strings, worthwhile?

If passing bytes, what values should be passed as bytes and what left alone?

What practical use cases are there that would necessitate transcoding?

Some actual practical examples would very much help in this
discussion, as we tend to keep talking about theoretical
possibilities rather than actual practice.

Graham