[Web-SIG] Python 3.0 and WSGI 1.0.

Tue May 5 02:21:14 CEST 2009

2009/5/5 Armin Ronacher <armin.ronacher at active-4.com>:
> Hello everybody,
>
> I recently started looking at supporting Python 3 in one of my libraries
> (Werkzeug), mainly because the MoinMoin project, which uses the library, is
> considering a move to it.  Right now Werkzeug treats HTTP as Unicode aware, in
> the sense that everything that carries text data is encoded and decoded using
> a known encoding.
>
> This is partially against the specification and not entirely correct, but it
> works the best on modern browsers and is also what Django and Paste are doing.
>
> Basically, incoming request data is decoded with .decode(encoding) (usually
> utf-8) before being passed to the user code, and unicode data is encoded back
> into the same encoding before it's sent to the server.
>
> Now why is the current behavior of Python 3 a problem here?  The encode/decode
> hack from above is obviously a solution for these kinds of applications, albeit
> not a good one.  Interfaces like mod_wsgi already have the data as a bytestring
> and would decode it from latin1 just so that the application can encode it back
> and decode it as utf-8.  Not only is this slow, it also means that the code
> does not survive a run through 2to3.
>
> Now you could argue that the libraries were wrong in the first place and should
> support unicode strings that were encoded from latin1 and decoded, but it seems
> like very few libraries support that.
>
> Now which strings carry data that could contain non-ascii characters from a
> source with an unknown encoding?  Right now these are the following:
>
>  * PATH_INFO
>  * SCRIPT_NAME
>  * QUERY_STRING
>  * CONTENT_TYPE
>  * HTTP_*

Depending on the underlying web server that the WSGI adapter runs on,
there might also be:

  REQUEST_URI
  PATH_TRANSLATED (??)

Yes I know these aren't required for WSGI, except to the extent that
WSGI specification says:

  "A server or gateway should attempt to provide as many other CGI
variables as are applicable."

I would have to check the CGI specification, but there may be more.

The way I read this, then, is that keys are always strings and values
are strings, except for a specific list of entries whose values would
be bytes. I also presume that wsgi.url_scheme will have a string value.
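
As a rough sketch of that reading, and not a statement of what any
adapter does: the key list below is taken from the entries named in this
thread, while BYTES_KEYS, is_bytes_key and build_environ are made-up
names for illustration only.

```python
# Hypothetical sketch: which environ values stay bytes under this reading.
# BYTES_KEYS, is_bytes_key and build_environ are illustrative names only;
# they are not part of WSGI or of any adapter.
BYTES_KEYS = {
    "PATH_INFO", "SCRIPT_NAME", "QUERY_STRING", "CONTENT_TYPE",
    "REQUEST_URI", "PATH_TRANSLATED",
}


def is_bytes_key(key):
    """Return True if the value for this key would be passed as bytes."""
    return key in BYTES_KEYS or key.startswith("HTTP_")


def build_environ(raw):
    """Build an environ from a dict mapping str keys to raw bytes values."""
    environ = {}
    for key, value in raw.items():
        if is_bytes_key(key):
            environ[key] = value                    # left undecoded
        else:
            environ[key] = value.decode("latin-1")  # ordinary string
    return environ


environ = build_environ({
    "REQUEST_METHOD": b"GET",
    "PATH_INFO": b"/caf\xc3\xa9",
    "HTTP_COOKIE": b"name=\xc3\xa9",
    "wsgi.url_scheme": b"http",
})
```

The point of an explicit set is that the adapter never has to guess:
anything not listed comes through as a string.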

Where things get difficult for me with Apache is where users can use
SetEnv or mod_rewrite to define additional key/values to be added to
the WSGI environment. For example:

  SetEnv trac.env_path /some/path

I can't see that I have any choice but to pass such settings through
as strings, as anything else would more than likely cause problems for
applications. The problem is that it isn't clear what encoding can be
used in the Apache configuration. At the moment latin-1 is assumed.

Things get more complicated, though, when mod_rewrite is used, as the
values could be derived from components of the URL which are being
treated as bytes above. For example:

 RewriteCond %{THE_REQUEST} ^\ *([A-Z]+)\ *(.*)\ *(HTTP/.*)$
 RewriteRule . - [E=UNPARSED_URI:%2]

So, this creates a new UNPARSED_URI value which is the original URL as
it appeared in the request line. Strictly speaking, I can't know that
this should be bytes.

As such, I think all I can do is always pass additional values through
as strings, interpreted as latin-1. If some special case handling is
required, that would be up to the WSGI application. I am not too keen
on special configuration directives that allow the encoding, and/or
whether bytes are used, to be specified for each possible variable
being set.
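
For illustration, the special case handling on the application side
could be as small as the sketch below. It assumes the adapter decoded
the raw bytes as latin-1 and that the application knows the real
encoding was utf-8; recover_text is a hypothetical helper name.

```python
def recover_text(value, encoding="utf-8"):
    """Undo the adapter's latin-1 decoding, then decode as `encoding`.

    Hypothetical helper: assumes `value` was produced by decoding the
    original bytes as latin-1, which maps bytes 0-255 one-to-one onto
    code points and so loses nothing.
    """
    raw = value.encode("latin-1")
    return raw.decode(encoding)


# What the adapter would pass through for the raw bytes b"/caf\xc3\xa9":
from_adapter = b"/caf\xc3\xa9".decode("latin-1")

# The application recovers the intended text.
path = recover_text(from_adapter)
```

This is the same encode and decode round trip Armin objects to above,
so it only makes sense for the odd extra variable, not for every
request header.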

Anyway, this is special case stuff and, if done at all, is likely
going to be specific to Apache/mod_wsgi. If people want consistency,
they should just implement it as a WSGI middleware where they can,
rather than using mod_rewrite fiddles.
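
A minimal sketch of the middleware route, using the trac.env_path
setting from the SetEnv example: AddEnvironMiddleware and demo_app are
hypothetical names, and a real deployment would wrap its own
application object.

```python
class AddEnvironMiddleware:
    """Hypothetical middleware that adds fixed key/values to the environ.

    It does in Python what the SetEnv directive does in the Apache
    configuration, so the value types are under the application's control.
    """

    def __init__(self, app, extra):
        self.app = app
        self.extra = dict(extra)

    def __call__(self, environ, start_response):
        environ.update(self.extra)
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in application that echoes the injected setting."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["trac.env_path"].encode("utf-8")]


application = AddEnvironMiddleware(demo_app, {"trac.env_path": "/some/path"})
```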

Now, if we are going to start using bytes for request headers, there
is the other question of response data.

The original proposal in the amendments was that the application
should provide bytes, but that the WSGI adapter must accept either
bytes or strings, with strings interpreted as latin-1.

Is there sense in being more strict in this case?

In Python 2.X some WSGI adapters only allow Python 2.X strings (i.e.,
bytes) and reject unicode strings. Others will convert unicode
strings, but rather than use latin-1, they apply the default Python
encoding. Thus, there is no consistency.
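
To make the two options concrete, here is a sketch of how an adapter
might normalise response data under the original proposal, with a flag
for the stricter alternative. The function name and the strict
parameter are assumptions for illustration, not something from the
amendments.

```python
def to_response_bytes(chunk, strict=False):
    """Normalise one piece of response data to bytes.

    Hypothetical sketch. With strict=False, strings are accepted and
    interpreted as latin-1, as in the original proposal. With
    strict=True, only bytes are accepted.
    """
    if isinstance(chunk, bytes):
        return chunk
    if isinstance(chunk, str) and not strict:
        # Raises UnicodeEncodeError for characters outside latin-1.
        return chunk.encode("latin-1")
    raise TypeError("response data must be bytes, got %s"
                    % type(chunk).__name__)


lenient = to_response_bytes("caf\u00e9")
passthrough = to_response_bytes(b"raw")
```

Either way the behaviour is defined, which is already an improvement on
falling back to the default Python encoding.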

As to wsgi.file_wrapper, the only logical thing seems to be to require
the file object to return bytes, i.e., be in raw mode and not text mode.
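
One way an adapter could enforce that, sketched under the assumption
that a zero-length read is harmless for the file objects involved:
require_binary is a hypothetical helper, not part of wsgiref or
mod_wsgi.

```python
import io


def require_binary(fileobj):
    """Return fileobj unchanged, or raise TypeError if it yields text.

    Hypothetical check: a zero-length read consumes no data but still
    returns an empty value of the type the file object produces.
    """
    probe = fileobj.read(0)
    if not isinstance(probe, bytes):
        raise TypeError("wsgi.file_wrapper needs a file object that "
                        "returns bytes")
    return fileobj


binary_file = require_binary(io.BytesIO(b"payload"))
```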

Ultimately I am just implementing the WSGI adapter, so I'll follow
whatever is decided. Since I don't develop stuff that runs on it, I am
not in a position to know what is best. So, as long as it is clear
what should be passed through as bytes in the environment, i.e., there
is an all-inclusive list and I don't somehow have to guess, then I am
fine either way. I'd just like to see some decision, and for that
decision not to be some time next year, as I am holding up mod_wsgi 3.0
until things have been clarified. :-(

Graham

> Also all headers that carry non integer values (like HTTP_CONTENT_TYPE and
> CONTENT_TYPE).  Now it's true that the headers should not contain non latin1
> values but reality shows that they do.  Cookies are transmitted as headers as
> well and no browser complains if you put utf-8 encoded stuff into it.  It may be
> the case that for the browser this looks like latin1, but in the end the
> application decodes it from utf-8 and is happy.
>
> Data sent from the application can continue to work like it does currently.
> However, Django, Werkzeug, Paste and many others that support unicode output
> will just check whether the output is unicode and, if that's the case, encode
> it to the desired encoding.
>
> Also people abuse middlewares a lot and they deal with incoming and outgoing
> data as well.  One can expect these middlewares to work on known encodings as
> well so those would do the encode / decode dance too.
>
> If anyone knows the encoding of the environ, then it is the webserver.
> Apparently there are issues getting the encoding of the environ, but those
> won't go away when moving that to the web application.
>
> Because of that I propose that Python 3.1 ship a version of wsgiref that uses
> bytestrings for the headers in question, and that a section on Python 3
> compatibility based on that be added to PEP 333.
>
> I volunteer for writing a new section on Python 3 in PEP 333 :-)
>
>
> Regards,
> Armin