[Web-SIG] WSGI for Python 3

Sat Jul 17 08:06:07 CEST 2010

On Saturday, July 17, 2010, Gustavo Narea <me at gustavonarea.net> wrote:
> Hello,
>
> Ian said:
>> Having two ways of expressing the same information will lead to bugs
>> related to which data is canonical.  If an application is using
>> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
>> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
>> weird bugs and code will disagree about which one is correct.  Since %2f
>> can exist in the raw versions, there isn't even a way to chunk the two
>> variables in the same way.
>
> I can't agree more.
>
> I would propose the following, and excuse me in advance if this has already
> been proposed and discarded -- I've tried to follow this topic on the mailing
> list over the past few months, until it becomes an endless discussion.
>
> I think only the raw values should be available. Even if a middleware changes
> them, it must put them with raw values. And because you cannot change those
> values without knowing what encoding the request uses, the character encoding
> *must* be present.
>
> I know that sounds easy but it's not, because browsers don't specify the
> charset in the Content-Type and instead they generate a new request using the
> charset from the previous response. So the charset is unknown to the
> server/gateway and the middleware stack.
>
> So, what we could do is introduce a mandatory variable called, say,
> wsgi.charset, and would be used as follows:

Something like this was proposed before, but it only applied to the
keys that mattered, specifically PATH_INFO and maybe QUERY_STRING,
(the latter of which this discussion has been ignoring and I can't
remember how we worked out before it should be treated). It didn't
cover SCRIPT_NAME as as I indicated before, the encoding of that is
really dictated by the server and not the application for the initial
value at least.

The idea was that the server would pass them as Latin 1 and set the
encoding key. If a consumer of it didn't like the encoding it was in,
it would convert it back to bytes and then to what it wants and update
the encoding key to what it used. Thus you had a hint available to
allow reliable transcoding. This proposal didn't get acceptance
either.

Graham

>  - It MUST be set by the server or gateway on every request.
>  - Every middleware or application that reads or writes these values MUST use
> the charset specified in wsgi.charset.
>  - If a server, gateway, middleware or application wants to change the charset
> and it is possible*, it MUST convert the *entire* request into that charset
> and update wsgi.charset accordingly.
>  - When the charset is not specified in the HTTP request, UTF-8 MUST be
> assumed by the server/gateway. Unless another default charset has been
> specified by the user.
>
> I think/hope that will solve all the problems.
>
> What happens when a WSGI application is actually made up two WSGI applications
> and they send the responses in different charsets? If it's not possible to
> configure them so that they both use the same charsets, then one of them would
> have to be wrapped by a middleware which:
>  - On egress, converts the responses using the charset used by the other
> application.
>  - On ingress, if the charset is not specified in the request, it will assume
> it's the one used by the other application, and thus it will convert the
> request using the charset supported by the wrapped application.
>
> It would look like this:
> ===
> def application(environ, start_response):
>     if environ.startswith("/trac/"):
>         # Say Trac only supports Latin-1 and we want responses to use UTF-8:
>         app = trac.web.main.dispatch_request
>         app = CharsetNormalizer(app, response="latin-1", request="utf8")
>     else:
>         # myapp uses UTF-8
>         app = myapp
>     return app(environ, start_response)
> ===
>
> Then there's the string vs bytes issue. Bytes would be the natural choice to
> represent these raw values, but it would probably cause more trouble than they
> solve. So, I think they should be strings that contain the the ASCII raw
> encoded values (i.e., str on both versions of Python).
>
> What do you think about this? Again, sorry if this has been discarded before!
> :)
>
> * For example, you can always convert Latin-1 to UTF-8, but not every UTF-8
> string can be converted to Latin-1.
> --
> Gustavo Narea <xri://=Gustavo>.
> | Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |
> _______________________________________________
> Web-SIG mailing list
> Web-SIG at python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com
>