[Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

Manlio Perillo manlio_perillo at libero.it
Thu Dec 3 21:15:06 CET 2009


And Clover wrote:
> Manlio Perillo wrote:
> 
>> However, what about URIs (that is, PATH_INFO and the like)?
>> For URIs (if I remember correctly) the suggested encoding is UTF-8, so
>> URLs should be decoded using
> 
>>   url.decode('utf-8', 'surrogateescape')
> 
>> Is this correct?
> 
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes
> to be trivially extracted. This is consistent with the other headers and
> would be my preferred approach.
> 
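To make the two options concrete, here is a minimal sketch for Python 3.1 or later (where the `surrogateescape` handler exists). The `raw` value is a made-up PATH_INFO, not taken from any real request. It only illustrates that both decodings let the original bytes be recovered; it does not say which one the spec should pick.

```python
# Hypothetical PATH_INFO bytes: valid UTF-8 ("é") followed by a stray 0xFF byte.
raw = b"/caf\xc3\xa9/\xff"

# ISO-8859-1 maps each byte to exactly one code point, so decoding never
# fails and encoding again returns the original bytes.
as_latin1 = raw.decode("iso-8859-1")
assert as_latin1.encode("iso-8859-1") == raw

# UTF-8 with surrogateescape also round-trips: the undecodable byte 0xFF
# becomes the lone surrogate U+DCFF and is restored on encoding.
as_utf8 = raw.decode("utf-8", "surrogateescape")
assert as_utf8.encode("utf-8", "surrogateescape") == raw

# An application that receives the ISO-8859-1 form can still get the
# UTF-8 interpretation by going back through the bytes.
recovered = as_latin1.encode("iso-8859-1").decode("utf-8", "surrogateescape")
assert recovered == as_utf8
```

The difference is in what the intermediate string looks like: the ISO-8859-1 form is mojibake until re-decoded, while the UTF-8 form is readable text apart from the escaped byte.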

There is something that I don't understand.

Some HTTP headers, like Accept-Language, contain data described as
`token`, where:

token          = 1*<any CHAR except CTLs or separators>

So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
In Python 3.x it SHOULD be a byte string.

Text content is described as `TEXT`, where:

The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].

    TEXT           = <any OCTET except CTLs,
                     but including LWS>


The only type of data where TEXT can be used is `quoted-string`.

A `quoted-string` appears only in well-specified portions of a header.
So, IMHO, it is *not* correct for WSGI middleware to return all HTTP
headers as Unicode strings.

This is up to the application/framework, which must parse each header,
split it into components, and handle each as appropriate (as a byte
string, a Unicode string, or an instance of some other data type).
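
As a rough sketch of what I mean (Python 3; the function names `parse_accept_language` and `unquote_text` are hypothetical, and the parsing is deliberately simplified, with no validation of malformed input): the tokens of Accept-Language stay as byte strings, and only the content of a `quoted-string` is turned into text.

```python
import re


def parse_accept_language(value: bytes):
    """Split a raw Accept-Language value into (language-range, q) pairs.

    The language ranges are tokens, so they are kept as byte strings.
    """
    result = []
    for item in value.split(b","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(b";")
        params = params.strip()
        q = 1.0  # default quality when no q parameter is given
        if params.startswith(b"q="):
            q = float(params[2:].decode("ascii"))
        result.append((lang.strip(), q))
    return result


def unquote_text(value: bytes) -> str:
    """Decode the content of a quoted-string as ISO-8859-1 text.

    Surrounding quotes are removed and quoted-pairs (backslash escapes)
    are resolved. RFC 2047 encoded words are left untouched.
    """
    if len(value) >= 2 and value[:1] == b'"' and value[-1:] == b'"':
        value = value[1:-1]
    value = re.sub(rb"\\(.)", rb"\1", value, flags=re.DOTALL)
    return value.decode("iso-8859-1")
```

For example, `parse_accept_language(b"da, en-gb;q=0.8, en;q=0.7")` gives a list of `(bytes, float)` pairs, while `unquote_text` is the only place where a decoding decision is made.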


> [...]


Regards   Manlio

