[Web-SIG] resources for porting wsgi apps from python 2 to 3

Tue Oct 2 14:38:27 CEST 2012

On 01/10/12 18:07, chris.dent at gmail.com wrote:
 >     * Use bytes or str for environ keys?
 >     * Use bytes or str for environ values?

str, decoded from the request bytes using ISO-8859-1.

 >       * Are all environ values created equal or would, for example,
 >         QUERY_STRING's value (prior to any parameter decoding)
 >         be handled differently from HTTP_COOKIE

All environ values are created equal (other than the CGI-mandated odd 
decoding behaviour of SCRIPT_NAME and PATH_INFO).

 >       * If str, I see that ISO-8859-1 is the assumed encoding. How much
 >         hurt occurs in the world if I just assume utf-8 when decoding to
 >         str[4]?

Immediately, all non-ASCII characters in the path would be interpreted 
incorrectly.

The more general hurt to the world would be that we would continue the 
sad pre-PEP3333 situation where every web server handles non-ASCII 
characters differently, and so no WSGI application can reliably use 
Unicode in path segments.
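
To make that concrete, here is a rough sketch of the round trip an
application would do to get UTF-8 text out of PATH_INFO, and of what
goes wrong when a server decodes as UTF-8 instead. The helper name
path_as_utf8 and the example path are made up for illustration; they
are not taken from the PEP.

```python
def path_as_utf8(environ):
    """Recover UTF-8 text from an ISO-8859-1 decoded PATH_INFO (sketch)."""
    raw = environ["PATH_INFO"].encode("iso-8859-1")  # the original bytes
    return raw.decode("utf-8")

# Bytes on the wire for the (hypothetical) path /caf<e-acute>.
wire_bytes = "/caf\u00e9".encode("utf-8")

# What a server following PEP 3333 would put in environ.
environ = {"PATH_INFO": wire_bytes.decode("iso-8859-1")}
recovered = path_as_utf8(environ)

# What a server that "just assumes utf-8" would put in environ.
utf8_environ = {"PATH_INFO": wire_bytes.decode("utf-8")}
try:
    path_as_utf8(utf8_environ)
    utf8_server_ok = True
except UnicodeError:
    # The same application code no longer works against this server.
    utf8_server_ok = False
```

The application cannot tell from the environ which of the two servers
it is running under, which is the portability problem described above.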

There is little impact on any header other than the path, because 
non-ASCII characters almost never appear in them. The query string 
remains %-encoded so any non-ASCII characters are safe. The other places 
users can put non-ASCII characters are in cookies and HTTP Basic 
Authorisation headers, but browser support here is so variable/broken 
that Python's handling would be the least of your worries.

 > [4] Which is what it should have been all along?

Not necessarily. Even if you decide that all web apps must use UTF-8 for 
text encoding, it's valid to have URL-encoded, non-text binary data in a 
path segment. This would be unrecoverable using straight UTF-8.

(It would be recoverable if surrogateescape were used, but PEP 3333 
has to encompass language versions that don't have surrogateescape, and 
also it's questionable whether it should be possible to smuggle 
non-UTF-8 data into strings that applications assume are safe.)
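
A small sketch of both points, assuming a made-up path segment that
carries raw binary after %-decoding: strict UTF-8 rejects it, while
surrogateescape round-trips it at the cost of producing a str that
cannot be encoded normally.

```python
# Hypothetical %-decoded path bytes containing non-text binary data.
raw_segment = b"/blob/\xff\xfe\x00\x80"

# Strict UTF-8 decoding fails on the invalid bytes.
try:
    raw_segment.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate, so the
# original bytes can be restored with the same error handler.
smuggled = raw_segment.decode("utf-8", "surrogateescape")
restored = smuggled.encode("utf-8", "surrogateescape")

# The resulting str is not clean text: a plain encode blows up later,
# possibly far away from where the data entered the application.
try:
    smuggled.encode("utf-8")
    looks_safe = True
except UnicodeEncodeError:
    looks_safe = False
```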

Plus header values are less likely to be UTF-8, and HTTP specifies that 
they're ISO-8859-1 (even if that is not well-observed by browsers).

Ideally, the interfaces should all be bytes, because HTTP is defined in 
terms of bytes. But that plays poorly with Python 3's default Unicode 
strs (for environ et al). So ISO-8859-1 was chosen as a str interface 
for which the original bytes can at least be recovered.
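
The recoverability is easy to check: ISO-8859-1 maps every byte value
to the code point with the same number, so nothing is lost either way.
A quick illustration:

```python
# Every possible byte value, decoded the way a WSGI server would.
all_bytes = bytes(range(256))
as_str = all_bytes.decode("iso-8859-1")

# Encoding with the same codec gives back exactly the original bytes.
round_tripped = as_str.encode("iso-8859-1")

# Each character's code point equals the byte it came from.
same_numbers = all(ord(ch) == b for ch, b in zip(as_str, all_bytes))
```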

 >     * Should start_response only accept bytes (and error if not), or
 >       should it also accept str and encode appropriately?

status and response_headers are, like the request headers, native str 
(to be ISO-8859-1 encoded). It's only the HTTP entity body that is 
always bytestring.
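
As a minimal sketch of that split (the app and the fake start_response
below are invented for the example; only setup_testing_defaults comes
from wsgiref):

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    body = "h\u00e9llo".encode("utf-8")   # entity body: always bytes
    status = "200 OK"                      # native str
    headers = [                            # native str names and values
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]

# Stand-in for the server side, recording what the app passed in.
captured = {}

def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers

environ = {}
setup_testing_defaults(environ)  # fills in a plausible minimal environ
chunks = list(app(environ, fake_start_response))
```

Calling the app directly like this skips the server entirely, so it
only shows the types involved, not how any particular server encodes
the status line and headers.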

 >     * Should the returned iterable be rejected or encoded if not bytes?

I don't think it's specified by the PEP, but wsgiref looks like it'll 
chuck TypeError when it tries to write str to the buffer/socket.
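
If you would rather not depend on what a given server happens to do,
one option is to check the chunks yourself before they reach it. This
is only a sketch of a hypothetical middleware (require_bytes is my own
name, not part of wsgiref or the PEP):

```python
def require_bytes(app):
    """Hypothetical middleware: refuse any body chunk that is not bytes."""
    def wrapper(environ, start_response):
        result = app(environ, start_response)
        try:
            for chunk in result:
                if not isinstance(chunk, bytes):
                    raise TypeError(
                        "WSGI body chunks must be bytes, got %s"
                        % type(chunk).__name__
                    )
                yield chunk
        finally:
            # Pass on close() if the wrapped iterable provides one.
            close = getattr(result, "close", None)
            if close is not None:
                close()
    return wrapper

def good_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

def bad_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return ["not bytes"]  # str in the body: should be rejected

def ignore_start_response(status, headers, exc_info=None):
    pass

good_chunks = list(require_bytes(good_app)({}, ignore_start_response))

try:
    list(require_bytes(bad_app)({}, ignore_start_response))
    rejected = False
except TypeError:
    rejected = True
```

Because the wrapper is a generator, the wrapped app is not called until
the server starts iterating; that is fine for a check like this but
worth knowing if you adapt it.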

cheers,

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobince at gmail.com