[Web-SIG] resources for porting wsgi apps from python 2 to 3
And Clover
and-py at doxdesk.com
Tue Oct 2 14:38:27 CEST 2012
On 01/10/12 18:07, chris.dent at gmail.com wrote:
> * Use bytes or str for environ keys?
> * Use bytes or str for environ values?
str, decoded from the request bytes using ISO-8859-1.
> * Are all environ values created equal or would, for example,
> QUERY_STRING's value (prior to any parameter decoding)
> be handled differently from HTTP_COOKIE
All environ values are created equal (other than the CGI-mandated odd
decoding behaviour of SCRIPT_NAME and PATH_INFO).
> * If str, I see that ISO-8859-1 is the assumed encoding. How much
> hurt occurs in the world if I just assume utf-8 when decoding to
> str[4]?
Immediately, all non-ASCII characters in the path would be interpreted
incorrectly.
The more general hurt to the world would be that we would continue the
sad pre-PEP3333 situation where every web server handles non-ASCII
characters differently, and so no WSGI application can reliably use
Unicode in path segments.
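
As a rough sketch of what this means on the application side: an app that wants Unicode path text can re-encode the ISO-8859-1 str back to the original bytes and then decode those bytes itself. The helper name path_info_text, the UTF-8 default and the "replace" error handling are assumptions made for illustration, not something PEP 3333 prescribes.

```python
def path_info_text(environ, encoding="utf-8"):
    """Hypothetical helper: return PATH_INFO as real text.

    The server hands us a str decoded as ISO-8859-1, so encoding it
    back with ISO-8859-1 gives the original request bytes.
    """
    raw = environ.get("PATH_INFO", "").encode("iso-8859-1")
    # Assumption: the app expects UTF-8 paths; undecodable bytes are
    # replaced with U+FFFD rather than raising.
    return raw.decode(encoding, "replace")


# A request for /caf%C3%A9 arrives in environ as the two-character
# mojibake sequence below, not as a single e-acute.
environ = {"PATH_INFO": "/caf\xc3\xa9"}
text = path_info_text(environ)
```

Because every server decodes the same way, the same helper should work regardless of which server the app is deployed on.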
There is little impact to any header other than the path, because
non-ASCII characters almost never appear in them. The query string
remains %-encoded so any non-ASCII characters are safe. The other places
users can put non-ASCII characters are in cookies and HTTP Basic
Authorisation headers, but browser support here is so variable/broken
that Python's handling would be the least of your worries.
> [4] Which is what it should have been all along?
Not necessarily. Even if you decide that all web apps must use UTF-8 for
text encoding, it's valid to have URL-encoded, non-text binary data in a
path segment. This would be unrecoverable using straight UTF-8.
(It would be recoverable if surrogateescape were used, but PEP 3333
has to encompass language versions that don't have surrogateescape, and
also it's questionable whether it should be possible to smuggle
non-UTF-8 data into strings that applications assume are safe.)
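
To make the difference concrete, here is a small sketch comparing strict UTF-8 decoding with surrogateescape on a path that carries binary data. The byte string is an invented example, standing in for what is left after %-decoding such a path.

```python
# Invented example: path bytes after %-decoding, containing non-UTF-8 data.
raw = b"/blob/\xff\xfe"

# Strict UTF-8 decoding cannot represent these bytes at all.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (U+DC80..U+DCFF), so the original bytes can be restored later.
smuggled = raw.decode("utf-8", "surrogateescape")
recovered = smuggled.encode("utf-8", "surrogateescape")

# The catch: the str now holds lone surrogates, which a plain UTF-8
# encode elsewhere in the application will refuse.
try:
    smuggled.encode("utf-8")
    plain_encode_ok = True
except UnicodeEncodeError:
    plain_encode_ok = False
```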
Plus header values are less likely to be UTF-8, and HTTP specifies that
they're ISO-8859-1 (even if that is not well-observed by browsers).
Ideally, the interfaces should all be bytes, because HTTP is defined in
terms of bytes. But that plays poorly with Python 3's default Unicode
strs (for environ et al). So ISO-8859-1 was chosen as a str interface
for which the original bytes can at least be recovered.
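
The recoverability claim is easy to check: ISO-8859-1 maps each of the 256 byte values to the code point with the same number, so decoding never fails and encoding gets the bytes back. A quick sketch:

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decoding as ISO-8859-1 never fails and yields one character per byte.
as_str = all_bytes.decode("iso-8859-1")

# Encoding with the same codec restores the original bytes exactly.
round_tripped = as_str.encode("iso-8859-1")
```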
> * Should start_response only accept bytes (and error if not), or
> should it also accept str and encode appropriately?
status and response_headers are, like the request headers, native str
(to be ISO-8859-1 encoded). It's only the HTTP entity body that is
always a bytestring.
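
A minimal sketch of an app following that split, with native str for status and headers and bytes for the body. The function name, the response text and the choice of UTF-8 for the body are just illustrative assumptions.

```python
def app(environ, start_response):
    # The entity body is bytes; the application picks the encoding
    # (UTF-8 here, as an assumption) and declares it in Content-Type.
    body = "h\u00e9llo\n".encode("utf-8")

    # Status and header names/values are native str; the server will
    # encode them as ISO-8859-1 when writing the response.
    status = "200 OK"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```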
> * Should the returned iterable be rejected or encoded if not bytes?
I don't think it's specified by the PEP, but wsgiref looks like it'll
chuck TypeError when it tries to write str to the buffer/socket.
cheers,
--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobince at gmail.com