From ianb at colorstudy.com Mon Aug 3 23:32:36 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 3 Aug 2009 16:32:36 -0500
Subject: [Web-SIG] WSGI 2
Message-ID:

So... what about WSGI 2? Let's not completely drop the ball on this. I *think* we were largely in agreement; debate got distracted by some async stuff, but I don't think we particularly have to deal with that for WSGI 2. I think we do more than enough if we figure out: WSGI in Python 3, i.e., with unicode; some basic errata kind of stuff, like readline signature; change the callable signature to remove start_response.

Would this be a new PEP or a revision? I think it should be a new PEP, as WSGI 1 remains valid and the same as it always was, and PEP 333 describes that. Is there anyone willing to make the revisions?

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From pje at telecommunity.com Tue Aug 4 02:11:04 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Mon, 03 Aug 2009 20:11:04 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References:
Message-ID: <20090804001114.71D633A4093@sparrow.telecommunity.com>

At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:
>Would this be a new PEP or a revision? I think it should be a new
>PEP, as WSGI 1 remains valid and the same as it always was, and PEP
>333 describes that.

+1 for a new PEP, since we'd be able to drop a lot of crufty examples and explanations about the cruddy bits. wsgiref should add 1->2 and 2->1 adapters. (Although technically, running a WSGI 1 application in a WSGI 2 server requires either threads or greenlets.)

IMO, the main benefit of implementing WSGI 2 is to applications, not servers, with the possible exception of async servers (e.g. Twisted) that would prefer an iterator-only communications mode. Such servers could refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2->1 adapter.

Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a standard 1->2 adapter to support WSGI 2.
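As a rough illustration of the adapter idea in the message above, the sketch below lets a WSGI 1 server call an application written without start_response. The thread only says that start_response would be removed; the calling convention used here (the application takes `environ` and returns a `(status, headers, body)` tuple) is an assumption, and the names `adapt_wsgi2_app`, `hello2` and `hello1` are hypothetical. Nothing like this is claimed to exist in wsgiref.

```python
def adapt_wsgi2_app(app2):
    """Let a WSGI 1 server call a (hypothetical) WSGI 2 style application."""
    def app1(environ, start_response):
        # Assumed WSGI 2 convention: no start_response argument; the app
        # returns the status, the header list and the body iterable together.
        status, headers, body = app2(environ)
        start_response(status, headers)
        return body
    return app1


def hello2(environ):
    """Example application written against the assumed WSGI 2 signature."""
    body = b"Hello, world\n"
    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    return "200 OK", headers, [body]


# hello1 can be handed to any WSGI 1 server or gateway.
hello1 = adapt_wsgi2_app(hello2)
```

This covers only the direction that needs no extra machinery. Running a WSGI 1 application under a WSGI 2 server is the case the message says needs threads or greenlets, and it is not attempted here.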
From graham.dumpleton at gmail.com Tue Aug 4 02:38:54 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 4 Aug 2009 10:38:54 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References:
Message-ID: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>

2009/8/4 Ian Bicking:
> So... what about WSGI 2? Let's not completely drop the ball on this.
> I *think* we were largely in agreement; debate got distracted by some
> async stuff, but I don't think we particularly have to deal with that
> for WSGI 2. I think we do more than enough if we figure out: WSGI in
> Python 3, i.e., with unicode; some basic errata kind of stuff, like
> readline signature; change the callable signature to remove
> start_response.
>
> Would this be a new PEP or a revision? I think it should be a new
> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
> 333 describes that. Is there anyone willing to make the revisions?

But is the intention to skip straight to WSGI 2.0 for Python 3.0, with start_response() being eliminated, or are we going to provide amended WSGI 1.0 for Python 3.0? I can't see how we can avoid the latter and so we should focus on that first rather than more fundamental changes in WSGI 2.0.

In respect of WSGI 1.0 for Python 3.0, I have pretty well come to the conclusion that where we were heading before on that in one area is wrong. I was about to make changes to mod_wsgi in line with what I believe should be done and just release it without consultation, given that I couldn't see any discussion reaching any conclusion about it soon.

Since you have sent this email I will try one last time to get a resolution on WSGI 1.0 for Python 3.0. If I can't get one, I guess the choices are to release the change anyway and provide an incompatible implementation to what others are guessing should be done, or just rip all the code out and not support Python 3.0 at all.
Either seems entirely reasonable since there is no WSGI 1.0 specification for Python 3.0 and the issue again looks to be getting avoided by skipping to a discussion on WSGI 2.0 instead.

So, for WSGI 1.0 style of interface and Python 3.0, the following is what I was going to implement.

1. When running under Python 3, applications SHOULD produce bytes output, status line and headers.

This is effectively what we had before. The only difference is to clarify that the 'status line' values should also be bytes. This wasn't noted before. I had already updated the proposed WSGI 1.0 amendments page to mention this.

2. When running under Python 3, servers and gateways MUST accept strings for output, status line and headers. Such strings must be converted to bytes output using 'latin-1'. If a string cannot be converted then this is treated as an error.

This is again what we had before except that it mentions the 'status line' value.

3. When running under Python 3, servers MUST provide wsgi.input as a binary (byte) input stream.

No change here.

4. When running under Python 3, servers MUST provide a text stream for wsgi.errors. In converting this to a byte stream for writing to a file, the default encoding would be applied.

No real change here except to clarify that default encoding would apply. Use of default encoding though could be problematic if combining different WSGI components. This is because each WSGI component may have been developed on a system with a different default encoding and so one may expect to log characters that can't be written on a different setup. Not sure how you could solve that except to say people have default encoding be UTF-8 for portability.

5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'.
If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value basis what encoding is used, but this is entirely optional. Note that there is no requirement to deal with RFC 2047.

This is where I am going to diverge from what has been discussed before.

The reason I am going to pass as UTF-8 and not latin-1 is that it looks like Apache effectively only supports use of UTF-8. Since this means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and even CGI likely cannot handle anything besides UTF-8 then I really can't see the point of trying to cater for a theoretical possibility that some HTTP client could use something besides UTF-8. In other words, the predominant case will be UTF-8, so let us target that.

So, rather than burden every WSGI application with the need to convert from latin-1 back to bytes and then to UTF-8, let the server deal with it, with server using sensible default, and where server infrastructure can handle a different encoding, then it can provide option to use that encoding and WSGI application doesn't need to change.

Now, the reason why Apache can't really handle anything besides UTF-8 relates to how filenames are encoded in the file system.

Taking Windows first as it is the more obvious case. What Apache does there is take whatever path it has mapping to a script file, be it constructed partially from what is in Apache configuration and partially from what was supplied in URL from client, and converts it to UCS2 for passing to Windows file system routines. In converting to UCS2, Apache assumes that the path will be UTF-8. This means that the Apache configuration file has to be UTF-8 and that the URL as supplied by the client is UTF-8 as well after any URL character encoding is decoded. End result, can only handle UTF-8.
For UNIX systems, Apache doesn't do any conversions of the path, but passes it direct to file system routines. On a Linux system supporting UTF-8 file system paths, that path also needs to be UTF-8 and that again implies that Apache configuration is UTF-8 and client decoded URL used in matching resource is also UTF-8. Again, by association of all the moving parts, must all be UTF-8.

Now, what I am talking about here is the file system path constructed from file system location and some leading prefix of URL and which is used to match script file. So for URL, this is the SCRIPT_NAME part where it matches to a file system resource such as a script. Obviously there is going to be some amount of URL left over, i.e., PATH_INFO and QUERY_STRING. Also shown though that SCRIPT_NAME part has to be UTF-8 and we would really be entering fantasy land if you were somehow going to cope with some different encoding for PATH_INFO and QUERY_STRING. Instead it is like the GPL, viral in nature. Use of UTF-8 in one particular area means you are effectively bound to use UTF-8 everywhere else.

Further example of why UTF-8 reaches into everything is mod_rewrite module for Apache. This allows you to do stuff related to SCRIPT_NAME, PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache configuration file has to be UTF-8. If URL isn't, then wouldn't be possible to perform matches against non latin-1 characters in a rewrite condition or rule. This is because your match string would be in different encoded form to that in URL and so wouldn't match.

Now this is all for Apache. Unless they do strange stuff, I would expect that other web servers such as lighttpd, nginx and Cherokee would also have this UTF-8 dependence all through it. This would potentially leave only pure Python web servers that might be able to handle doing stuff as some other encoding.
But although that technically may be possible, should that, given that the number of people wanting to use a different encoding is likely to be small or nonexistent, dictate what should be done for everyone, especially if servers wanting to handle different encodings could provide a configuration option to allow it anyway and thus not burden the WSGI application?

In summary, just seems more sane to have stuff in WSGI environment be dealt with as UTF-8.

So, can we please address this rather than being distracted by WSGI 2.0. The same issue is going to have to be dealt with for WSGI 2.0 anyway, but working it out now means that we can at least deliver a WSGI 1.0 update for Python 3.0.

Graham

From graham.dumpleton at gmail.com Tue Aug 4 02:48:41 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 4 Aug 2009 10:48:41 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: <20090804001114.71D633A4093@sparrow.telecommunity.com>
References: <20090804001114.71D633A4093@sparrow.telecommunity.com>
Message-ID: <88e286470908031748n30fd7e7cgf902d445e90b59a1@mail.gmail.com>

2009/8/4 P.J. Eby:
> At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:
>>
>> Would this be a new PEP or a revision? I think it should be a new
>> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
>> 333 describes that.
>
> +1 for a new PEP, since we'd be able to drop a lot of crufty examples and
> explanations about the cruddy bits. wsgiref should add 1->2 and 2->1
> adapters. (Although technically, running a WSGI 1 application in a WSGI 2
> server requires either threads or greenlets.)
>
> IMO, the main benefit of implementing WSGI 2 is to applications, not
> servers, with the possible exception of async servers (e.g. Twisted) that
> would prefer an iterator-only communications mode. Such servers could
> refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2->1
> adapter.
>
> Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a
> standard 1->2 adapter to support WSGI 2.

Personally I don't believe we should be trying to support async servers in the WSGI specification. Leave it simple and cater for the predominant case rather than make it complicated just to support what is going to be a minority deployment. It was async servers that got the whole discussion derailed last time.

Leave input stream as is now as it is a known quantity and shown through actual use to work acceptably. Changing to an input iterator in my mind introduces too many unknowns around how input buffering is going to behave. In worst case you could really screw up performance because of a trickle of input coming into an application where there is no way for an application to control block size of what is read.

Let us find some other way of supporting async servers, but not by changing WSGI interface itself.

Graham

From mark.mchristensen at gmail.com Tue Aug 4 03:22:28 2009
From: mark.mchristensen at gmail.com (Mark Ramm)
Date: Mon, 3 Aug 2009 21:22:28 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID:

> In summary, just seems more sane to have stuff in WSGI environment be
> dealt with as UTF-8.

This sounds good to me. Rack, Jack, and even java servlets seem to make this assumption without significant trouble, and if nearly all existing web servers do it internally, that seems like an even better argument.
--Mark Ramm

From mark.mchristensen at gmail.com Tue Aug 4 03:23:15 2009
From: mark.mchristensen at gmail.com (Mark Ramm)
Date: Mon, 3 Aug 2009 21:23:15 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908031748n30fd7e7cgf902d445e90b59a1@mail.gmail.com>
References: <20090804001114.71D633A4093@sparrow.telecommunity.com> <88e286470908031748n30fd7e7cgf902d445e90b59a1@mail.gmail.com>
Message-ID:

> Personally I don't believe we should be trying to support async
> servers in the WSGI specification. Leave it simple and cater for the
> predominant case rather than make it complicated just to support what
> is going to be a minority deployment. It was async servers that got
> the whole discussion derailed last time. Leave input stream as is now
> as it is a known quantity and shown through actual use to work
> acceptably. Changing to an input iterator in my mind introduces too
> many unknowns around how input buffering is going to behave. In worst
> case you could really screw up performance because of a trickle of
> input coming into an application where no way for an application to
> control block size of what is read.

Yea, someone at work suggested that we should read from the input in a file-like way, and include a little chained file implementation in wsgiref, or just point to it in the spec, so people can read the first 1000 bytes off of the input, and then pass along what they read, plus the rest of the file in a way that's transparent to the underlying application.

Makes good sense to me, and I'm pretty sure I can find Rick's itertools.chain inspired chained file implementation.
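The chained-file idea described above might look roughly like the following. This is only a sketch of the technique, not the implementation the message refers to; the names `ChainedInput` and `peek_input` are hypothetical, only read() is covered (no readline or iteration), and it assumes the caller stays within CONTENT_LENGTH when peeking.

```python
import io


class ChainedInput:
    """Minimal file-like object that reads through several streams in order."""

    def __init__(self, *parts):
        self._parts = list(parts)

    def read(self, size=-1):
        chunks = []
        remaining = size
        while self._parts and (size < 0 or remaining > 0):
            data = self._parts[0].read(remaining if size >= 0 else -1)
            if not data:
                # Current stream is exhausted; move on to the next one.
                self._parts.pop(0)
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)


def peek_input(environ, n=1000):
    """Read up to n bytes of wsgi.input, then put them back transparently."""
    stream = environ["wsgi.input"]
    head = stream.read(n)
    # Downstream code sees the peeked bytes first, then the rest of the body.
    environ["wsgi.input"] = ChainedInput(io.BytesIO(head), stream)
    return head
```

A piece of middleware could call `peek_input(environ)` to inspect the start of the request body and then pass `environ` on unchanged in every other respect.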
--Mark

From graham.dumpleton at gmail.com Tue Aug 4 04:24:42 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 4 Aug 2009 12:24:42 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: <88e286470908031924u5ba31a81q7cd94259b187a86f@mail.gmail.com>

2009/8/4 Mark Ramm:
>> In summary, just seems more sane to have stuff in WSGI environment be
>> dealt with as UTF-8.
>
> This sounds good to me. Rack, Jack, and even java servlets seem to
> make this assumption without significant trouble, and if nearly all
> existing web servers do it internally, that's seems like an even
> better argument.

What do they do for response side though? Do they have the bytes/string distinction that we are talking about, with bytes expected but strings accepted only if representable as latin-1?

Graham

From pje at telecommunity.com Tue Aug 4 05:39:06 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Mon, 03 Aug 2009 23:39:06 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908031748n30fd7e7cgf902d445e90b59a1@mail.gmail.com>
References: <20090804001114.71D633A4093@sparrow.telecommunity.com> <88e286470908031748n30fd7e7cgf902d445e90b59a1@mail.gmail.com>
Message-ID: <20090804033912.A11DA3A4093@sparrow.telecommunity.com>

At 10:48 AM 8/4/2009 +1000, Graham Dumpleton wrote:
>2009/8/4 P.J. Eby :
> > At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:
> >>
> >> Would this be a new PEP or a revision? I think it should be a new
> >> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
> >> 333 describes that.
> >
> > +1 for a new PEP, since we'd be able to drop a lot of crufty examples and
> > explanations about the cruddy bits. wsgiref should add 1->2 and 2->1
> > adapters. (Although technically, running a WSGI 1 application in a WSGI 2
> > server requires either threads or greenlets.)
> >
> > IMO, the main benefit of implementing WSGI 2 is to applications, not
> > servers, with the possible exception of async servers (e.g. Twisted) that
> > would prefer an iterator-only communications mode. Such servers could
> > refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2->1
> > adapter.
> >
> > Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a
> > standard 1->2 adapter to support WSGI 2.
>
>Personally I don't believe we should be trying to support async
>servers in the WSGI specification.

I'm not suggesting adding anything for async servers; I'm just saying that they will likely prefer to use WSGI 2 and use a 2->1 adapter to do WSGI 1 support, whereas synchronous servers will likely prefer the reverse.

The WSGI spec doesn't currently require streaming upload support, so if an async server wants to buffer the input (e.g. to a temp file) rather than trusting the application to handle reads, it's free to do so. (And that's independent of whether it's WSGI 1 or 2 being used.)

From pje at telecommunity.com Tue Aug 4 06:00:44 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Tue, 04 Aug 2009 00:00:44 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: <20090804040049.E8C563A4093@sparrow.telecommunity.com>

At 10:38 AM 8/4/2009 +1000, Graham Dumpleton wrote:
>1. When running under Python 3, applications SHOULD produce bytes
>output, status line and headers.
>
>This is effectively what we had before. The only difference is that
>clarify that the 'status line' values should also be bytes. This
>wasn't noted before. I had already updated the proposed WSGI 1.0
>amendments page to mention this.

+1

>2. When running under Python 3, servers and gateways MUST accept
>strings for output, status line and headers. Such strings must be
>converted to bytes output using 'latin-1'.
>If string cannot be
>converted then is treated as an error.
>
>This is again what we had before except that mention 'status line' value.
>
>3. When running under Python 3, servers MUST provide wsgi.input as a
>binary (byte) input stream.
>
>No change here.
>
>4. When running under Python 3, servers MUST provide a text stream for
>wsgi.errors. In converting this to a byte stream for writing to a
>file, the default encoding would be applied.
>
>No real change here except to clarify that default encoding would
>apply. Use of default encoding though could be problematic if
>combining different WSGI components. This is because each WSGI
>component may have been developed on system with different default
>encoding and so one may expect to log characters that can't be written
>on a different setup. Not sure how you could solve that except to say
>people have default encoding be UTF-8 for portability.

Also +1.

>5. When running under Python 3, servers MUST provide CGI HTTP and
>server variables as strings. Where such values are sourced from a byte
>string, be that a Python byte string or C string, they should be
>converted as 'UTF-8'. If a specific web server infrastructure is able
>to support different encodings, then the WSGI adapter MAY provide a
>way for a user of the WSGI adapter to customise on a global basis, or
>on a per value basis what encoding is used, but this is entirely
>optional. Note that there is no requirement to deal with RFC 2047.
>
>This is where I am going to diverge from what has been discussed before.
>
>The reason I am going to pass as UTF-8 and not latin-1 is that it
>looks like Apache effectively only supports use of UTF-8. Since this
>means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
>even CGI likely cannot handle anything besides UTF-8 then I really
>can't see the point of trying to cater for a theoretical possibility
>that some HTTP client could use something besides UTF-8.
>In other
>words, the predominant case will be UTF-8, so let us target that.
>
>So, rather than burden every WSGI application with the need to convert
>from latin-1 back to bytes and then to UTF-8, let the server deal with
>it, with server using sensible default, and where server
>infrastructure can handle a different encoding, then it can provide
>option to use that encoding and WSGI application doesn't need to
>change.

Maybe I'm missing something here, but what if Apache receives something encoded in Latin-1? AFAIR, form POST encoding is determined by the encoding of the page containing the form; that's of course something that only happens in the input body, but what about URLs?

Mainly I'm wondering, what should the server do in the event they receive a byte string which is not valid UTF-8? (Latin-1 doesn't have this problem, since there's no such thing as an invalid Latin-1 string, at least not at the encoding level.)

>Also shown though that SCRIPT_NAME part has to be UTF-8
>and we would really be entering fantasy land if you were somehow going
>to cope with some different encoding for PATH_INFO and QUERY_STRING.
>Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>particular area means you are effectively bound to use UTF-8
>everywhere else.

I'm not clear on your logic here. If I request foo/bar/baz (where baz actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the script, then the (accented) baz is legitimate for pass-through to the application, no?

I just tried testing this with Firefox and Apache, and found that you can in fact pass such Latin-1 strings through to PATH_INFO, but at least in the case of Firefox, you have to %-escape them. However, they are seen by Python (via os.environ) as latin-1 encoded byte strings.

>Further example of why UTF-8 reaches into everything is mod_rewrite
>module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>PATH_INFO and QUERY_STRING parts of a URL.
>Already shown that Apache
>configuration file has to be UTF-8. If URL isn't, then wouldn't be
>possible to perform matches against non latin-1 characters in a
>rewrite condition or rule. This is because your match string would be
>in different encoded form to that in URL and so wouldn't match.

Note that this still doesn't have any impact on the bytes that actually reach the application, which can be non-UTF8. At minimum, the proposal is underspecified as to how to handle this case, which is as trivial to generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s) of a URL.

From graham.dumpleton at gmail.com Tue Aug 4 06:28:58 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 4 Aug 2009 14:28:58 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: <20090804040049.E8C563A4093@sparrow.telecommunity.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com>
Message-ID: <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com>

2009/8/4 P.J. Eby:
>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>> server variables as strings. Where such values are sourced from a byte
>> string, be that a Python byte string or C string, they should be
>> converted as 'UTF-8'. If a specific web server infrastructure is able
>> to support different encodings, then the WSGI adapter MAY provide a
>> way for a user of the WSGI adapter to customise on a global basis, or
>> on a per value basis what encoding is used, but this is entirely
>> optional. Note that there is no requirement to deal with RFC 2047.
>>
>> This is where I am going to diverge from what has been discussed before.
>>
>> The reason I am going to pass as UTF-8 and not latin-1 is that it
>> looks like Apache effectively only supports use of UTF-8.
>> Since this
>> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
>> even CGI likely cannot handle anything besides UTF-8 then I really
>> can't see the point of trying to cater for a theoretical possibility
>> that some HTTP client could use something besides UTF-8. In other
>> words, the predominant case will be UTF-8, so let us target that.
>>
>> So, rather than burden every WSGI application with the need to convert
>> from latin-1 back to bytes and then to UTF-8, let the server deal with
>> it, with server using sensible default, and where server
>> infrastructure can handle a different encoding, then it can provide
>> option to use that encoding and WSGI application doesn't need to
>> change.
>
> Maybe I'm missing something here, but what if Apache receives something
> encoded in Latin-1? AFAIR, form POST encoding is determined by the encoding
> of the page containing the form; that's of course something that only
> happens in the input body, but what about URLs?
>
> Mainly I'm wondering, what should the server do in the event they receive a
> byte string which is not valid UTF-8? (Latin-1 doesn't have this problem,
> since there's no such thing as an invalid Latin-1 string, at least not at
> the encoding level.)

Can you clarify? We aren't talking about request content here. The wsgi.input stream is still binary and it is up to the WSGI application to decide how it should be decoded.

The only related thing I can think you are talking about is the form target URL, which is an issue for GET and POST requests, or other method types, from a form.

>> Also shown though that SCRIPT_NAME part has to be UTF-8
>> and we would really be entering fantasy land if you were somehow going
>> to cope with some different encoding for PATH_INFO and QUERY_STRING.
>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>> particular area means you are effectively bound to use UTF-8
>> everywhere else.
>
> I'm not clear on your logic here.
> If I request foo/bar/baz (where baz
> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
> script, then the (accented) baz is legitimate for pass-through to the
> application, no?

Technically, but what I am pointing out is that Apache pretty well says that foo/bar needs to be UTF-8. If you are going to have different parts of the one URL needing a different encoding to be understood, personally I would say you are asking for trouble. So, am saying that UTF-8 needs to really apply more for sake of sanity and portability.

> I just tried testing this with Firefox and Apache, and found that you can in
> fact pass such Latin-1 strings through to PATH_INFO, but at least in the
> case of Firefox, you have to %-escape them. However, they are seen by
> Python (via os.environ) as latin-1 encoded byte strings.

By using % escapes you are in practice overriding the encoding that the browser may be applying to URL if given raw character? What happens if you were to paste the accented character direct into the browser URL bar? Browsers I have played with would normally automatically translate that as UTF-8 and send it as such, with % encoding as necessary.

So I guess the problem is more where URLs are already % encoded when coming back as href or form action because they may be in an encoding incompatible with UTF-8 if it were to be clicked on.

>> Further example of why UTF-8 reaches into everything is mod_rewrite
>> module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache
>> configuration file has to be UTF-8. If URL isn't, then wouldn't be
>> possible to perform matches against non latin-1 characters in a
>> rewrite condition or rule. This is because your match string would be
>> in different encoded form to that in URL and so wouldn't match.
>
> Note that this still doesn't have any impact on the bytes that actually
> reach the application, which can be non-UTF8.
> At minimum, the proposal is
> underspecified as to how to handle this case, which is as trivial to
> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s)
> of a URL.

The Apache server at least will decode those % escape sequences and I believe it is the result of that which is used in stuff like rewrite rule matches, not the raw URL. The only exception would be if a rewrite rule explicitly matched against the REQUEST_URI variable, which still contains % escape sequences. So if not in UTF-8, it effectively means that you can't match them with Apache rewrite rules.

Graham

From graham.dumpleton at gmail.com Tue Aug 4 14:44:34 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 4 Aug 2009 22:44:34 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com>

Ian, know you have seen this before, but didn't realise you hadn't cc'd the list. I have added a new response to part 4 of what you originally sent that wasn't in first reply that went direct to you.

2009/8/4 Ian Bicking:
> On Mon, Aug 3, 2009 at 7:38 PM, Graham
> Dumpleton wrote:
>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>> what I was going to implement.
>>
>> 1. When running under Python 3, applications SHOULD produce bytes
>> output, status line and headers.
>
> Sure.
>
>> This is effectively what we had before. The only difference is that
>> clarify that the 'status line' values should also be bytes. This
>> wasn't noted before. I had already updated the proposed WSGI 1.0
>> amendments page to mention this.
>>
>> 2. When running under Python 3, servers and gateways MUST accept
>> strings for output, status line and headers. Such strings must be
>> converted to bytes output using 'latin-1'. If string cannot be
>> converted then is treated as an error.
>>
>> This is again what we had before except that mention 'status line' value.
>
> Sure. ASCII for the status would be acceptable, as I believe that is
> an HTTP constraint.
>
>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>> binary (byte) input stream.
>>
>> No change here.
>
> Yep.
>
>> 4. When running under Python 3, servers MUST provide a text stream for
>> wsgi.errors. In converting this to a byte stream for writing to a
>> file, the default encoding would be applied.
>>
>> No real change here except to clarify that default encoding would
>> apply. Use of default encoding though could be problematic if
>> combining different WSGI components. This is because each WSGI
>> component may have been developed on system with different default
>> encoding and so one may expect to log characters that can't be written
>> on a different setup. Not sure how you could solve that except to say
>> people have default encoding be UTF-8 for portability.
>
> Sure. We might specify that the server should never give an encoding
> error; it should use 'replace' or something to make sure it won't
> fail. Maybe it should be specified what should happen when bytes are
> received. I generally believe that error handling code should try to
> be as robust as possible, so it shouldn't fail regardless of what it
> is given.

Not that it matters, but it looks like for Apache/mod_wsgi wsgi.errors should be an instance of io.TextIOWrapper wrapping an internal mod_wsgi specific buffer object providing an interface compatible with io.BufferedIOBase.

If someone uses write() on the wrapper with bytes it will fail:

  TypeError: write() argument 1 must be str, not bytes

If someone uses print() to output data, then bytes would be converted okay. That is:

  print(b'1234', file=environ['wsgi.errors'])

yields:

  b'1234'

If 'replace' is used for errors, you do end up with data loss.
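The three behaviours just described (write() rejecting bytes, print() logging the repr of bytes, and 'replace' losing data) can be reproduced outside any server with the standard io module. In this sketch an io.BytesIO stands in for the server's internal log buffer, which is an assumption for illustration, as are the names `make_error_stream`, `raw` and `errors_stream`; it is not how mod_wsgi itself is implemented.

```python
import io


def make_error_stream(raw, encoding="utf-8", errors="replace"):
    """Wrap a byte stream the way a server might build wsgi.errors."""
    # newline="\n" keeps line endings untranslated on every platform.
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                            newline="\n", line_buffering=True)


raw = io.BytesIO()  # stand-in for the server's byte-oriented log buffer
errors_stream = make_error_stream(raw, encoding="ascii", errors="replace")

try:
    errors_stream.write(b"1234")      # bytes are rejected by the text layer
    wrote_bytes = True
except TypeError:
    wrote_bytes = False

print(b"1234", file=errors_stream)    # print() logs the repr of the bytes
errors_stream.write("caf\u00e9\n")    # non-ASCII character is replaced by '?'
errors_stream.flush()

logged = raw.getvalue()
```

With the 'ascii' encoding chosen here the accented character is logged as a question mark, which is the data loss referred to above.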
Use of 'xmlcharrefreplace' at least preserves values as numbers, although for Apache at least, if use 'ascii' encoding, you get a bit of a mess as the backslashes get escaped again. \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10 instead of original: \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10 That is because Apache logging functions escape anything which isn't printable ASCII and in turn escapes backslash denoting escaped character. If use encoding of utf-8 instead, then byte values get passed and Apache logging functions then just escape the non printable bytes instead so all up looks nicer. \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90 So for Apache/mod_wsgi at least, best thing to do seems to use 'replace' and 'utf-8' due to way that Apache error logging functions work. I guess the point from this is that possibly should specify that wsgi.errors should be an instance of io.TextIOWrapper. A specific implementation should not use 'strict', but use 'replace' or 'backslashreplace' as makes sense, dependent on what encoding it needs to use and how any underlying logging system it overlays works. The intent overall being to preserve as much of raw information as possible. >> 5. When running under Python 3, servers MUST provide CGI HTTP and >> server variables as strings. Where such values are sourced from a byte >> string, be that a Python byte string or C string, they should be >> converted as 'UTF-8'. 
If a specific web server infrastructure is able >> to support different encodings, then the WSGI adapter MAY provide a >> way for a user of the WSGI adapter to customise on a global basis, or >> on a per value basis what encoding is used, but this is entirely >> optional. Note that there is no requirement to deal with RFC 2047. > > Ugh. This is where I'm not happy with how WSGI 1 in Python 3 has been > treated. I think it should be bytes, just like it is in Python 2. I still don't understand what is the practical, vs theoretical use case for that in Python 3. In Python 2 bytes strings work out okay because url routing rules through whatever means is generally also going to be defined in terms of byte strings. In Python 3 however, routing is going to likely default to being defined with strings and as such, any information like SCRIPT_NAME, PATH_INFO and QUERY_STRING are going to have to almost immediately be converted to strings from bytes to apply routing rules anyway. Can you expand on what benefits come from and what practical use case would predominate that would mean that bytes would be the better option? > But if we have an encoding, I guess UTF8 is okay so long as it uses > PEP 383: http://www.python.org/dev/peps/pep-0383/ -- for the most part > PEP 383, and putting the encoding that was used into the environment, > makes transcoding doable. PEP 383 doesn't allow for transcoding > unless you keep track of the encoding used, so we have to store that > in the environment. Again, what practical use cases are there where transcoding would be necessary, especially if it was a requirement that the WSGI adapter/server at lowest level, if it makes sense for that server infrastructure, ie., can support something other than UTF-8, to provide an option to supply WSGI environ values, all or selected, interpreted as a different encoding? 
If the option is at the WSGI adapter/server level and managed at the point of original translation from bytes, then a WSGI application or middleware doesn't need to worry about it. As such, noting what encoding was used in the environment serves no purpose except for information purposes. Marking what encoding was used also would not necessarily be straightforward if the WSGI adapter/server provided a way of overriding encoding used for specific values, because one value for encoding indicator would not suffice. To allow experimentation with encoding of values, current mod_wsgi code allowed overriding of values on global or individual basis. This was done via an Apache directive, but as had to pass this information from main Apache worker process to mod_wsgi daemon process, did it in such a way that also visible to application for information purposes at this point. Was using convention as follows.

# Override encoding for everything to UTF-8.
mod_wsgi.variable_encoding: UTF-8

# Override encoding and pass raw bytes for everything.
mod_wsgi.variable_encoding: -

# Override encoding of specific value to UTF-8.
mod_wsgi.variable_encoding.SCRIPT_NAME: UTF-8

# Override encoding and pass raw bytes for specific value.
mod_wsgi.variable_encoding.SCRIPT_NAME: -

If default encoding used for everything, then no value passed at all. In respect of passing bytes for values, we get back to argument from past discussions as to what should be passed as bytes. Do you only do SCRIPT_NAME, PATH_INFO and QUERY_STRING? What about server specific variables such as REQUEST_URI? What about headers such as Referrer? What about custom user values set using something like SetEnv directive in Apache? This is where it started to turn into a can of worms last time.
You either treat everything as UTF-8 to be consistent, or use bytes for everything, in which case a great deal more work is put onto WSGI applications even for potentially simple stuff, effectively forcing the use of high level request wrappers like WebOb or request object in Werkzeug. In summary, what are the practical use cases that would make passing bytes over UTF-8 or even latin-1 worthwhile? If passing bytes, what values should be passed as bytes and what left alone? What practical use cases are there that would necessitate transcoding? Some actual practical examples of stuff would very much help in this discussion as we tend to keep talking about what are theoretical possibilities rather than actual practice. Graham From robillard.etienne at gmail.com Tue Aug 4 17:12:50 2009 From: robillard.etienne at gmail.com (Etienne Robillard) Date: Tue, 04 Aug 2009 11:12:50 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> Message-ID: <4A784FF2.4020900@gmail.com> Graham Dumpleton wrote: > Ian, know you have seen this before, but didn't realise you hadn't > cc'd the list. I have added a new response to part 4 of what you > originally sent that wasn't in first reply that went direct to you. > > 2009/8/4 Ian Bicking : >> On Mon, Aug 3, 2009 at 7:38 PM, Graham >> Dumpleton wrote: >>> So, for WSGI 1.0 style of interface and Python 3.0, the following is >>> what I was going to implement. >>> >>> 1. When running under Python 3, applications SHOULD produce bytes >>> output, status line and headers. >> Sure. >> >>> This is effectively what we had before. The only difference is that >>> clarify that the 'status line' values should also be bytes. This >>> wasn't noted before. I had already updated the proposed WSGI 1.0 >>> amendments page to mention this. >>> >>> 2.
When running under Python 3, servers and gateways MUST accept >>> strings for output, status line and headers. Such strings must be >>> converted to bytes output using 'latin-1'. If string cannot be >>> converted then is treated as an error. >>> >>> This is again what we had before except that mention 'status line' value. >> Sure. ASCII for the status would be acceptable, as I believe that is >> an HTTP constraint. >> >>> 3. When running under Python 3, servers MUST provide wsgi.input as a >>> binary (byte) input stream. >>> >>> No change here. >> Yep. >> >>> 4. When running under Python 3, servers MUST provide a text stream for >>> wsgi.errors. In converting this to a byte stream for writing to a >>> file, the default encoding would be applied. >>> >>> No real change here except to clarify that default encoding would >>> apply. Use of default encoding though could be problematic if >>> combining different WSGI components. This is because each WSGI >>> component may have been developed on system with different default >>> encoding and so one may expect to log characters that can't be written >>> on a different setup. Not sure how you could solve that except to say >>> people have default encoding be UTF-8 for portability. >> Sure. We might specify that the server should never give an encoding >> error; it should use 'replace' or something to make sure it won't >> fail. Maybe it should be specified what should happen when bytes are >> received. I generally believe that error handling code should try to >> be as robust as possible, so it shouldn't fail regardless of what it >> is given. > > Not that it matters, but looks like that for Apache/mod_wsgi > wsgi.errors should be an instance of io.TextIOWrapper wrapping > internal mod_wsgi specific buffer object providing interface > compatible with io.BufferedIOBase. 
If someone uses write() on wrapper > with bytes it will fail: > > TypeError: write() argument 1 must be str, not bytes > > If someone use print() to output data, then bytes would be converted > okay. That is: > > print(b'1234', file=environ['wsgi.errors']) > > yields: > > b'1234'. > > If 'replace' is used for errors, you do end up with data loss. Use of > 'xmlcharrefreplace' at least preserves values as numbers, although for > Apache at least, if use 'ascii' encoding, you get a bit of a mess as > the backslashes get escaped again. > > \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10 > > instead of original: > > \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10 > > That is because Apache logging functions escape anything which isn't > printable ASCII and in turn escapes backslash denoting escaped > character. > > If use encoding of utf-8 instead, then byte values get passed and > Apache logging functions then just escape the non printable bytes > instead so all up looks nicer. > > \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c > \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90 > > So for Apache/mod_wsgi at least, best thing to do seems to use > 'replace' and 'utf-8' due to way that Apache error logging functions > work. > > I guess the point from this is that possibly should specify that > wsgi.errors should be an instance of io.TextIOWrapper. A specific > implementation should not use 'strict', but use 'replace' or > 'backslashreplace' as makes sense, dependent on what encoding it needs > to use and how any underlying logging system it overlays works. 
The > intent overall being to preserve as much of raw information as > possible. > >>> 5. When running under Python 3, servers MUST provide CGI HTTP and >>> server variables as strings. Where such values are sourced from a byte >>> string, be that a Python byte string or C string, they should be >>> converted as 'UTF-8'. If a specific web server infrastructure is able >>> to support different encodings, then the WSGI adapter MAY provide a >>> way for a user of the WSGI adapter to customise on a global basis, or >>> on a per value basis what encoding is used, but this is entirely >>> optional. Note that there is no requirement to deal with RFC 2047. >> Ugh. This is where I'm not happy with how WSGI 1 in Python 3 has been >> treated. I think it should be bytes, just like it is in Python 2. > > I still don't understand what is the practical, vs theoretical use > case for that in Python 3. In Python 2 bytes strings work out okay > because url routing rules through whatever means is generally also > going to be defined in terms of byte strings. In Python 3 however, > routing is going to likely default to being defined with strings and > as such, any information like SCRIPT_NAME, PATH_INFO and QUERY_STRING > are going to have to almost immediately be converted to strings from > bytes to apply routing rules anyway. > > Can you expand on what benefits come from and what practical use case > would predominate that would mean that bytes would be the better > option? > >> But if we have an encoding, I guess UTF8 is okay so long as it uses >> PEP 383: http://www.python.org/dev/peps/pep-0383/ -- for the most part >> PEP 383, and putting the encoding that was used into the environment, >> makes transcoding doable. PEP 383 doesn't allow for transcoding >> unless you keep track of the encoding used, so we have to store that >> in the environment. 
> > Again, what practical use cases are there where transcoding would be > necessary, especially if it was a requirement that the WSGI > adapter/server at lowest level, if it makes sense for that server > infrastructure, ie., can support something other than UTF-8, to > provide an option to supply WSGI environ values, all or selected, > interpreted as a different encoding? > > If the option is at the WSGI adapter/server level and managed at the > point of original translation from bytes, then a WSGI application or > middleware doesn't need to worry about it. As such, noting what > encoding was used in the environment serves no purpose except for > information purposes. Marking what encoding was used also would not > necessarily be straight forward if the WSGI adapter/server provided a > way of overriding encoding used for specific values, because one value > for encoding indicator would not suffice. > > To allow experimentation with encoding of values, current mod_wsgi > code allowed overriding of values on global or individual basis. This > was done via an Apache directive, but as had to pass this information > from main Apache worker process to mod_wsgi daemon process, did it in > such a way that also visible to application for information purposes > at this point. Was using convention as follows. > > # Override encoding for everything to UTF-8. > mod_wsgi.variable_encoding: UTF-8 > > # Override encoding and pass raw byes for everything. > mod_wsgi.variable_encoding: - > > # Override encoding of specific value to UTF-8. > mod_wsgi.variable_encoding.SCRIPT_NAME: UTF-8 > > # Override encoding and pass raw bytes for specific value. > mod_wsgi.variable_encoding.SCRIPT_NAME: - > > If default encoding used for everything, then no value passed at all. > > In respect of passing bytes for values, we get back to argument from > past discussions as to what should be passed as bytes. Do you only do > SCRIPT_NAME, PATH_INFO and QUERY_STRING? 
What about server specific > variables such as REQUEST_URI? What about headers such as Referrer? > What about custom user values set using something like SetEnv > directive in Apache? > > This is where it started to turn into a can of worms last time. You > either treat everything as UTF-8 to be consistent, or use bytes for > everything, in which case a great deal more work is put onto WSGI > applications even for potentially simple stuff, effectively forcing > the use of high level request wrappers like WebOb or request object in > Werkzeug. > > In summary, what are the practical uses cases that would make passing > bytes over UTF-8 or even latin-1 worthwhile? > > If passing bytes, what values should be passed as bytes and what left alone? > > What practical use cases are there that would necessitate transcoding? It's probably harder for newbies to understand transcoding, and converting bytes to string, and vice-versa. I think that counts as a practical use case so that high-level frameworks can do some wrapping around, thus potentially making the WSGI spec significantly harder to implement in derivative works. Thus, I would not recommend making WSGI 2 more obfuscated than necessary, unless supported by real-case scenarios as Graham suggested. Hoping not to have leaked too much fuel on the fire.. ;) Etienne -- Etienne Robillard Green Tea Hackers Club Blog: PGP Fingerprint: AED6 B33B B41D 5F4F A92A 2B71 874C FB27 F3A9 BDCC From pje at telecommunity.com Tue Aug 4 17:46:14 2009 From: pje at telecommunity.com (P.J.
Eby) Date: Tue, 04 Aug 2009 11:46:14 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com> <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> Message-ID: <20090804154654.D3AD93A4093@sparrow.telecommunity.com> At 02:28 PM 8/4/2009 +1000, Graham Dumpleton wrote: >2009/8/4 P.J. Eby : > > I'm not clear on your logic here. If I request foo/bar/baz (where baz > > actually has an accent over the 'a') in latin-1 encoding, and > foo/bar is the > > script, then the (accented) baz is legitimate for pass-through to the > > application, no? > >Technically, but what I am pointing out is that Apache pretty well >says that foo/bar needs to be UTF-8. Which doesn't change the fact that you haven't yet proposed what a WSGI server should *do* with such non-UTF8 bytes in PATH_INFO and QUERY_STRING. Apache can and does pass through such bytes, so the spec needs to say what we do with them. > If you are going to have >different parts of the one URL needing a different encoding to be >understood, personally I would say you asking for trouble. So, am >saying that UTF-8 needs to really apply more for sake of sanity and >portability. So what, precisely, are you proposing should happen when such bytes are present? >So I guess the problem is more where URLs are already % encoded when >coming back as href or form action because they may be in an encoding >incompatible with UTF-8 if it were to be clicked on. Yep, that's the case with "standard" browsers and servers; less-standard situations such as spiders and scripts generating or following URLs are also relevant, as are deliberate hack attempts. So having the result of this behavior be undefined is a bad thing.
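One possible way to give non-UTF-8 bytes a defined result is the PEP 383 'surrogateescape' error handler mentioned elsewhere in this thread. The snippet below is only a sketch of that option, assuming a Python 3 interpreter that implements PEP 383; it is not something the thread agreed on, and the variable names are illustrative. Undecodable bytes are smuggled through the string as lone surrogates and can be recovered exactly, after which the application may transcode if it knows the real encoding.

```python
# A path whose accented character is Latin-1 encoded, so the byte sequence
# is not valid UTF-8.
raw_path = b"/fran\xe7ais"

# Decoding with surrogateescape never fails; the stray byte 0xE7 becomes the
# lone surrogate U+DCE7.
decoded = raw_path.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
recovered = decoded.encode("utf-8", errors="surrogateescape")

# Transcoding is then possible, but only if the real encoding is known.
as_latin1 = recovered.decode("latin-1")
```

The round trip is lossless only as long as the same codec and error handler are used in both directions, which is why recording the encoding used would matter for transcoding.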
>The Apache server at least will decode those % escape sequence and I >believe it is the result of that which is used in stuff like rewrite >rule matches, not the raw URL. The only exception would be if rewrite >rule explicit matched against REQUEST_URI variable which still >contains % escape sequences. So if not in UTF-8, means effectively >that you can't then match them with Apache rewrite rules then. That's got nothing to do with what you propose for WSGI to do with the rest of it, though. (However, your belief may be incorrect in any event, as this page: http://www.dracos.co.uk/code/apache-rewrite-problem/ claims that mod_rewrite can RewriteCond on THE_REQUEST in order to match still-encoded paths.) From pje at telecommunity.com Tue Aug 4 18:05:13 2009 From: pje at telecommunity.com (P.J. Eby) Date: Tue, 04 Aug 2009 12:05:13 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> Message-ID: <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote: >In summary, what are the practical uses cases that would make passing >bytes over UTF-8 or even latin-1 worthwhile? My concern at this point is a nagging feeling that we are abandoning WSGI<->HTTP equivalence for convenience in the face of changes in Python's defaults. Had Python 3 been the standard version in existence when WSGI 1 was created, I would've argued for making *everything* bytes, in order to: 1. Force all encodings to be explicit, and 2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects) And this is why the original spec said that Unicode strings should be treated as bytes -- because byte strings were always the original target of the spec.
Please remember that WSGI is not primarily intended to provide application developers with a convenient API; its first and most important job is to ship the data around without mangling it in the process. HTTP moves bytes, therefore WSGI should move bytes. For practical reasons, it would be good to *also* support strings on the application side, especially for application migration. However, I see no reason to make *servers* provide decoded strings instead of bytes. So I would ask, what is the practical use case for having the server decode bytes into strings, instead of leaving them as bytes? From ubernostrum at gmail.com Tue Aug 4 18:38:05 2009 From: ubernostrum at gmail.com (James Bennett) Date: Tue, 4 Aug 2009 11:38:05 -0500 Subject: [Web-SIG] WSGI 2 In-Reply-To: <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> Message-ID: <21787a9f0908040938t36a6d3dbla7dcb05844c3ba99@mail.gmail.com> On Tue, Aug 4, 2009 at 11:05 AM, P.J. Eby wrote: > 1. Force all encodings to be explicit, and This can be handled without forcing application authors to work with bytestrings (or forcing them to remember to coerce to bytestrings before returning responses). > 2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python > objects) TBH, WSGI doesn't expose enough of HTTP's functionality to convince me that this is a good argument. When I can use advanced HTTP features (chunked transfer and friends) from a WSGI app, maybe I'll feel differently. > Please remember that WSGI is not primarily intended to provide application > developers with a convenient API; its first and most important job is to > ship the data around without mangling it in the process. Which it should try very hard to do without forcing *in*convenient APIs onto developers. 
> So I would ask, what is the practical use case for having the server decode > bytes into strings, instead of leaving them as bytes? Well, Django (for one example) already does some gymnastics to ensure that character encoding issues are kept at the request/response boundary, largely because it's an utter pain for an application developer to have an API dump a bunch of bytestrings in your lap and say "here, *you* figure it out". I suspect we're going to keep on doing that, since it's a big win in terms of usability for application developers (who end up having to deal with only a drastically-reduced subset of character-encoding problems). -- "Bureaucrat Conrad, you are technically correct -- the best kind of correct." From jim at zope.com Tue Aug 4 19:17:07 2009 From: jim at zope.com (Jim Fulton) Date: Tue, 4 Aug 2009 13:17:07 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> Message-ID: <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com> On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby wrote: > At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote: >> >> In summary, what are the practical uses cases that would make passing >> bytes over UTF-8 or even latin-1 worthwhile? > > My concern at this point is a nagging feeling that we are abandoning > WSGI<->HTTP equivalence for convenience in the face of changes in Python's > defaults. Had Python 3 been the standard version in existence when WSGI 1 > was created, I would've argued for making *everything* bytes, in order to: > > 1. Force all encodings to be explicit, and > 2.
Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python > objects) > > And this is why the original spec said that Unicode strings should be > treated as bytes -- because byte strings were always the original target of > the spec. > > Please remember that WSGI is not primarily intended to provide application > developers with a convenient API; its first and most important job is to > ship the data around without mangling it in the process. > > HTTP moves bytes, therefore WSGI should move bytes. For practical reasons, > it would be good to *also* support strings on the application side, > especially for application migration. However, I see no reason to make > *servers* provide decoded strings instead of bytes. +1 I haven't had enough time to follow this and earlier encoding discussions and so haven't commented up to now, but I've always been uncomfortable with WSGI using anything but bytes or assuming any encoding. I agree that application frameworks should deal with conversion between bytes and unicode. Jim -- Jim Fulton From foom at fuhm.net Tue Aug 4 18:54:44 2009 From: foom at fuhm.net (James Y Knight) Date: Tue, 4 Aug 2009 12:54:44 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <21787a9f0908040938t36a6d3dbla7dcb05844c3ba99@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <21787a9f0908040938t36a6d3dbla7dcb05844c3ba99@mail.gmail.com> Message-ID: On Aug 4, 2009, at 12:38 PM, James Bennett wrote: > TBH, WSGI doesn't expose enough of HTTP's functionality to convince me > that this is a good argument. When I can use advanced HTTP features > (chunked transfer and friends) from a WSGI app, maybe I'll feel > differently. But that works just fine today.
Your WSGI app sends streaming data back using the iterator functionality, and the server automatically turns it into chunks if it's talking to an HTTP 1.1 client. What's the problem? James From ianb at colorstudy.com Tue Aug 4 19:30:32 2009 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 4 Aug 2009 12:30:32 -0500 Subject: [Web-SIG] WSGI 2 In-Reply-To: <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com> <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> Message-ID: On Mon, Aug 3, 2009 at 11:28 PM, Graham Dumpleton wrote: >> Mainly I'm wondering, what should the server do in the event they receive a >> byte string which is not valid UTF-8? (Latin-1 doesn't have this problem, >> since there's no such thing as an invalid Latin-1 string, at least not at >> the encoding level.) > > Can you clarify. We aren't talking about request content here. The > wsgi.input stream is still binary and up to WSGI application to decode > how it decides it should be decoded. You could receive something like GET /fran%E7ais which if you do: urllib.unquote('/fran%E7ais').decode('utf8') you will get an error. So what should the server do? Obviously anyone at any time can embed such a URL in a document, and the browser is not going to try to figure out that encoding, it's just going to follow that URL. From my testing (in http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py) the browser will be consistent about UTF8 when it does the encoding itself; but it doesn't necessarily do the encoding itself. QUERY_STRING will *not* necessarily be UTF8, even when the path is UTF8 (but this doesn't matter for us, because QUERY_STRING doesn't get url-decoded, so it's just ASCII with %-encoding).
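The failure described above is easy to reproduce under Python 3, where the Python 2 call urllib.unquote(...).decode('utf8') corresponds roughly to urllib.parse.unquote_to_bytes(...).decode('utf-8'). The following is a small sketch; the helper name decode_path is hypothetical, and returning None on failure is just one illustrative policy, not a proposal from the thread.

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path, encoding="utf-8"):
    # Percent-decode to bytes first, then attempt a strict decode.
    # Returns None when the bytes are not valid in the given encoding.
    data = unquote_to_bytes(raw_path)
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


latin1_escaped = decode_path("/fran%E7ais")          # not valid UTF-8
utf8_escaped = decode_path("/fran%C3%A7ais")         # valid UTF-8
as_latin1 = decode_path("/fran%E7ais", "latin-1")    # Latin-1 never fails
```

The same %-escaped request line is therefore undecodable, or decodable to the intended text, depending purely on which encoding the server assumes.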
> The only related thing I can think you are talking about is the form > target URL, which is an issue for GET and POST requests, or other > method types, from a form. > >>> Also shown though that SCRIPT_NAME part has to be UTF-8 >>> and we would really be entering fantasy land if you were somehow going >>> to cope with some different encoding for PATH_INFO and QUERY_STRING. >>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one >>> particular area means you are effectively bound to use UTF-8 >>> everywhere else. >> >> I'm not clear on your logic here. If I request foo/bar/baz (where baz >> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the >> script, then the (accented) baz is legitimate for pass-through to the >> application, no? > > Technically, but what I am pointing out is that Apache pretty well > says that foo/bar needs to be UTF-8. If you are going to have > different parts of the one URL needing a different encoding to be > understood, personally I would say you asking for trouble. So, am > saying that UTF-8 needs to really apply more for sake of sanity and > portability. Apache's limitations can't be encoded into WSGI. Yes, it won't work with Apache (I guess, though with ProxyPass / or something, is this a problem?) -- but the idea of mapping request paths to files has nothing to do with WSGI. >> I just tried testing this with Firefox and Apache, and found that you can in >> fact pass such Latin-1 strings through to PATH_INFO, but at least in the >> case of Firefox, you have to %-escape them. However, they are seen by >> Python (via os.environ) as latin-1 encoded byte strings. > > By using % escapes you are in practice overriding the encoding that > the browser may be applying to URL if given raw character? What > happens if you were to paste the accented character direct into the > browser URL bar?
Browsers I have played with would normally > automatically translate that as UTF-8 and send it as such, with % > encoding as necessary. Correct; the browser encodes non-ASCII characters as UTF8, but does not try to inspect the encoding of already %-encoded characters. > So I guess the problem is more where URLs are already % encoded when > coming back as href or form action because they may be in an encoding > incompatible with UTF-8 if it were to be clicked on. > >>> Further example of why UTF-8 reaches into everything is mod_rewrite >>> module for Apache. This allows you to do stuff related to SCRIPT_NAME, >>> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache >>> configuration file has to be UTF-8. If URL isn't, then wouldn't be >>> possible to perform matches against non latin-1 characters in a >>> rewrite condition or rule. This is because your match string would be >>> in different encoded form to that in URL and so wouldn't match. >> >> Note that this still doesn't have any impact on the bytes that actually >> reach the application, which can be non-UTF8. At minimum, the proposal is >> underspecified as to how to handle this case, which is as trivial to >> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s) >> of a URL. > > The Apache server at least will decode those % escape sequence and I > believe it is the result of that which is used in stuff like rewrite > rule matches, not the raw URL. The only exception would be if rewrite > rule explicit matched against REQUEST_URI variable which still > contains % escape sequences. So if not in UTF-8, means effectively > that you can't then match them with Apache rewrite rules then.
From tseaver at palladion.com Tue Aug 4 19:41:52 2009
From: tseaver at palladion.com (Tres Seaver)
Date: Tue, 04 Aug 2009 13:41:52 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To: <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jim Fulton wrote:
> On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby wrote:
>> At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:
>>> In summary, what are the practical uses cases that would make passing
>>> bytes over UTF-8 or even latin-1 worthwhile?
>> My concern at this point is a nagging feeling that we are abandoning
>> WSGI<->HTTP equivalence for convenience in the face of changes in Python's
>> defaults. Had Python 3 been the standard version in existence when WSGI 1
>> was created, I would've argued for making *everything* bytes, in order to:
>>
>> 1. Force all encodings to be explicit, and
>> 2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python
>> objects)
>>
>> And this is why the original spec said that Unicode strings should be
>> treated as bytes -- because byte strings were always the original target of
>> the spec.
>>
>> Please remember that WSGI is not primarily intended to provide application
>> developers with a convenient API; its first and most important job is to
>> ship the data around without mangling it in the process.
>>
>> HTTP moves bytes, therefore WSGI should move bytes. For practical reasons,
>> it would be good to *also* support strings on the application side,
>> especially for application migration. However, I see no reason to make
>> *servers* provide decoded strings instead of bytes.
>
> +1
>
> I haven't had enough time to follow this and earlier encoding
> discussions and so haven't commented up to now, but I've always been
> uncomfortable with WSGI using anything but bytes or assuming any
> encoding. I agree that application frameworks should deal with
> conversion between bytes and unicode.

+1 from me as well. The fact that Python3 now calls 'string' what used to be 'unicode' doesn't change the fact that "transport-level" operations have to be done in bytes. It should be the framework / application's job to handle conversion of byte inputs from the request onto strings, and string response fields onto bytes: ideally, the framework will do this in a way which keeps the application writer blissfully ignorant of the distinction.

Note that I think Python3 gets the os.environ bit wrong for exactly the same reasons: I think anybody wanting to use the environment-as-provided-by-the-OS should deal in bytes (or whatever the OS provides), with a convenience wrapper for those who don't care about the difference.

I lost that argument, but that doesn't mean I was wrong. :)

Tres.
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFKeHLg+gerLs4ltQ4RAiFjAJ9uZIkfxwh5w1aYiEdIpr+2yQ+iBwCeJiFM eUfWBoPwyzwHThkMwd24SZE= =lod9 -----END PGP SIGNATURE----- From janssen at parc.com Tue Aug 4 19:30:01 2009 From: janssen at parc.com (Bill Janssen) Date: Tue, 4 Aug 2009 10:30:01 PDT Subject: [Web-SIG] WSGI 2 In-Reply-To: <20090804154654.D3AD93A4093@sparrow.telecommunity.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com> <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> <20090804154654.D3AD93A4093@sparrow.telecommunity.com> Message-ID: <59633.1249407001@parc.com> P.J. Eby wrote: > At 02:28 PM 8/4/2009 +1000, Graham Dumpleton wrote: > >2009/8/4 P.J. Eby : > > > I'm not clear on your logic here. If I request foo/bar/baz (where baz > > > actually has an accent over the 'a') in latin-1 encoding, and > > foo/bar is the > > > script, then the (accented) baz is legitimate for pass-through to the > > > application, no? > > > >Technically, but what I am pointing out is that Apache pretty well > >says that foo/bar needs to be UTF-8. > > Which doesn't change the fact that you haven't yet proposed what a > WSGI server should *do* with such non-UTF8 bytes in PATH_INFO and > QUERY_STRING. Apache can and does pass through such bytes, so the > spec needs to say what we do with them. Particularly QUERY_STRING. The original thinking around urlencoded was that it was always Latin-1. You were supposed to use "multipart/form-data" for non-Latin-1 encodings. Long thread on www-talk circa 1994 about this. I think bytes are the safest way to go here. 
It would be nice if we could automagically detect the correct encoding, but there's no foolproof way of doing that. Bill From ubernostrum at gmail.com Tue Aug 4 20:08:03 2009 From: ubernostrum at gmail.com (James Bennett) Date: Tue, 4 Aug 2009 13:08:03 -0500 Subject: [Web-SIG] WSGI 2 In-Reply-To: References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <21787a9f0908040938t36a6d3dbla7dcb05844c3ba99@mail.gmail.com> Message-ID: <21787a9f0908041108j7acf8e5drf45f39276911cfcd@mail.gmail.com> On Tue, Aug 4, 2009 at 11:54 AM, James Y Knight wrote: > But that works just fine today. Your WSGI app sends streaming data back > using the iterator functionality, and the server automatically turns it into > chunks if it's talking to an HTTP 1.1 client. What's the problem? No, it doesn't work just fine today. Either the server has to assume that every response from that application should be chunked (which is wrong), or the application needs a way to tell the server to chunk. Turns out HTTP has a way to indicate that, but WSGI outright forbids its use. So instead you have to invent out-of-band mechanisms for the application to tell the server what to do, and in the process reinvent part of HTTP. -- "Bureaucrat Conrad, you are technically correct -- the best kind of correct." 
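
A minimal sketch, not part of the original thread, of what the disagreement looks like from the application side. The application streams its body through a generator and sets neither Content-Length nor Transfer-Encoding, so framing is left entirely to the server. The names streaming_app and call_app are hypothetical; wsgiref.util.is_hop_by_hop is the standard-library check for the headers PEP 333 asks applications not to send.

```python
from wsgiref.util import is_hop_by_hop


def streaming_app(environ, start_response):
    # No Content-Length and no Transfer-Encoding: how the body is framed
    # on the wire (chunked, or close-delimited) is left to the server.
    start_response("200 OK", [("Content-Type", "text/plain")])

    def body():
        for i in range(3):
            yield ("part %d\n" % i).encode("ascii")

    return body()


def call_app(app):
    """Drive a WSGI app once with a toy environ and collect what it produced."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = list(app({"REQUEST_METHOD": "GET"}, start_response))
    return captured, chunks


captured, chunks = call_app(streaming_app)
# Any header in this list would be one the application should not have sent.
hop_by_hop_sent = [name for name, _ in captured["headers"] if is_hop_by_hop(name)]
```

Whether a server should then chunk every such response, or only some of them, is exactly the point in dispute; the sketch only shows that the application itself never names a transfer coding.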
From foom at fuhm.net Tue Aug 4 20:29:07 2009 From: foom at fuhm.net (James Y Knight) Date: Tue, 4 Aug 2009 14:29:07 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <21787a9f0908041108j7acf8e5drf45f39276911cfcd@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <21787a9f0908040938t36a6d3dbla7dcb05844c3ba99@mail.gmail.com> <21787a9f0908041108j7acf8e5drf45f39276911cfcd@mail.gmail.com> Message-ID: On Aug 4, 2009, at 2:08 PM, James Bennett wrote: > the server has to assume > that every response from that application should be chunked (which is > wrong) I'd expect the server to chunk every response from the application which is returned as an iterable instead of a list. Why do you say that's wrong? James From nd at perlig.de Tue Aug 4 20:34:56 2009 From: nd at perlig.de (=?iso-8859-1?q?Andr=E9_Malo?=) Date: Tue, 4 Aug 2009 20:34:56 +0200 Subject: [Web-SIG] WSGI 2 In-Reply-To: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> Message-ID: <200908042034.56658.nd@perlig.de> * Graham Dumpleton wrote: > Now, the reason why Apache can't really handle anything besides UTF-8 > relates to how filenames are encoded in the file system. > > Taking Windows first as it is the more obvious case. What Apache does > there is take whatever path it has mapping to a script file, be it > constructed partially from what is in Apache configuration and > partially from what was supplied in URL from client, and converts it > to UCS2 for passing to Windows file system routines. In converting to > UCS2, Apache assumes that the path will be UTF-8. This means that the > Apache configuration file has to be UTF-8 and that the URL as supplied > by the client is UTF-8 as well after any URL character encoding is > decoded. End result, can only handle UTF-8. 
This is the only platform where the apache does that, actually, because it doesn't work any other way on windows (everything is passed to the system as ucs-2). So I wouldn't call that "apache requires utf-8 everywhere". If I would care, I would even make it configurable on windows, but I don't ;) [...] nd From nd at perlig.de Tue Aug 4 20:38:15 2009 From: nd at perlig.de (=?iso-8859-1?q?Andr=E9_Malo?=) Date: Tue, 4 Aug 2009 20:38:15 +0200 Subject: [Web-SIG] WSGI 2 In-Reply-To: <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com> References: <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com> Message-ID: <200908042038.16006.nd@perlig.de> * Jim Fulton wrote: > On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby wrote: > > > > HTTP moves bytes, therefore WSGI should move bytes. ?For practical > > reasons, it would be good to *also* support strings on the application > > side, especially for application migration. ?However, I see no reason > > to make *servers* provide decoded strings instead of bytes. > > +1 > > I haven't had enough time to follow this and earlier encoding > discussions and so haven't commented up to now, but I've always been > uncomfortable with WSGI using anything but bytes or assuming any > encoding. I agree that application frameworks should deal with > conversion between bytes and unicode. Another +1 from the peanut gallery. 
nd

From fumanchu at aminus.org Tue Aug 4 21:10:09 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Tue, 4 Aug 2009 12:10:09 -0700
Subject: [Web-SIG] WSGI 2
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com><88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com><20090804160516.9BE7D3A411E@sparrow.telecommunity.com><21787a9f0908040938t36a6d3dbla7dcb05844c3ba99@mail.gmail.com> <21787a9f0908041108j7acf8e5drf45f39276911cfcd@mail.gmail.com>
Message-ID:

James Bennett wrote:
> On Tue, Aug 4, 2009 at 11:54 AM, James Y Knight wrote:
>> But that works just fine today. Your WSGI app sends streaming data back
>> using the iterator functionality, and the server automatically turns it into
>> chunks if it's talking to an HTTP 1.1 client. What's the problem?
>
> No, it doesn't work just fine today. Either the server has to assume
> that every response from that application should be chunked (which is
> wrong), or the application needs a way to tell the server to chunk.
> Turns out HTTP has a way to indicate that, but WSGI outright forbids
> its use. So instead you have to invent out-of-band mechanisms for the
> application to tell the server what to do, and in the process reinvent
> part of HTTP.

It doesn't have to be out of band; CherryPy's wsgiserver will send a response chunked if the application provides no Content-Length response header.

    if status == 413:
        # Request Entity Too Large. Close conn to avoid garbage.
        self.close_connection = True
    elif "content-length" not in hkeys:
        # "All 1xx (informational), 204 (no content),
        # and 304 (not modified) responses MUST NOT
        # include a message-body." So no point chunking.
        if status < 200 or status in (204, 205, 304):
            pass
        else:
            if self.response_protocol == 'HTTP/1.1':
                # Use the chunked transfer-coding
                self.chunked_write = True
                self.outheaders.append(("Transfer-Encoding", "chunked"))
            else:
                # Closing the conn is the only way to determine len.
                self.close_connection = True

Robert Brewer
fumanchu at aminus.org

From graham.dumpleton at gmail.com Wed Aug 5 02:53:14 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 5 Aug 2009 10:53:14 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: <20090804154654.D3AD93A4093@sparrow.telecommunity.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com> <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> <20090804154654.D3AD93A4093@sparrow.telecommunity.com>
Message-ID: <88e286470908041753j7c125227g8cb2387c7667ed52@mail.gmail.com>

2009/8/5 P.J. Eby :
> So what, precisely, are you proposing should happen when such bytes are
> present?

Treat me as a business manager who has read just enough IT magazines to be dangerous. As I have said before in prior discussions, my area is C coding and trying to implement a Python hosting solution for Apache, I do not know all the intricacies of HTTP and web application development as I don't write web applications. That is why I defer to you guys to come up with a workable specification. If I don't see anything sensible coming back in the way of a proposal, I will try and suggest my own, but because of my lack of knowledge it isn't necessarily going to be right.

In this respect, just pushing it back on me isn't particularly helpful from my perspective. If you think something is outright wrong and not going to work, then come back with an overall solution which is going to work. So far no one else has come back with an overall solution that works and everyone is happy with and I seem to be the only one truly interested in progressing this. As such it is really frustrating.
Now, the main reason why I am throwing around alternate suggestions in the first place is that last time although people seem to be comfortable moving along with the idea of latin-1 everywhere, I knew of some who weren't happy with that, some not on the list, and who believed it should be bytes, but they weren't speaking up. In this discussion people are being more vocal about bytes being the way to go and I am quite happy with that, we just need to flesh out the various problems from going that way.

So, let us put aside UTF-8 as a workable solution for Python and focus then on bytes instead. We also need to address other comments by people about whether status and headers values in response should come back as bytes or strings to allow predictability for WSGI middleware.

The questions around use of bytes in my mind are:

1. Should the values of all CGI variables be bytes or just a subset of them? If a subset, which ones? Note am presuming here the name of header, i.e., key, will be a string and only value will be bytes. Is that even a correct assumption?

2. How would use of bytes work for a CGI-WSGI bridge given that os.environ is not bytes? Where does one get what encoding was used for os.environ values so it can be converted back to bytes?

3. What are the rules about WSGI middleware in respect of preservation of values as bytes? I can see too easily that people will convert SCRIPT_NAME and PATH_INFO to string to do stuff with and change them and then not convert them back to bytes if environ is modified with new values. The rules would have to be clearly specified.

We then have the issues others have raised about response.

4. Should there be a choice about a WSGI application/middleware returning bytes or a string which is automatically converted to bytes per latin-1? If no choice, which should be required to be returned, bytes or strings?
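
A minimal sketch, not part of the original message, of what the "choice" in question 4 could look like on the server side: accept either bytes or a string for status and header values, and convert strings to bytes as Latin-1. The helper name to_wire_bytes is hypothetical, and this is only one possible reading of the option being discussed, not a proposed rule.

```python
def to_wire_bytes(value):
    """Return status/header data as bytes (hypothetical helper).

    bytes pass through untouched; str is encoded as Latin-1, which fails
    loudly for any character that has no single-byte representation.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # Raises UnicodeEncodeError for code points above U+00FF.
        return value.encode("latin-1")
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


# Latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
# so any byte string survives a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
```

The round trip at the end is why Latin-1 keeps coming up in this thread: it is the one codec under which a string can carry arbitrary bytes without loss.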
So, let's focus on these issues instead then and any others that people have in relation to bytes or how responses are returned and so explore that option.

Graham

From foom at fuhm.net Wed Aug 5 03:48:30 2009
From: foom at fuhm.net (James Y Knight)
Date: Tue, 4 Aug 2009 21:48:30 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908041753j7c125227g8cb2387c7667ed52@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com> <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> <20090804154654.D3AD93A4093@sparrow.telecommunity.com> <88e286470908041753j7c125227g8cb2387c7667ed52@mail.gmail.com>
Message-ID: <215B0C4D-16B1-4AC4-834F-F1C7CA03F201@fuhm.net>

On Aug 4, 2009, at 8:53 PM, Graham Dumpleton wrote:
> 2. How would use of bytes work for a CGI-WSGI bridge given that
> os.environ is not bytes? Where does one get what encoding was used for
> os.environ values so it can be converted back to bytes?

On Unix it's simple enough:

On py2.X on Unix: environ is bytes already.
On py3.0: you're screwed, because some env vars were discarded already.
On py3.1+: 'string'.encode(sys.getfilesystemencoding(), 'surrogateescape') should do it.

On Windows, I guess the OS environment is unicode, so, I don't know precisely what to do to reversibly obtain the bytes sent from the end-user's browser. It looks to me from source code as if Apache will encode the bytes from the client (utf-8 or otherwise!) as the Unicode values 0x00 to 0xFF in the windows environment, that is, as if decoding the client input in latin-1.
But it does that for the following keys only: HTTP_* SERVER_* REQUEST_* QUERY_STRING PATH_INFO PATH_TRANSLATED (from http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/arch/win32/mod_win32.c) Other values are decoded from utf-8 (or, if passed through from an enclosing environment, passed through untouched -- via encoding into utf-8 for internal use and then decoding back from utf-8 to put back in the Windows environment.) I'll note that while it's important to get this transformation correct for a CGI->WSGI bridge to work right in Windows, and thus is definitely a useful discussion to have here, it doesn't actually need to be part of the WSGI spec. James From pje at telecommunity.com Wed Aug 5 04:19:27 2009 From: pje at telecommunity.com (P.J. Eby) Date: Tue, 04 Aug 2009 22:19:27 -0400 Subject: [Web-SIG] WSGI 2 In-Reply-To: <88e286470908041753j7c125227g8cb2387c7667ed52@mail.gmail.co m> References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <20090804040049.E8C563A4093@sparrow.telecommunity.com> <88e286470908032128i3f0a3209h9d8959cf71ea89ba@mail.gmail.com> <20090804154654.D3AD93A4093@sparrow.telecommunity.com> <88e286470908041753j7c125227g8cb2387c7667ed52@mail.gmail.com> Message-ID: <20090805021932.13FAF3A4093@sparrow.telecommunity.com> At 10:53 AM 8/5/2009 +1000, Graham Dumpleton wrote: >Now, the main reason why I am throwing around alternate suggestions in >the first place is that last time although people seem to be >comfortable moving along with the idea of latin-1 everywhere, I knew >of some who weren't happy with that, some not on the list, and who >believed it should be bytes, but they weren't speaking up. I suspect that this was all a confusion to begin with; the primary function of Latin-1 in WSGI has been a way to represent bytes when all you have to represent them with is unicode strings. So, even when we've been talking Latin-1, what we really mean is bytes. 
;-)

In general, I think we want to require that servers must provide bytes, and accept both bytes and Latin-1 (maybe just ASCII?) strings.

(I don't see a problem with environ keys being strings, though, since all the WSGI or CGI-defined keys are pure ASCII anyway. But I could just as easily go with "bytes everywhere"; I assume Py3 treats all-ascii byte strings and the equivalent unicode as being equal and hashing alike.)

From randy at rcs-comp.com Wed Aug 5 21:45:48 2009
From: randy at rcs-comp.com (Randy Syring)
Date: Wed, 05 Aug 2009 15:45:48 -0400
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com>
Message-ID: <4A79E16C.7060304@rcs-comp.com>

Tres Seaver wrote:
>
> ideally, the
> framework will do this in a way which keeps the application writer
> blissfully ignorant of the distinction.

As an application developer, I would like to agree with the above. I am going to rely on a good framework to handle a lot of these issues. It seems that a lot of the discussion, while over my head, assumes that application developers are going to be working directly with WSGI. Technically, that is possible, but I think you should remember that most application developers are going to rely on a framework to give them a usable API.

My opinion, as an application developer, would be to keep WSGI as clean as possible and allow the frameworks to handle creating a good API that gives options for handling byte/character encoding issues. It's a lot easier to change/update a framework than a spec. Keep WSGI as simple as possible and let the frameworks manage the more complicated aspects of character encoding and clean APIs.

Just my $0.02.
--------------------------------------
Randy Syring
RCS Computers & Web Solutions
502-644-4776
http://www.rcs-comp.com

"Whether, then, you eat or drink or whatever you do, do all to the glory of God." 1 Cor 10:31

From wilk at flibuste.net Thu Aug 6 21:31:55 2009
From: wilk at flibuste.net (William Dode)
Date: Thu, 6 Aug 2009 19:31:55 +0000 (UTC)
Subject: [Web-SIG] WSGI 2
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com> <1099b90b0908041017n2b9d8314sa2554da8e27b24fb@mail.gmail.com>
Message-ID:

>> HTTP moves bytes, therefore WSGI should move bytes. For practical reasons,
>> it would be good to *also* support strings on the application side,
>> especially for application migration. However, I see no reason to make
>> *servers* provide decoded strings instead of bytes.

+1, because if (as is the case most of the time) an app decides to reject everything that is not UTF-8, that will be very easy. And if not, it will still be possible, especially for old applications where we cannot upgrade the server and the client at the same time.

--
William Dodé - http://flibuste.net
Informaticien Indépendant

From ubernostrum at gmail.com Tue Aug 11 04:11:36 2009
From: ubernostrum at gmail.com (James Bennett)
Date: Mon, 10 Aug 2009 21:11:36 -0500
Subject: [Web-SIG] PEP 333 and gzipping of responses
Message-ID: <21787a9f0908101911y5c15f011gfb410ce57e212b58@mail.gmail.com>

Earlier today I posted an article on my blog following up on some discussions of WSGI; one criticism presented was of language in PEP 333 regarding gzipping of responses by WSGI applications. Ian posted a comment which stated that the criticism was not correct, but I'm at a loss to figure out what *is* correct, so I'll bring up the question here.
In a parenthetical at the end of the section entitled "Handling the Content-Length Header", PEP 333 states: > Note: applications and middleware must not apply any kind of > Transfer-Encoding to their output, such as chunking or gzipping; as > "hop-by-hop" operations, these encodings are the province of the > actual web server/gateway. See Other HTTP Features below, for more > details. In the section "Other HTTP Features", PEP 333 states, in part: > However, because WSGI servers and applications do not communicate > via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to > WSGI internal communications. WSGI applications must not generate > any "hop-by-hop" headers [4], attempt to use HTTP features that > would require them to generate such headers, or rely on the content > of any incoming "hop-by-hop" headers in the environ dictionary. My criticism of this is that this is at best ambiguous, and quite possibly openly misleading to readers of the PEP. The ambiguity here is that "gzip" is a valid value for the Transfer-Encoding header in HTTP (RFC 2616, Sections 3.6 and 14.41), but is also a valid value for the Content-Encoding header (RFC 2616, Sections 3.5 and 14.11). Web frameworks and libraries (in many languages, not just Python) which support gzipping of responses all seem to opt for the latter method. Additionally, Apache's mod_deflate -- which so far as I know is overwhelmingly the most common mechanism for enabling gzipping at the server level -- also opts for this method, and uses the Content-Encoding header. Given this, gzipping of responses seems to be rather universally associated, in the minds of web developers, with the Content-Encoding header, which is not a "hop-by-hop" header (RFC 2616, Section 13.5.1). As such, the immediate (and misleading) impression given to readers of PEP 333 will likely be one of: 1. 
PEP 333 forbids applications using Content-Encoding to signal gzipped response bodies (since it mentions gzipping as something applications specifically must not do), or 2. PEP 333 is ambiguous or contradictory on account of mentioning Transfer-Encoding and "hop-by-hop" headers in a context in which no-one uses Transfer-Encoding or a "hop-by-hop" header, or 3. This text in PEP 333 is based upon a misunderstanding of this feature of HTTP or of its use in the real world. None of these seem particularly good, and this is why I took that section of the spec to task (albeit in a much briefer and more cursory fashion, since this message is already starting to run a bit long). If I'm misreading or misunderstanding either PEP 333 or RFC 2616, I'd appreciate it if someone would explain where I've gone astray. But as it stands, I believe the text of PEP 333 quoted above is problematic and likely to lead to confusion, and (if I'm not misreading or misunderstanding it) should probably be revised to address these concerns. -- "Bureaucrat Conrad, you are technically correct -- the best kind of correct." From ianb at colorstudy.com Tue Aug 11 06:50:22 2009 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 10 Aug 2009 23:50:22 -0500 Subject: [Web-SIG] PEP 333 and gzipping of responses In-Reply-To: <21787a9f0908101911y5c15f011gfb410ce57e212b58@mail.gmail.com> References: <21787a9f0908101911y5c15f011gfb410ce57e212b58@mail.gmail.com> Message-ID: On Mon, Aug 10, 2009 at 9:11 PM, James Bennett wrote: > Earlier today I posted an article on my blog following up on some > discussions of WSGI; one criticism presented was of language in PEP > 333 regarding gzipping of responses by WSGI applications. Ian posted a > comment which stated that the criticism was not correct, but I'm at a > loss to figure out what *is* correct, so I'll bring up the question > here. 
> > In a parenthetical at the end of the section entitled "Handling the > Content-Length Header", PEP 333 states: > > > Note: applications and middleware must not apply any kind of > > Transfer-Encoding to their output, such as chunking or gzipping; as > > "hop-by-hop" operations, these encodings are the province of the > > actual web server/gateway. See Other HTTP Features below, for more > > details. > > In the section "Other HTTP Features", PEP 333 states, in part: > > > However, because WSGI servers and applications do not communicate > > via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to > > WSGI internal communications. WSGI applications must not generate > > any "hop-by-hop" headers [4], attempt to use HTTP features that > > would require them to generate such headers, or rely on the content > > of any incoming "hop-by-hop" headers in the environ dictionary. > > My criticism of this is that this is at best ambiguous, and quite > possibly openly misleading to readers of the PEP. > > The ambiguity here is that "gzip" is a valid value for the > Transfer-Encoding header in HTTP (RFC 2616, Sections 3.6 and 14.41), > but is also a valid value for the Content-Encoding header (RFC 2616, > Sections 3.5 and 14.11). > I just don't get the confusion. Transfer-Encoding is not allowed in WSGI (a hop-by-hop header, like several other Transfer-* headers). Content-Encoding is allowed, because everything not specifically mentioned is allowed. Clearly "Content-Encoding" and "Transfer-Encoding" are different strings. And, as you mention, the normal thing that people currently do is use Content-Encoding anyway, so since people aren't using Transfer-Encoding, why is this controversial? There are some weird implications to using Content-Encoding, specifically ETags and range requests, but eh... those exist in mod_deflate and just about everywhere, and are mostly outside the scope of WSGI. 
-- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker -------------- next part -------------- An HTML attachment was scrubbed... URL: From foom at fuhm.net Tue Aug 11 07:22:22 2009 From: foom at fuhm.net (James Y Knight) Date: Tue, 11 Aug 2009 01:22:22 -0400 Subject: [Web-SIG] PEP 333 and gzipping of responses In-Reply-To: <21787a9f0908101911y5c15f011gfb410ce57e212b58@mail.gmail.com> References: <21787a9f0908101911y5c15f011gfb410ce57e212b58@mail.gmail.com> Message-ID: <8AF2E981-A5F6-4F51-BC58-A3F1244253DE@fuhm.net> On Aug 10, 2009, at 10:11 PM, James Bennett wrote: > Earlier today I posted an article on my blog following up on some > discussions of WSGI I find it a bit odd that you again claim WSGI doesn't support chunked transfers after that was thoroughly explained here, already. And add to that false claims about forbidding Content-Encoding, strange claims about its character support being insufficient....I'm getting the feeling that you don't actually understand HTTP. HTTP really *is* hard, but WSGI didn't screw it up, you just seem to misunderstand either what WSGI allows or else what is correct with regards to HTTP. I'd tend to agree with much of what you wrote in the last two sections of your post, but the first section is just completely confused and wrong. 
James From graham.dumpleton at gmail.com Tue Aug 11 07:54:25 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Tue, 11 Aug 2009 15:54:25 +1000 Subject: [Web-SIG] PEP 333 and gzipping of responses In-Reply-To: <8AF2E981-A5F6-4F51-BC58-A3F1244253DE@fuhm.net> References: <21787a9f0908101911y5c15f011gfb410ce57e212b58@mail.gmail.com> <8AF2E981-A5F6-4F51-BC58-A3F1244253DE@fuhm.net> Message-ID: <88e286470908102254q66868da8w3d5d7d552a5f0a13@mail.gmail.com> 2009/8/11 James Y Knight : > On Aug 10, 2009, at 10:11 PM, James Bennett wrote: >> >> Earlier today I posted an article on my blog following up on some >> discussions of WSGI > > I find it a bit odd that you again claim WSGI doesn't support chunked > transfers after that was thoroughly explained here, already. WSGI applications themselves shouldn't deal with chunked transfer encoding. In other words, for a response, a WSGI application should not format a response in chunked form as per HTTP specification. This doesn't though stop the underlying web server from doing that where no content length is supplied, but that is nothing to do with WSGI and a completely separate concern only relevant to the web server layer. In other words, out of scope of the WSGI specification. Robert has already indicated that web server underlying CherryPy WSGI server does this and I can say that Apache also does that, so mod_wsgi also by virtue of that can generate chunked response content, albeit that it isn't actually a feature of mod_wsgi. As for request content, it is also the concern of the underlying web server and not the WSGI application. That said, the way the WSGI specification is drafted makes it impossible for a WSGI application to handle a request which uses chunked content directly. This is because wsgi.input isn't required to use an empty string as end of input sentinel. This means one cannot just read until all request content is exhausted. 
Instead, it is required to rely on CONTENT_LENGTH to determine how much an application can actually read. With chunked request content though, there is no CONTENT_LENGTH. The WSGI specification follows CGI though and so if CONTENT_LENGTH is not supplied you are supposed to assume that CONTENT_LENGTH is 0. As such, there is no way to indicate that input can be present but is of unknown length and so chunked request content cannot be handled directly by a WSGI compliant application. In the web server that underlies CherryPy WSGI server, Robert tries to address this by reading in all input for chunked request up front and determining CONTENT_LENGTH before passing it to the WSGI application. This prohibits WSGI application from directly streaming request content and leads into issues about what to do if request content is large. If WSGI application is streaming it itself, it could determine that it should halt if finding more than it wants to deal with. By doing that in web server though, WSGI application doesn't have that level of control. In Apache/mod_wsgi, for <3.0 it will reject chunked requests outright. In 3.0+ you will be able to optionally specify a directive which will allow chunked request content, but you have to consciously step outside of bounds of WSGI and ignore CONTENT_LENGTH and instead read to end of input if you want to handle chunked request content. Thus, your application wouldn't be WSGI compliant. Some number of users accept this though, as it is the only way to handle uploads from some mobile phones, which use chunked request content for large uploads. This issue of there being no way to handle content of unknown length also means you cannot have mutating input filters. This means you cannot use compression on request content and use mod_deflate in Apache to uncompress it as the resulting content will normally be of different length to that specified by CONTENT_LENGTH, which will be the compressed length. 
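
A minimal sketch, not part of the original message, of the CONTENT_LENGTH rule described above and why it leaves no room for request bodies of unknown length. The helper name read_request_body and the toy environ dictionaries are hypothetical; only the standard io module is used.

```python
import io


def read_request_body(environ):
    """Read the request body the way a strictly compliant application can.

    A missing or empty CONTENT_LENGTH is treated as 0, following CGI, so
    nothing is read even if the input stream actually holds data.
    """
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    if length <= 0:
        return b""
    return environ["wsgi.input"].read(length)


# An ordinary request: the declared length bounds the read.
with_length = {"CONTENT_LENGTH": "5", "wsgi.input": io.BytesIO(b"hello world")}
# A de-chunked (or decompressed) body arrives with no usable CONTENT_LENGTH.
without_length = {"wsgi.input": io.BytesIO(b"hello world")}
```

With the second environ the application sees an empty body. Reading to end of input instead would recover the data, but that is the step outside the specification that the message describes.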
Now, I have described CherryPy WSGI server as being layered, i.e., web server and then WSGI adapter. I know that it may not be that clear cut and they are one and the same, but logically, there is a split, even if the code is much intertwined. I am sure Robert will correct me if my understanding is wrong. :-)

Graham

From henry at precheur.org Wed Aug 12 00:14:03 2009
From: henry at precheur.org (Henry Precheur)
Date: Tue, 11 Aug 2009 15:14:03 -0700
Subject: [Web-SIG] WSGI 2
In-Reply-To: <20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com> <20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
Message-ID: <20090811221403.GA18509@wrap.novuscom.net>

Using bytes for all `environ` values is easy to understand on the application side as long as you are aware of the encoding problem. The cost is inconvenience, but that's probably OK. It's also simpler to implement on the gateway/server side.

By choosing bytes, WSGI passes the encoding problem to the application, which is good. Let the application deal with that. It's more likely to know what it needs, and what problem it can ignore. I think that 99% of the time, applications will just decode bytes to string using UTF-8, ignoring invalid values.

However it's likely that we'll see middlewares converting ALL environment values to UTF-8, because it's more convenient than using bytes. And some middlewares might depend on `environ` values being string instead of bytes, because it's convenient too. This issue was already raised by Graham. And I think it's important to make it clear.

I believe that 'server/CGI' values in the environment shouldn't be modified -- of course it should still be possible to add new values. This way the stack will always remain in a 'sane' state.
For example, if a middleware wants to convert environ values to UTF-8, it shouldn't do this:

>   for key, value in environ.items():
>       environ[key] = str(value)

But something like this, assuming there are only bytes in `environ`:

>   environ['unicode.environ'] = dict((key, str(value, encoding='utf8'))
>                                     for key, value in environ.items())

I'm in favor of using bytes everywhere. But it's important to document why bytes are used and how to use them. I'm not sure this should be included in a PEP, maybe a "WSGI best practices"?

Cheers,

-- 
Henry Precheur

From graham.dumpleton at gmail.com  Wed Aug 12 01:25:21 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 12 Aug 2009 09:25:21 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: <20090811221403.GA18509@wrap.novuscom.net>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
	<88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com>
	<20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
	<20090811221403.GA18509@wrap.novuscom.net>
Message-ID: <88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>

2009/8/12 Henry Precheur :
> Using bytes for all `environ` values is easy to understand on the
> application side as long as you are aware of the encoding problem. The
> cost is inconvenience, but that's probably OK. It's also simpler to
> implement on the gateway/server side.

Use of bytes everywhere can be inconvenient on the gateway/server side, at least as far as the end result for the user.

The specific problem is that the WSGI environment is used to hold information about the original request, as CGI variables, but can also hold user specified custom variables.

In the case of anything hosted via Apache, such as through mod_wsgi, mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such custom variables using the SetEnv directive.
Thus one might say:

  SetEnv trac.env_path /usr/local/trac/site-1

If the rule is that everything in the WSGI environment coming from the WSGI adapter must be bytes, then you have a potential for mismatch in expectations of how values will be passed. That is, if set using SetEnv then it would be bytes, but if set using a WSGI middleware wrapper for configuration, it is more likely going to be a string. It would seem overly onerous to expect WSGI middleware to use bytes for configuration variables as well and so force all consumers to always be converting to string using an appropriate encoding, where the required encoding is potentially unknown.

The underlying problem here is in part, albeit maybe from convention, that there is a single dictionary for both request information and user configuration. It isn't a simple matter of splitting them either so that request information is always separate. This is because for FASTCGI, SCGI and CGI you can't split them, as there is only one grouping in those cases.

This is why I specifically asked previously, and no one has answered, which variables in the WSGI environment should be passed as bytes if bytes is to be used. If there is a known specified list of variables which will always be bytes, it may be more manageable. If someone is going to suggest that only CGI variables should be bytes, then what does that actually mean? Remember that for FASTCGI, SCGI and CGI there isn't really a distinction, and so the boundary as to what is a CGI variable is fuzzy, although you could reverse the transformation and get back bytes if you know which variables to do it for.

One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and QUERY_STRING and maybe that will suffice. It may not though, because what about headers such as HTTP_REFERER? Also, what about additional SSL_? variables that an SSL module for the web server may add?

Graham

From ianb at colorstudy.com  Wed Aug 12 02:18:09 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 11 Aug 2009 19:18:09 -0500
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
	<88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com>
	<20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
	<20090811221403.GA18509@wrap.novuscom.net>
	<88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>
Message-ID: 

On Tue, Aug 11, 2009 at 6:25 PM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:

> 2009/8/12 Henry Precheur :
> > Using bytes for all `environ` values is easy to understand on the
> > application side as long as you are aware of the encoding problem. The
> > cost is inconvenience, but that's probably OK. It's also simpler to
> > implement on the gateway/server side.
>
> Use of bytes everywhere can be inconvenient on the gateway/server
> side, at least as far as end result for user.
>
> The specific problem is that WSGI environment is used to hold
> information about the original request, as CGI variables, but also can
> hold user specified custom variables.
>
> In the case of anything hosted via Apache, such as through mod_wsgi,
> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
> custom variables using the SetEnv directive. Thus one might say:
>
>  SetEnv trac.env_path /usr/local/trac/site-1
>

Just to clarify, there are specifically no type restrictions on extension variables, meaning any variable with a "." in it. The type restrictions are solely for ALL_CAPS keys. You can put ints or unicode or whatever in other variables.
(Probably this doesn't make things any easier for mod_wsgi, though; at least for this example)

-- 
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From graham.dumpleton at gmail.com  Wed Aug 12 02:40:26 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 12 Aug 2009 10:40:26 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: 
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
	<88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com>
	<20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
	<20090811221403.GA18509@wrap.novuscom.net>
	<88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>
Message-ID: <88e286470908111740q5319c7f8r7387f598aedfa732@mail.gmail.com>

2009/8/12 Ian Bicking :
> On Tue, Aug 11, 2009 at 6:25 PM, Graham Dumpleton wrote:
>>
>> 2009/8/12 Henry Precheur :
>> > Using bytes for all `environ` values is easy to understand on the
>> > application side as long as you are aware of the encoding problem. The
>> > cost is inconvenience, but that's probably OK. It's also simpler to
>> > implement on the gateway/server side.
>>
>> Use of bytes everywhere can be inconvenient on the gateway/server
>> side, at least as far as end result for user.
>>
>> The specific problem is that WSGI environment is used to hold
>> information about the original request, as CGI variables, but also can
>> hold user specified custom variables.
>>
>> In the case of anything hosted via Apache, such as through mod_wsgi,
>> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
>> custom variables using the SetEnv directive. Thus one might say:
>>
>>   SetEnv trac.env_path /usr/local/trac/site-1
>
> Just to clarify, there specifically is no type restrictions on extension
> variables, which is any variable with a "." in it.  The type restrictions
> are solely for ALL_CAPS keys.  You can put ints or unicode or whatever in
> other variables.  (Probably this doesn't make things any easier for
> mod_wsgi, though; at least for this example)

If you want to change what the specification says from:

"""Finally, the environ dictionary may also contain server-defined variables. These variables should be named using only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway."""

to:

"""Finally, the environ dictionary may also contain server-defined variables. These variables MUST be named using only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway."""

then it is part of the way there, as at least one is drawing a line between what is being construed as a CGI variable and so would be bytes, and adapter/application variables which would be converted to string in whatever encoding makes sense for the server configuration system, which in the case of Apache would be UTF-8.

The above description would also have to be changed though, in as much as at the moment it says:

"""should be prefixed with a name that is unique to the defining server or gateway"""

This isn't really correct in practice, as the server configuration is just providing the mechanism for setting them and they may not necessarily be server or gateway variables, but variables a user is setting to customise the behaviour of the application. The way I read that line, strictly speaking, even though set as:

  SetEnv trac.env_path /usr/local/trac/site-1

it should be passed through as:

  mod_wsgi.trac.env_path

which would be rather silly. Thus the description needs to cater for the fact that application variables may be settable from server configuration and passed through as is.
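The line being drawn above, between upper-case CGI variables kept as bytes and lower-case server or application variables passed as decoded strings, might look roughly like this inside an adapter. This is only a sketch under that assumed rule, not how mod_wsgi does it; the names `is_cgi_key` and `normalise_environ` and the `config_encoding` parameter are hypothetical.

```python
def is_cgi_key(key):
    """Treat a key as a CGI variable if it starts with an upper-case letter."""
    return key[:1].isupper()


def normalise_environ(raw, config_encoding='utf-8'):
    """Keep CGI-style values as bytes; decode all other byte values to str.

    `raw` maps str keys to the byte values the web server handed over.
    Values that are not bytes (for example objects added by the adapter)
    are passed through untouched.
    """
    environ = {}
    for key, value in raw.items():
        if isinstance(value, bytes) and not is_cgi_key(key):
            value = value.decode(config_encoding)
        environ[key] = value
    return environ


raw = {
    'PATH_INFO': b'/caf\xc3\xa9',                 # derived from the request
    'trac.env_path': b'/usr/local/trac/site-1',   # set via SetEnv
}
environ = normalise_environ(raw)
```

Note that `trac.env_path` is passed through under the name the user chose, without a server-specific prefix.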
Anyway, if the rule is that anything in upper case is treated as CGI and passed as bytes, and anything in lower case isn't and is passed as a string, appropriately decoded, then that would eliminate one point of confusion as far as expectations go. It may not make things any easier for CGI under Python 3.0 though, where the values would all be strings anyway.

Now, is anyone willing to address the problem pointed out by others, that being able to return either bytes or strings (latin-1) for response headers is a pain for WSGI middleware to deal with?

Graham

From henry at precheur.org  Wed Aug 12 02:40:25 2009
From: henry at precheur.org (Henry Precheur)
Date: Tue, 11 Aug 2009 17:40:25 -0700
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
	<88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com>
	<20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
	<20090811221403.GA18509@wrap.novuscom.net>
	<88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>
Message-ID: <20090812004025.GA29297@wrap.novuscom.net>

On Wed, Aug 12, 2009 at 09:25:21AM +1000, Graham Dumpleton wrote:
> Use of bytes everywhere can be inconvenient on the gateway/server
> side, at least as far as end result for user.

Yes, but wouldn't it be simpler for mod_wsgi to only deal with bytes? unicode C strings -> bytes and char* -> bytes conversions seem straightforward.

But char* -> string doesn't look easy to do, since you have to 'guess' the encoding.

This is supposition; I have never worked on a WSGI server/gateway. Correct me if I'm wrong.

> The specific problem is that WSGI environment is used to hold
> information about the original request, as CGI variables, but also can
> hold user specified custom variables.
>
> In the case of anything hosted via Apache, such as through mod_wsgi,
> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
> custom variables using the SetEnv directive. Thus one might say:
>
>   SetEnv trac.env_path /usr/local/trac/site-1
>
> If the rule is that everything in WSGI environment coming from WSGI
> adapter must be bytes then you have a potential for mismatch in
> expectations of how values will be passed. That is, if set using
> SetEnv then would be bytes, but if set using WSGI middleware wrapper
> for configuration, more likely going to be string. It would seem
> overly onerous to expect WSGI middleware to use bytes for
> configuration variables as well and so force all consumers to always
> be converting to string using appropriate encoding, where required
> encoding potentially unknown.

Is it reasonable to expect a configuration variable to have a certain type? I am tempted to say 'no', but that's because I like the "everything is bytes" approach so much :) I don't have any experience with configuration variables passed via the WSGI environment though.

But it could be quite a problem. For example, 'Developer authentication', posted a month ago by Ian Bicking, requires its configuration variable to be a string, but I don't think this spec applies to WSGI on Py3K or WSGI 2.

> This is why I specifically asked previously, and which no one has
> answered, if bytes is to be used, which variables in WSGI environment
> should be passed as bytes. If there is a known specified list of
> variables which it is known will always be bytes, may be more
> manageable. If someone is going to suggest that only CGI variables
> should be bytes, then what does that actually mean. Remember that for
> FASTCGI, SCGI, CGI there isn't really a distinction and so where the
> boundary is as to what is a CGI variable is fuzzy although you could
> reverse transformation and get back bytes if know what to do it for.
>
> One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
> QUERY_STRING and maybe that will suffice. It may not though, because
> what about headers such as HTTP_REFERER? Also, what about additional
> SSL_? variables that a SSL module for web server may add?

What you are proposing is 'black-listing' some variables known to cause problems.

It will be difficult to come up with an exhaustive list of variables with different encodings. Even if we were able to come up with such a list, it creates 2 different cases and could end up complicating application developers' lives. That's why the approach "everything coming from the server/gateway is bytes" makes sense: it is simpler to explain, it is simpler to understand, and it's, I think, more pythonic (There should be one-- and preferably only one --obvious way to do it.)

Just consider the case of cookies. I don't know if you can use non-ASCII characters in them, but it is possible that they will mess up "everything is string except a, b, c" if we forget to include them in the list. "Everything is bytes" is in this sense more future-proof than "black-listing a, b, c". If a variable with a weird encoding appears a few months after the new PEP is released, "everything is bytes" still works, but the "black-list" approach stops working.
Cheers,

-- 
Henry Precheur

From graham.dumpleton at gmail.com  Wed Aug 12 03:23:44 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 12 Aug 2009 11:23:44 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: <20090812004025.GA29297@wrap.novuscom.net>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
	<88e286470908040544m143ff165w88a8873ba53f54f8@mail.gmail.com>
	<20090804160516.9BE7D3A411E@sparrow.telecommunity.com>
	<20090811221403.GA18509@wrap.novuscom.net>
	<88e286470908111625i39ae296bpa2c80ed06ae3979d@mail.gmail.com>
	<20090812004025.GA29297@wrap.novuscom.net>
Message-ID: <88e286470908111823p4d47f707rdd649e9f3ba83533@mail.gmail.com>

2009/8/12 Henry Precheur :
> On Wed, Aug 12, 2009 at 09:25:21AM +1000, Graham Dumpleton wrote:
>> Use of bytes everywhere can be inconvenient on the gateway/server
>> side, at least as far as end result for user.
>
> Yes, but wouldn't it be simpler for mod_wsgi to only deal with bytes?
> unicode C strings -> bytes and char* -> bytes conversions seem
> straightforward.

Programming at the C code level it doesn't really make any difference, as it is pretty well the same amount of C API calls. All the code for this is also already written in mod_wsgi and configurable to be done any which way so people could play with different alternatives. When a decision is actually made, I just need to make that decision the default.

The only extra complexity comes where a subset of the WSGI environment should be bytes. To make that at least somewhat easier, a simple well defined rule is needed, and one where, if the first character of the variable name is an uppercase letter, bytes are used might be reasonable. Anything more complicated may be a pain.

> But char* -> string doesn't look easy to do, since you have to 'guess'
> the encoding.

Only for stuff that derives from the HTTP request, which is the argument for using bytes and leaving it up to the application to decide. For user custom variables, it would be UTF-8, as that is what Apache effectively treats the configuration file as being.

> This is suppositions, I have never worked on WSGI server/gateway.

Which is the same for most people and perhaps why many don't want to wade into this argument. That is, the attitude is that it is a problem for those who want to write hosting adapters and not an issue for application developers. The reality is that it needs to be guided by application developers, as they are the ones who have to work with whatever interface is defined.

Graham

From fumanchu at aminus.org  Wed Aug 12 06:19:49 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Tue, 11 Aug 2009 21:19:49 -0700
Subject: [Web-SIG] WSGI 2
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: 

Graham Dumpleton wrote:
> So, for WSGI 1.0 style of interface and Python 3.0, the following is
> what I was going to implement.

FWIW, I'll answer with what we've implemented for CherryPy 3.2.

> 1. When running under Python 3, applications SHOULD produce bytes
> output, status line and headers.

Yup.

> 2. When running under Python 3, servers and gateways MUST accept
> strings for output, status line and headers. Such strings must be
> converted to bytes output using 'latin-1'. If string cannot be
> converted then is treated as an error.

Yes.

> 3. When running under Python 3, servers MUST provide wsgi.input as a
> binary (byte) input stream.

Boy howdy.

> 4. When running under Python 3, servers MUST provide a text stream for
> wsgi.errors. In converting this to a byte stream for writing to a
> file, the default encoding would be applied.

I'll look into it.

> 5. When running under Python 3, servers MUST provide CGI HTTP and
> server variables as strings. Where such values are sourced from a byte
> string, be that a Python byte string or C string, they should be
> converted as 'UTF-8'. If a specific web server infrastructure is able
> to support different encodings, then the WSGI adapter MAY provide a
> way for a user of the WSGI adapter to customise on a global basis, or
> on a per value basis what encoding is used, but this is entirely
> optional. Note that there is no requirement to deal with RFC 2047.
We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must be ascii-decodable. So are SERVER_PROTOCOL and our custom ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI. However, PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever is successful is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. Our origin server always sets SCRIPT_NAME to '', but if we populated it, we would make it decoded by the same charset.

All request headers are decoded via ISO-8859-1, which can't fail. Applications are expected to transcode these values if they believe them to be in another encoding.

> This is where I am going to diverge from what has been discussed before.
>
> The reason I am going to pass as UTF-8 and not latin-1 is that it
> looks like Apache effectively only supports use of UTF-8. Since this
> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
> even CGI likely cannot handle anything besides UTF-8 then I really
> can't see the point of trying to cater for a theoretical possibility
> that some HTTP client could use something besides UTF-8. In other
> words, the predominant case will be UTF-8, so let us target that.

That is predominant for the Request-URI, and we are defaulting to utf-8 for that as I mentioned above. I believe I demonstrated in http://mail.python.org/pipermail/web-sig/2009-April/003755.html that UTF-8 cannot be the predominant encoding for request headers, which are instead mostly ASCII with a few ISO-8859-1's, which is why we are defaulting to ISO-8859-1.
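The decoding strategy described above (try a configurable charset for the path, fall back to ISO-8859-1, record which one succeeded, and decode headers as ISO-8859-1) can be sketched as follows. This is an illustration only, not CherryPy's actual code; `decode_with_fallback` and `build_environ` are hypothetical names, and QUERY_STRING handling is left out for brevity.

```python
from urllib.parse import unquote_to_bytes


def decode_with_fallback(raw, charsets=('utf-8', 'iso-8859-1')):
    """Return (text, charset) for the first charset that decodes `raw`."""
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            pass
    raise ValueError('not decodable with any of %r' % (charsets,))


def build_environ(request_uri, raw_headers):
    """Build a partial environ from the raw Request-URI and header pairs."""
    raw_path = request_uri.split(b'?', 1)[0]
    path, used = decode_with_fallback(unquote_to_bytes(raw_path))
    environ = {
        'REQUEST_URI': request_uri,        # original bytes, untouched
        'PATH_INFO': path,
        'REQUEST_URI_ENCODING': used,      # lets apps transcode if needed
    }
    for name, value in raw_headers:
        key = 'HTTP_' + name.decode('iso-8859-1').upper().replace('-', '_')
        environ[key] = value.decode('iso-8859-1')   # cannot fail
    return environ


environ = build_environ(b'/caf%C3%A9?x=1', [(b'User-Agent', b'caf\xe9')])
# An application that disagrees with the server's guess can get the
# original path bytes back and decode them itself.
path_bytes = environ['PATH_INFO'].encode(environ['REQUEST_URI_ENCODING'])
```

Because ISO-8859-1 is last in the default list and accepts any byte sequence, the `ValueError` branch is only reachable if a caller passes a list of charsets without such a catch-all.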
> So, rather than burden every WSGI application with the need to convert
> from latin-1 back to bytes and then to UTF-8, let the server deal with
> it, with server using sensible default, and where server
> infrastructure can handle a different encoding, then it can provide
> option to use that encoding and WSGI application doesn't need to
> change.

If there are indeed more headers which are ISO-8859-1, then that same argument cuts both ways. I have no problem doing the same thing here as we do for PATH_INFO: a configurable charset, or better yet a list of charsets to try in order, with a sensible default, even UTF-8 would be fine. Regardless of the default, if it is configurable, then the successful encoding should be put in a canonical environ entry so apps can transcode it if the server got it wrong.

Re: bytes. We really do not want the server to set any of the above environ entries (except REQUEST_URI) to bytes. I'm surprised those of you who have substantial numbers of WSGI middleware aren't fighting this; it would mean decoding the same environ entries every time you switched middleware providers. Some of you said as much at PyCon: http://mail.python.org/pipermail/web-sig/2009-March/003701.html

Robert Brewer
fumanchu at aminus.org

From ianb at colorstudy.com  Wed Aug 12 06:42:51 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 11 Aug 2009 23:42:51 -0500
Subject: [Web-SIG] WSGI 2
In-Reply-To: 
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: 

On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer wrote:

> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> > server variables as strings. Where such values are sourced from a byte
> > string, be that a Python byte string or C string, they should be
> > converted as 'UTF-8'. If a specific web server infrastructure is able
> > to support different encodings, then the WSGI adapter MAY provide a
> > way for a user of the WSGI adapter to customise on a global basis, or
> > on a per value basis what encoding is used, but this is entirely
> > optional. Note that there is no requirement to deal with RFC 2047.
>
> We're passing unicode for almost everything.
>
> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> ACTUAL_SERVER_PROTOCOL entries.
>
> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> it, we would make it decoded by the same charset.
>

My understanding is that PATH_INFO *should* be UTF-8 regardless of what encoding a page might be in. At least that's what I got when testing Firefox. It might not be valid UTF-8 if it was manually constructed, but then there's little reason to think it is valid anything; only the bytes or REQUEST_URI are likely to be an accurate representation. (Frankly I wish PATH_INFO was not url-decoded, which would remove this issue entirely -- REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't know of reasonable cases where this wouldn't be true.)

I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be used to kind of reconstruct the original request path (the surrogateescape or whatever it is called would serve the same purpose, but is only available in Python 3).
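As a small illustration of why either fallback lets an application reconstruct the original request path, the sketch below round-trips bytes that are not valid UTF-8 through ISO-8859-1 and through the `surrogateescape` error handler (PEP 383, Python 3.1 and later). The sample byte strings are made up for the example.

```python
raw = b'/caf\xe9/\xff'          # not valid UTF-8

# ISO-8859-1 maps every byte to exactly one code point, so decoding never
# fails and encoding the result gives back the original bytes.
as_latin1 = raw.decode('iso-8859-1')
latin1_roundtrip = as_latin1.encode('iso-8859-1')

# surrogateescape smuggles undecodable bytes through as lone surrogates,
# which the same handler turns back into the original bytes on encode.
as_escaped = raw.decode('utf-8', 'surrogateescape')
escaped_roundtrip = as_escaped.encode('utf-8', 'surrogateescape')

# An application that believes a server-decoded ISO-8859-1 value was
# really UTF-8 on the wire can transcode it.
wire = '/caf\xe9'.encode('utf-8')
server_value = wire.decode('iso-8859-1')
app_value = server_value.encode('iso-8859-1').decode('utf-8')
```

The difference is mainly in what the intermediate string looks like: the ISO-8859-1 value is ordinary text that may be mojibake, while the `surrogateescape` value contains lone surrogates that cannot be encoded without the matching handler.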
-- 
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From graham.dumpleton at gmail.com  Wed Aug 12 06:58:50 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 12 Aug 2009 14:58:50 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To: 
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: <88e286470908112158h1a434d64oc27f7919b7f3b343@mail.gmail.com>

2009/8/12 Ian Bicking :
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'. If a specific web server infrastructure is able
>> > to support different encodings, then the WSGI adapter MAY provide a
>> > way for a user of the WSGI adapter to customise on a global basis, or
>> > on a per value basis what encoding is used, but this is entirely
>> > optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
>> ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
>> it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> encoding a page might be in. At least that's what I got when testing
> Firefox. It might not be valid UTF-8 if it was manually constructed, but
> then there's little reason to think it is valid anything; only the bytes or
> REQUEST_URI are likely to be an accurate representation.

As I understood it, PJE was suggesting that wasn't the case.

For example, what about the case where a URL appears as the target of a form POST and the encoding of that form page wasn't UTF-8? What is the browser going to send in that case?

Or is this the sort of case you have tested and would qualify by saying that if manually constructed anything could happen?

> (Frankly I wish
> PATH_INFO was not url-decoded, which would remove this issue entirely --
> REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
> know of reasonable cases where this wouldn't be true.)
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
> used to kind of reconstruct the original request path (the surrogateescape
> or whatever it is called would serve the same purpose, but is only available
> in Python 3).

Graham

From ianb at colorstudy.com  Wed Aug 12 07:05:40 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Wed, 12 Aug 2009 00:05:40 -0500
Subject: [Web-SIG] WSGI 2
In-Reply-To: <88e286470908112158h1a434d64oc27f7919b7f3b343@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
	<88e286470908112158h1a434d64oc27f7919b7f3b343@mail.gmail.com>
Message-ID: 

On Tue, Aug 11, 2009 at 11:58 PM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:

> 2009/8/12 Ian Bicking :
> > On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer wrote:
> >>
> >> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> >> > server variables as strings. Where such values are sourced from a byte
> >> > string, be that a Python byte string or C string, they should be
> >> > converted as 'UTF-8'. If a specific web server infrastructure is able
> >> > to support different encodings, then the WSGI adapter MAY provide a
> >> > way for a user of the WSGI adapter to customise on a global basis, or
> >> > on a per value basis what encoding is used, but this is entirely
> >> > optional. Note that there is no requirement to deal with RFC 2047.
> >>
> >> We're passing unicode for almost everything.
> >>
> >> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> >> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> >> ACTUAL_SERVER_PROTOCOL entries.
> >>
> >> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> >> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> >> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> >> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> >> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> >> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> >> it, we would make it decoded by the same charset.
> >
> > My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> > encoding a page might be in. At least that's what I got when testing
> > Firefox. It might not be valid UTF-8 if it was manually constructed, but
> > then there's little reason to think it is valid anything; only the bytes or
> > REQUEST_URI are likely to be an accurate representation.
>
> As I understood it, PJE was suggesting that wasn't the case.
>
> For example, what about case where URL appears for target of form POST
> and the encoding of that form page wasn't UTF-8. What is the browser
> going to send in that case.
>
> Or is this the sort of case you have tested and qualify as saying if
> manually constructed anything could happen?
>

Correct -- you can write any set of % encodings, and I don't think it even has to be able to validly url-decode (e.g., /foo%zzz will work).
It definitely doesn't have to be a valid encoding. However, if you actually
include unicode characters, they will always be encoded as UTF-8 (as goes
with the IRI standard). This is in a case like , the browser will request
/some%20page, because it escapes unsafe characters. Similarly if you request
it will encode that ? in UTF-8, then url-encode it, even if the page itself
is ISO-8859-1. Well, at least on Firefox. I used this to test:
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From henry at precheur.org Fri Aug 14 07:36:28 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 13 Aug 2009 22:36:28 -0700
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908112158h1a434d64oc27f7919b7f3b343@mail.gmail.com>
Message-ID: <20090814053628.GA3210@banane.novuscom.net>

On Wed, Aug 12, 2009 at 12:05:40AM -0500, Ian Bicking wrote:
> Correct -- you can write any set of % encodings, and I don't think it even
> has to be able to validly url-decode (e.g., /foo%zzz will work). It
> definitely doesn't have to be a valid encoding. However, if you actually
> include unicode characters, they will always be encoded as UTF-8 (as goes
> with the IRI standard). This is in a case like , the
> browser will request /some%20page, because it escapes unsafe characters.
> Similarly if you request it will encode that ?? in
> UTF-8, then url-encode it, even if the page itself is ISO-8859-1. Well, at
> least on Firefox. I used this to test:
> http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

I have run some tests regarding the encoding issue:

curl doesn't 'url-encode' its URLs:

curl 'http://hostname/français'
                          ^ latin-1 character

The latin-1 character is sent to the server. Lighttpd accepts the URL and
even returns a file if it exists.
Of course, if I try with the same characters in UTF-8 it doesn't work.

AFAIK RFC 2396 forbids non-ASCII characters in URLs. The problem is that
libcurl is quite popular (it used to be the transport library of Webkit/GTK+,
for example). It's hard to dismiss it as an utterly broken & obscure tool.
Many 'simplistic' HTTP clients may have the same problem.

Now let's talk a little bit about cookies... Cookies can contain whatever
'binary junk' the server sends. RFC 2965 says
(http://tools.ietf.org/html/rfc2965#page-5):

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Also, cookies can contain 'comments', which contain UTF-8 strings
(http://tools.ietf.org/html/rfc2965#page-6):

> Characters in value MUST be in UTF-8 encoding.

Firefox has no problem with cookies containing non-ASCII characters. It
looks like it assumes cookies are encoded using latin-1, since latin-1
characters are displayed correctly in Firebug, but not UTF-8 ones.

Cheers,

--
Henry Precheur

From graham.dumpleton at gmail.com Sun Aug 16 12:13:50 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Sun, 16 Aug 2009 20:13:50 +1000
Subject: [Web-SIG] WSGI 2
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com>
Message-ID: <88e286470908160313v371b70cdt21e7dc53416141a6@mail.gmail.com>

2009/8/12 Ian Bicking :
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'.
If a specific web server infrastructure is able >> > to support different encodings, then the WSGI adapter MAY provide a >> > way for a user of the WSGI adapter to customise on a global basis, or >> > on a per value basis what encoding is used, but this is entirely >> > optional. Note that there is no requirement to deal with RFC 2047. >> >> We're passing unicode for almost everything. >> >> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and >> must be ascii-decodable. So are SERVER_PROTOCOL and our custom >> ACTUAL_SERVER_PROTOCOL entries. >> >> The original bytes of the Request-URI are stored in REQUEST_URI. However, >> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a >> configurable charset, defaulting to UTF-8. If the path cannot be decoded >> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at >> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if >> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated >> it, we would make it decoded by the same charset. > > My understanding is that PATH_INFO *should* be UTF-8 regardless of what > encoding a page might be in. ?At least that's what I got when testing > Firefox. ?It might not be valid UTF-8 if it was manually constructed, but > then there's little reason to think it is valid anything; only the bytes or > REQUEST_URI are likely to be an accurate representation. ?(Frankly I wish > PATH_INFO was not url-decoded, which would remove this issue entirely -- > REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't > know of reasonable cases where this wouldn't be true.) > I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be > used to kind of reconstruct the original request path (the surrogateescape > or whatever it is called would serve the same purpose, but is only available > in Python 3). 
Having thought about it for a while, I get the feeling that falling back to
latin-1 when a string cannot be decoded as UTF-8 is a bad idea. That URLs
wouldn't consistently use the same encoding all the time just seems wrong.
I would see it as a case for returning a bad request status. If an
application coder knows they are actually going to be dealing with latin-1,
as that is how the application is written, then they should be specifying
that it should always be latin-1 instead of utf-8. Thus, the WSGI adapter
should provide a means to override what encoding is used.

For simple WSGI adapters which only service one WSGI application, it would
apply to the whole URL namespace. For something like Apache, where it could
map to multiple WSGI applications, it may want to provide a means of
overriding the encoding for specific subsets of URLs, i.e., using the
Location directive for example.

Graham

From fumanchu at aminus.org Mon Aug 17 05:06:03 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Sun, 16 Aug 2009 20:06:03 -0700
Subject: [Web-SIG] WSGI 2: Decoding the Request-URI
In-Reply-To: <88e286470908160313v371b70cdt21e7dc53416141a6@mail.gmail.com>
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908160313v371b70cdt21e7dc53416141a6@mail.gmail.com>
Message-ID:

I wrote:
> PATH_INFO and QUERY_STRING are ... decoded via a configurable
> charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful
> is stored at environ['REQUEST_URI_ENCODING'] so middleware and
> apps can transcode if needed.

and Ian replied:
> My understanding is that PATH_INFO *should* be UTF-8 regardless of
> what encoding a page might be in. At least that's what I got when
> testing Firefox. It might not be valid UTF-8 if it was manually
> constructed, but then there's little reason to think it is valid...
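
For readers following along, the fallback scheme quoted above could look
roughly like the sketch below. It is only an illustration: the helper name
decode_request_path is hypothetical, and the REQUEST_URI_ENCODING key is
used as described in the quoted message, not taken from any published
spec. The only library call is urllib.parse.unquote_to_bytes from the
Python 3 standard library.

```python
from urllib.parse import unquote_to_bytes


def decode_request_path(raw_path, charset="utf-8"):
    """Percent-decode raw_path (bytes) and decode the result to str.

    Hypothetical helper: tries `charset` first, then falls back to
    ISO-8859-1. Returns (text, encoding_used).
    """
    path_bytes = unquote_to_bytes(raw_path)
    try:
        return path_bytes.decode(charset), charset
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return path_bytes.decode("iso-8859-1"), "iso-8859-1"


# Populate a toy environ the way the quoted description suggests.
environ = {}
text, used = decode_request_path(b"/a%c3%bf")
environ["PATH_INFO"] = text
environ["REQUEST_URI_ENCODING"] = used
```

An application or middleware that disagrees with the guess can recover the
bytes with environ["PATH_INFO"].encode(environ["REQUEST_URI_ENCODING"]) and
decode them again however it likes.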
Actually, current browsers tend to use UTF-8 for the path, and either the encoding of the document [1] or Windows-1252 [2] for the querystring. But the vast majority of HTTP user agents are not browsers [3]. Even if that were not so, we should not define WSGI to only interoperate with the most current browsers. and Graham added: > Thinking about it for a while, I get the feel that having a fallback > to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That > URLs wouldn't consistently use the same encoding all the time just > seems wrong. I would see it as returning a bad request status. If an > application coder knows they are actually going to be dealing with > latin-1, as that is how the application is written, then they should > be specifying it should be latin-1 always instead of utf-8. Thus, the > WSGI adapter should provide a means to override what encoding is used. Applications do produce URI's (and IRI's, etc. that need to be converted into URI's) and do transfer them in media types like HTML, which define how to encode a.href's and form.action's before %-encoding them [4]. But these are not the only vectors by which clients obtain or generate Request-URI's. > For simple WSGI adapters which only service one WGSI application, then > it would apply to whole URL namespace. As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a filename encoding defined by the OS which is different than that of the rest of the URI's generated/understood by even the most coherent application. The encoding used for a URI is only really important for one reason: URI comparison. Comparison is at the heart of handler dispatch, static resource identification, and proper HTTP cache operation. It is for these reasons that RFC 3986 has an extensive section on the matter [5], including a "ladder" of approaches: * Simple String Comparison * Case Normalization (e.g. /a%3D == /a%3d) * Percent-Encoding Normalization (e.g. 
/a%62c == /abc) * Path Segment Normalization (e.g. /abc/../def == /def) * Scheme-Based Normalization (e.g. http://example.com == http://example.com:80/) * Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing showed it to be) I think it would be beneficial to those who develop WSGI application interfaces to be able to assume that at least case-, percent-, path-, and scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by all WSGI 2 origin servers. All of those except for the first one can be accomplished without decoding the target URI. But that first section specifically states: "In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding." In other words, the URI spec seems to imply that the two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ values must be byte strings. IMO WSGI 2 should do better in this regard. > For something like Apache where > could map to multiple WSGI applications, then it may want to provide > means of overriding encoding for specific subsets o URLs, ie., using > Location directive for example. For the three reasons above, I don't think we can assume that the application will always receive equivalent URI's encoded in a single, foreseen encoding. Yet we still haven't answered the question of how to handle unforeseen encodings. You're right that, if the server-side stack as a whole cannot map a particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 over 400, but either is fine. However, we quite often use only a portion of the URI when attempting to locate an appropriate handler; sometimes just the leading "/" character! 
The remaining characters are often passed as function arguments to the handler, or stuck in some parameter list/dict. In many cases, the charset used to decode these values either: is unimportant; follows complex rules from one resource to another; or is merely reencoded, since the application really does care about bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI environ entry to declare the charset which was used to decode) can handle all of these cases. Server configuration options cannot, at least not without their specification becoming unwieldy. Robert Brewer fumanchu at aminus.org [1] http://markmail.org/message/r6qzszybsk5pwzbt [2] http://markmail.org/message/47cekkpvdjaectvi [3] http://markmail.org/message/3bsxo7q6eztcp3yo [4] http://www.w3.org/TR/html4/interact/forms.html#idx-character_encoding [5] http://tools.ietf.org/html/rfc3986#section-6 From fumanchu at aminus.org Mon Aug 17 16:37:43 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 17 Aug 2009 07:37:43 -0700 Subject: [Web-SIG] WSGI 2: Decoding the Request-URI References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com><88e286470908160313v371b70cdt21e7dc53416141a6@mail.gmail.com> Message-ID: I wrote: > Applications do produce URI's (and IRI's, etc. that need to be > converted into URI's) and do transfer them in media types like > HTML, which define how to encode a.href's and form.action's > before %-encoding them [4]. But these are not the only vectors > by which clients obtain or generate Request-URI's. > ... > As someone (Alan Kennedy?) noted at PyCon, static resources may > depend upon a filename encoding defined by the OS which is > different than that of the rest of the URI's generated/understood > by even the most coherent application. > ... > "In practical terms, character-by-character comparisons should be > done codepoint-by-codepoint after conversion to a common character > encoding." 
In other words, the URI spec seems to imply that the
> two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the former
> is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF"
> encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about
> this, since all environ values must be byte strings. IMO WSGI
> 2 should do better in this regard.
> ...
> For the three reasons above, I don't think we can assume that the
> application will always receive equivalent URI's encoded in a
> single, foreseen encoding.

Did I say 3 reasons? I meant 4: Accept-Charset.


Robert Brewer
fumanchu at aminus.org

From pje at telecommunity.com Mon Aug 17 20:53:35 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Mon, 17 Aug 2009 14:53:35 -0400
Subject: [Web-SIG] WSGI 2: Decoding the Request-URI
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908160313v371b70cdt21e7dc53416141a6@mail.gmail.com>
Message-ID: <20090817185350.770493A4079@sparrow.telecommunity.com>

At 07:37 AM 8/17/2009 -0700, Robert Brewer wrote:
>Did I say 3 reasons? I meant 4: Accept-Charset.

Chief amongst the reasons... amongst our reasonry... Right, we'll
come in again. ;-)

From henry at precheur.org Thu Aug 20 20:03:13 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 20 Aug 2009 11:03:13 -0700
Subject: [Web-SIG] WSGI 2: Decoding the Request-URI
In-Reply-To:
References: <88e286470908031738v428da90dl80fdc6cde16e84ba@mail.gmail.com> <88e286470908160313v371b70cdt21e7dc53416141a6@mail.gmail.com>
Message-ID: <20090820180313.GA16331@wrap.novuscom.net>

On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
> However, we quite often use only a portion of the URI when attempting
> to locate an appropriate handler; sometimes just the leading "/"
> character! The remaining characters are often passed as function
> arguments to the handler, or stuck in some parameter list/dict.
In
> many cases, the charset used to decode these values either: is
> unimportant; follows complex rules from one resource to another; or is
> merely reencoded, since the application really does care about bytes
> and not characters. Falling back to ISO-8859-1 (and minting a new WSGI
> environ entry to declare the charset which was used to decode) can
> handle all of these cases. Server configuration options cannot, at
> least not without their specification becoming unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI
here, but all request headers.)

Decoding everything as ISO-8859-1 has the nice property of keeping the
information intact. It would be a good heuristic if everything, with a
few exceptions, was encoded using ISO-8859-1. Just transcode the few
problematic cases at the application level and everybody is happy. A
string decoded from ISO-8859-1 is like a bytes object with a string
'interface' on top of it.

But it sweeps the encoding problem under the carpet. The problem with
Python 2 was that str and unicode were almost the same, so much the same
that it was possible to mix them without too many problems:

>>> 'foo' == u'foo'
True

Python 3 made bytes and string 'incompatible' to force programmers to
handle the encoding problem as soon as possible:

>>> b'foo' == 'foo'
False

By passing `str()` to the application, the application author could
believe that the encoding problem has been handled. But in most cases it
hasn't been handled at all. The application author still has to
transcode all the incorrectly decoded strings. We are back to Python 2's
bad old days, where we can't be sure that what we got is properly
decoded: Was that string decoded using latin-1? Maybe a middleware
transcoded it to UTF-8 before the application was called. Maybe the
application itself transcoded it at some point, but then we need to keep
track of what was transcoded. Maybe the application should transcode
everything when it is called.
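
The transcoding step discussed above can be sketched as follows. This is
a minimal illustration, not anything from a WSGI draft: the helper name
transcode_latin1 is hypothetical, and the sample value stands in for
whatever bytes a client might send. It shows both points at once: the
original bytes survive an ISO-8859-1 decode, yet the resulting str is
not the intended text until someone re-decodes it.

```python
def transcode_latin1(value, encoding="utf-8"):
    """Recover the original bytes from an ISO-8859-1-decoded str and
    decode them again with `encoding` (hypothetical helper)."""
    raw = value.encode("iso-8859-1")  # lossless: one code point per byte
    return raw.decode(encoding)


# Bytes as a client might send them: 'français' encoded as UTF-8.
raw_header = "fran\u00e7ais".encode("utf-8")

# What a gateway that decodes everything as ISO-8859-1 would hand over.
as_given = raw_header.decode("iso-8859-1")

# The str looks like text but is not the intended text yet.
looks_right = (as_given == "fran\u00e7ais")

# The original bytes are still recoverable, so the application can fix it.
fixed = transcode_latin1(as_given)
```

Nothing in the str itself says whether it has already been transcoded,
which is exactly the bookkeeping problem described above.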
Also, EVERY application author will have to read the PEP, especially the
paragraph saying:

> Everything we give you are strings, but you still have to deal
> with the encoding mess.

Otherwise he will have weird problems like when he was using Python 2,
because the interface is not clear. Strings are supposed to be text and
only text. Decoding everything as ISO-8859-1 means strings are not text
anymore, they are 'encoded data' [1]. bytes are supposed to be 'encoded
data' and binary blobs. By giving applications bytes, the author knows
right away he should decode them. No need to read the PEP.

`bytes` can do everything `str` can do, with the notable exception of
'format':

>>> import re
>>> b'foo bar'.title()
b'Foo Bar'
>>> b'/foo/bar/fran\xc3ois'.split(b'/')
[b'', b'foo', b'bar', b'fran\xc3ois']
>>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
(b'foo', b'1234')

I understand that `bytes()` is an unfamiliar beast. But I believe the
encoding problem is the realm of the application, not the realm of the
gateway. Let the application handle the encoding problem and don't give
it a half-baked solution.

Using bytes also has its own set of problems. The standard library
doesn't support bytes very well. For example, urllib.parse.unquote()
doesn't work with bytes, and other parts of urllib.parse have issues too.

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

--
Henry Precheur