From graham.dumpleton at gmail.com Wed Apr 1 12:29:18 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Wed, 1 Apr 2009 21:29:18 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. Message-ID: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Based on any discussions at PyCon, can someone give a summary of any conclusions drawn about how WSGI 1.0 should be implemented in Python 3.0. The previous analysis of this is at: http://www.wsgi.org/wsgi/Amendments_1.0 I realise it may be work in progress, but I note that work being done on WSGI server associated with CherryPy for Python 3.0 by Robert isn't necessarily following that and is perhaps starting to do things in a way that I understood were only being speculated upon for WSGI 2.0, not for WSGI 1.0. For example: http://www.cherrypy.org/changeset/2199 In particular, it has: environ["SCRIPT_NAME"] = b"" The bit from prior analysis which is relevant is: """When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)""" Since mod_wsgi has used the prior analysis as basis of Python 3.0 support, would want to know pretty soon what direction WSGI 1.0 under Python 3.0 is going to take, else I am going to have to delay releasing mod_wsgi 3.0 or simply yank the support for Python 3.0. Robert, yes I know I could have asked you direct, but want a consensus from all who were present at PyCon and discussed these things. Graham From fumanchu at aminus.org Wed Apr 1 14:18:37 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 1 Apr 2009 05:18:37 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. 
In-Reply-To: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > Based on any discussions at PyCon, can someone give a summary of any > conclusions drawn about how WSGI 1.0 should be implemented in Python > 3.0. > > The previous analysis of this is at: > > http://www.wsgi.org/wsgi/Amendments_1.0 > > I realise it may be work in progress, but I note that work being done > on WSGI server associated with CherryPy for Python 3.0 by Robert isn't > necessarily following that and is perhaps starting to do things in a > way that I understood were only being speculated upon for WSGI 2.0, > not for WSGI 1.0. For example: > > http://www.cherrypy.org/changeset/2199 > > In particular, it has: > > environ["SCRIPT_NAME"] = b"" > > The bit from prior analysis which is relevant is: > > """When running under Python 3, servers MUST provide CGI HTTP > variables as strings, decoded from the headers using HTTP standard > encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI > or WSGI variables that should NOT be strings?)""" > > Since mod_wsgi has used the prior analysis as basis of Python 3.0 > support, would want to know pretty soon what direction WSGI 1.0 under > Python 3.0 is going to take, else I am going to have to delay > releasing mod_wsgi 3.0 or simply yank the support for Python 3.0. > > Robert, yes I know I could have asked you direct, but want a consensus > from all who were present at PyCon and discussed these things. Good timing. We had been thinking to make everything strings except for SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled from the Request-URI, which may be in any encoding. It was thought that the app would be best-qualified to decode those three. I hope to discuss that further this morning at the sprints. Turns out the cgi module in Python 3 only does text, not bytes. 
I considered submitting a patch to make it handle bytes for fp/environ but that became difficult quickly and may complicate the cgi module needlessly if we can instead use unicode for those 3 environ entries. I'll report back here. Robert Brewer fumanchu at aminus.org From guido at python.org Wed Apr 1 18:34:24 2009 From: guido at python.org (Guido van Rossum) Date: Wed, 1 Apr 2009 09:34:24 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer wrote: > Good timing. We had been thinking to make everything strings except for > SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled > from the Request-URI, which may be in any encoding. It was thought that > the app would be best-qualified to decode those three. Argh. The *meaning* of these fields is clearly text. It would be most unfortunately if all apps were required to deal with decoding bytes for these (there is no choice any more, unlike in 2.x). I appreciate the sentiment that the encoding is unknown, but I would much prefer it if there was a default encoding that the app could override, or if there was some other mechanism whereby the app would not have to be bothered with decoding bytes unless it cared. Note that Py3k also treats filenames as text, with an optional escape hatch for using bytes that only very few apps will need to use. > I hope to discuss that further this morning at the sprints. Turns out > the cgi module in Python 3 only does text, not bytes. I considered > submitting a patch to make it handle bytes for fp/environ but that > became difficult quickly and may complicate the cgi module needlessly if > we can instead use unicode for those 3 environ entries. I'll report back > here. 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From fumanchu at aminus.org Wed Apr 1 18:37:38 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 1 Apr 2009 09:37:38 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: Guido van Rossum wrote: > Sent: Wednesday, April 01, 2009 9:34 AM > To: Robert Brewer > Cc: Web SIG > Subject: Re: [Web-SIG] Python 3.0 and WSGI 1.0. > > On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer > wrote: > > Good timing. We had been thinking to make everything strings except > for > > SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled > > from the Request-URI, which may be in any encoding. It was thought > that > > the app would be best-qualified to decode those three. > > Argh. The *meaning* of these fields is clearly text. It would be most > unfortunately if all apps were required to deal with decoding bytes > for these (there is no choice any more, unlike in 2.x). I appreciate > the sentiment that the encoding is unknown, but I would much prefer it > if there was a default encoding that the app could override, or if > there was some other mechanism whereby the app would not have to be > bothered with decoding bytes unless it cared. > > Note that Py3k also treats filenames as text, with an optional escape > hatch for using bytes that only very few apps will need to use. Understood. I think we have plenty of options here for returning text. We'll discuss this ASAP in the room. Robert Brewer fumanchu at aminus.org From janssen at parc.com Wed Apr 1 19:59:56 2009 From: janssen at parc.com (Bill Janssen) Date: Wed, 1 Apr 2009 10:59:56 PDT Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: <86217.1238608796@parc.com> Guido van Rossum wrote: > On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer wrote: > > Good timing. 
We had been thinking to make everything strings except for > > SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled > > from the Request-URI, which may be in any encoding. It was thought that > > the app would be best-qualified to decode those three. > > Argh. The *meaning* of these fields is clearly text. I wouldn't read too much into those names -- they were chosen when the CGI spec was just gestating, long before the usage patterns solidified, and don't necessarily reflect the usage of the data bound to them. I believe this work was done before the formal IETF definition of a URL, for instance. I think the controlling reference here is RFC 3875. It's not at all clear to me what the SCRIPT_NAME is. Is it a pathname, involving the local file system's filenames, which recent discussions seem to indicate may or may not correspond to human-notional strings, or a URI path? I'm OK with calling it text, with a proviso that there may be cases where it's not. I've never actually seen a CGI call with PATH_INFO set; I think it's obsolete usage (but pretty clearly a string). RFC 3875 says, "Similarly, treatment of non US-ASCII characters in the path is system-defined." QUERY_STRING -- should always be an ASCII string. May indeed encode non-Unicode strings or purely binary data, but when passed to the CGI script, it's still encoded as it was in the URI. Bill From ianb at colorstudy.com Wed Apr 1 21:15:17 2009 From: ianb at colorstudy.com (Ian Bicking) Date: Wed, 1 Apr 2009 14:15:17 -0500 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: On Wed, Apr 1, 2009 at 11:34 AM, Guido van Rossum wrote: > On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer wrote: >> Good timing. We had been thinking to make everything strings except for >> SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled >> from the Request-URI, which may be in any encoding. 
It was thought that >> the app would be best-qualified to decode those three. > > Argh. The *meaning* of these fields is clearly text. It would be most > unfortunately if all apps were required to deal with decoding bytes > for these (there is no choice any more, unlike in 2.x). I appreciate > the sentiment that the encoding is unknown, but I would much prefer it > if there was a default encoding that the app could override, or if > there was some other mechanism whereby the app would not have to be > bothered with decoding bytes unless it cared. This might be fine, except it is hard. You can't just take arbitrary bytes and do script_name.decode('utf8'), and then when you realize you had it wrong do script_name.encode('utf8').decode('latin1'). -- Ian Bicking | http://blog.ianbicking.org From guido at python.org Wed Apr 1 22:09:17 2009 From: guido at python.org (Guido van Rossum) Date: Wed, 1 Apr 2009 13:09:17 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: On Wed, Apr 1, 2009 at 12:15 PM, Ian Bicking wrote: > On Wed, Apr 1, 2009 at 11:34 AM, Guido van Rossum wrote: >> On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer wrote: >>> Good timing. We had been thinking to make everything strings except for >>> SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled >>> from the Request-URI, which may be in any encoding. It was thought that >>> the app would be best-qualified to decode those three. >> >> Argh. The *meaning* of these fields is clearly text. It would be most >> unfortunately if all apps were required to deal with decoding bytes >> for these (there is no choice any more, unlike in 2.x). 
I appreciate >> the sentiment that the encoding is unknown, but I would much prefer it >> if there was a default encoding that the app could override, or if >> there was some other mechanism whereby the app would not have to be >> bothered with decoding bytes unless it cared. > > This might be fine, except it is hard. You can't just take arbitrary > bytes and do script_name.decode('utf8'), and then when you realize you > had it wrong do script_name.encode('utf8').decode('latin1'). Well you could make the bytes versions available under different keys. I think you do something a bit similar this in webob, e.g. req.params vs. req.str_params. (Perhaps you could have QUERY_STRING and QUERY_BYTES.) The decode() call used to create the text strings could use 'replace' as the error handler and the app could check for the presence of the replacement character ('\ufffd') in the string to see if there was a problem; or it could just work with the string containing that character and report the user some kind of 40x or 50x error. Frameworks (like webob) would of course do the right thing and look for QUERY_BYTES before QUERY_STRING. QUERY_BYTES should probably be optional. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From graham.dumpleton at gmail.com Wed Apr 1 22:51:35 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 2 Apr 2009 07:51:35 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> 2009/4/2 Guido van Rossum : > On Wed, Apr 1, 2009 at 12:15 PM, Ian Bicking wrote: >> On Wed, Apr 1, 2009 at 11:34 AM, Guido van Rossum wrote: >>> On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer wrote: >>>> Good timing.
We had been thinking to make everything strings except for >>>> SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are pulled >>>> from the Request-URI, which may be in any encoding. It was thought that >>>> the app would be best-qualified to decode those three. >>> >>> Argh. The *meaning* of these fields is clearly text. It would be most >>> unfortunately if all apps were required to deal with decoding bytes >>> for these (there is no choice any more, unlike in 2.x). I appreciate >>> the sentiment that the encoding is unknown, but I would much prefer it >>> if there was a default encoding that the app could override, or if >>> there was some other mechanism whereby the app would not have to be >>> bothered with decoding bytes unless it cared. >> >> This might be fine, except it is hard. You can't just take arbitrary >> bytes and do script_name.decode('utf8'), and then when you realize you >> had it wrong do script_name.encode('utf8').decode('latin1'). > > Well you could make the bytes versions available under different keys. > I think you do something a bit similar this in webob, e.g. req.params > vs. req.str_params. (Perhaps you could have QUERY_STRING and > QUERY_BYTES.) The decode() call used to create the text strings could > use 'replace' as the error handler and the app could check for the > presence of the replacement character ('\ufffd') in the string to see > if there was a problem; or it could just work with the string > containing that character and report the user some kind of 40x or 50x > error. Frameworks (like webob) would of course do the right thing and > look for QUERY_BYTES before QUERY_STRING. QUERY_BYTES should probably > be optional. Can we please not invent new names at global context in WSGI environment dictionary, especially ones that mutate existing names rather than using a prefix or suffix. If we are going to carry values in two different formats, then use the 'wsgi' name space.
Thus, for byte versions of values perhaps use: wsgi.request_uri wsgi.script_name wsgi.path_info wsgi.query_string etc In other words, leave all the existing CGI variables to come through as latin-1 decode and do anything new in 'wsgi' variable namespace, identifying only the minimal set which needs to be made available as bytes. Graham From fumanchu at aminus.org Thu Apr 2 00:30:02 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 1 Apr 2009 15:30:02 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > 2009/4/2 Guido van Rossum : > > On Wed, Apr 1, 2009 at 12:15 PM, Ian Bicking > wrote: > >> On Wed, Apr 1, 2009 at 11:34 AM, Guido van Rossum > wrote: > >>> On Wed, Apr 1, 2009 at 5:18 AM, Robert Brewer > wrote: > >>>> Good timing. We had been thinking to make everything strings > except for > >>>> SCRIPT_NAME, PATH_INFO, and QUERY_STRING, since these few are > pulled > >>>> from the Request-URI, which may be in any encoding. It was thought > that > >>>> the app would be best-qualified to decode those three. > >>> > >>> Argh. The *meaning* of these fields is clearly text. It would be > most > >>> unfortunately if all apps were required to deal with decoding bytes > >>> for these (there is no choice any more, unlike in 2.x). I > appreciate > >>> the sentiment that the encoding is unknown, but I would much prefer > it > >>> if there was a default encoding that the app could override, or if > >>> there was some other mechanism whereby the app would not have to be > >>> bothered with decoding bytes unless it cared. > >> > >> This might be fine, except it is hard. 
You can't just take > arbitrary > >> bytes and do script_name.decode('utf8'), and then when you realize > you > >> had it wrong do script_name.encode('utf8').decode('latin1'). > > > > Well you could make the bytes versions available under different > keys. > > I think you do something a bit similar this in webob, e.g. req.params > > vs. req.str_params. (Perhaps you could have QUERY_STRING and > > QUERY_BYTES.) The decode() call used to create the text strings could > > use 'replace' as the error handler and the app could check for the > > presence of the replacement character ('\ufffd') in the string to see > > if there was a problem; or it could just work with the string > > containing that character and report the user some kind of 40x or 50x > > error. Frameworks (like webob) would of course do the right thing and > > look for QUERY_BYTES before QUERY_STRING. QUERY_BYTES should probably > > be optional. > > Can we please not invent new names at global context in WSGI > environment dictionary, especially ones that mutate existing names > rather than using a prefix or suffix. > > If we are going to carry values in two different formats, then use the > 'wsgi' name space. Thus, for byte versions of values perhaps use: > > wsgi.request_uri > wsgi.script_name > wsgi.path_info > wsgi.query_string > etc > > In other words, leave all the existing CGI variables to come through > as latin-1 decode and do anything new in 'wsgi' variable namespace, > identifying only the minimal set which needs to be made available as > bytes. Some thoughts: 1. If we always decode as Latin-1 it should be lossless, and consumers could retrieve the original bytes with val.encode('Latin-1'), thus removing the need for separate entries. 2. CGI says, "REMOTE_USER = *OCTET" :( 3. Bikeshed: "wsgi.xyz" is too close to "XYZ" in my opinion. Robert Brewer fumanchu at aminus.org From pje at telecommunity.com Thu Apr 2 00:37:49 2009 From: pje at telecommunity.com (P.J.
Eby) Date: Wed, 01 Apr 2009 18:37:49 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> Message-ID: <20090401223524.F138A3A40A7@sparrow.telecommunity.com> At 01:09 PM 4/1/2009 -0700, Guido van Rossum wrote: >Well you could make the bytes versions available under different keys. >I think you do something a bit similar this in webob, e.g. req.params >vs. req.str_params. (Perhaps you could have QUERY_STRING and >QUERY_BYTES.) The decode() call used to create the text strings could >use 'replace' as the error handler and the app could check for the >presence of the replacement character ('\ufffd') in the string to see >if there was a problem; or it could just work with the string >containing that character and report the user some kind of 40x or 50x >error. Frameworks (like webob) would of course do the right thing and >look for QUERY_BYTES before QUERY_STRING. QUERY_BYTES should probably >be optional. The big problem I see with this approach is that any middleware that operates on these environment keys would have to be changed. I think perhaps the problem here is the assumption that the environ dictionary has to be a straight-up copy of os.environ, when it can be whatever we want it to be. If wsgiref or other CGI->WSGI gateways have to change to get the environ set up correctly, this is less of a problem than forcing everybody to rewrite their middleware and apps. From guido at python.org Thu Apr 2 00:51:34 2009 From: guido at python.org (Guido van Rossum) Date: Wed, 1 Apr 2009 15:51:34 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <20090401223524.F138A3A40A7@sparrow.telecommunity.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <20090401223524.F138A3A40A7@sparrow.telecommunity.com> Message-ID: On Wed, Apr 1, 2009 at 3:37 PM, P.J. 
Eby wrote: > At 01:09 PM 4/1/2009 -0700, Guido van Rossum wrote: >> >> Well you could make the bytes versions available under different keys. >> I think you do something a bit similar this in webob, e.g. req.params >> vs. req.str_params. (Perhaps you could have QUERY_STRING and >> QUERY_BYTES.) The decode() call used to create the text strings could >> use 'replace' as the error handler and the app could check for the >> presence of the replacement character ('\ufffd') in the string to see >> if there was a problem; or it could just work with the string >> containing that character and report the user some kind of 40x or 50x >> error. Frameworks (like webob) would of course do the right thing and >> look for QUERY_BYTES before QUERY_STRING. QUERY_BYTES should probably >> be optional. > > The big problem I see with this approach is that any middleware that > operates on these environment keys would have to be changed. > > I think perhaps the problem here is the assumption that the environ > dictionary has to be a straight-up copy of os.environ, when it can be > whatever we want it to be. If wsgiref or other CGI->WSGI gateways have to > change to get the environ set up correctly, this is less of a problem than > forcing everybody to rewrite their middleware and apps. Well I would assume that changing the type of these variables to bytes would *also* cause problems for a lot of middleware. The proposal that the bytes values should be in the 'wsgi.*' namespace would work for me too. Note that os.environ already has some not-entirely-solved problems with encodings, which we currently try to pretend don't exist... -- --Guido van Rossum (home page: http://www.python.org/~guido/) From alan at xhaus.com Thu Apr 2 00:53:16 2009 From: alan at xhaus.com (Alan Kennedy) Date: Wed, 1 Apr 2009 17:53:16 -0500 Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> Message-ID: <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> Hi Graham, I think yours is a good solution to the problem. [Graham] > In other words, leave all the existing CGI variables to come through > as latin-1 decode As latin-1 or rfc-2047 decoded, to unicode. > and do anything new in 'wsgi' variable namespace, So the server provides "wsgi.server_decoded_SCRIPT_NAME" == u"whatever" "wsgi.server_decoded_PATH_INFO" == u"whatever" "wsgi.server_decode_charset" == u"utf-8" Just my €0,02. Alan. From graham.dumpleton at gmail.com Thu Apr 2 01:00:10 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 2 Apr 2009 10:00:10 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> Message-ID: <88e286470904011600j340bbec6m5d307ea552bb20ef@mail.gmail.com> 2009/4/2 Alan Kennedy : > Hi Graham, > > I think yours is a good solution to the problem. > > [Graham] >> In other words, leave all the existing CGI variables to come through >> as latin-1 decode > > As latin-1 or rfc-2047 decoded, to unicode. Has anyone actually got an example of code for doing RFC-2047 decoding. Are there even any systems which make use of that encoding for web requests anyway. I still haven't really addressed that decoding requirement and I haven't seen any existing Python web stuff that tries to.
>> and do anything new in 'wsgi' variable namespace, > > So the server provides > > "wsgi.server_decoded_SCRIPT_NAME" == u"whatever" > "wsgi.server_decoded_PATH_INFO" == u"whatever" > "wsgi.server_decode_charset" == u"utf-8" Hmmm, I thought we were talking about the 'wsgi.' variants being bytes. Ie., only talking here about Python 3.0. The existing SCRIPT_NAME etc, would be string (unicode), but as latin-1. Graham From fumanchu at aminus.org Thu Apr 2 01:05:26 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 1 Apr 2009 16:05:26 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com><88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> Message-ID: Alan Kennedy wrote: > Hi Graham, > > I think yours is a good solution to the problem. > > [Graham] > > In other words, leave all the existing CGI variables to come through > > as latin-1 decode > > As latin-1 or rfc-2047 decoded, to unicode. > > > and do anything new in 'wsgi' variable namespace, > > So the server provides > > "wsgi.server_decoded_SCRIPT_NAME" == u"whatever" > "wsgi.server_decoded_PATH_INFO" == u"whatever" > "wsgi.server_decode_charset" == u"utf-8" I think everyone at the sprint today acquiesced to having SCRIPT_NAME/PATH_INFO/QUERY_STRING be set in the environ as unicode. The server can decide (probably subject to configuration). I've implemented this in the python3 branch of CherryPy and it seems to work brilliantly. Assuming the server *is* configurable, deployers should be able to choose Latin-1 if they need to recover the original bytes, without having to support a separate set of encoded-byte entries. Side note: wrapping the wsgi.input fp in a DecodingWrapper before handing it to cgi works great, too. No need to rewrite the cgi module to support bytes as I feared. 
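The Latin-1 round trip described above can be sketched roughly as follows. This is only an illustration of the idea, not CherryPy's or mod_wsgi's implementation; raw_path, recover_bytes and redecode are hypothetical names, and the sample path is assumed to have been sent as UTF-8.

```python
# Illustrative only: raw_path stands in for the undecoded path bytes
# taken from the Request-URI (assumed here to be UTF-8 on the wire).
raw_path = "/caf\u00e9".encode("utf-8")

# Server side: Latin-1 maps each byte 0-255 to exactly one code point,
# so this decode cannot fail and loses no information.
environ = {"PATH_INFO": raw_path.decode("latin-1")}


def recover_bytes(value):
    """Return the original bytes behind a Latin-1-decoded environ value."""
    return value.encode("latin-1")


def redecode(value, encoding="utf-8"):
    """Re-interpret a Latin-1-decoded environ value in another encoding.

    Undecodable bytes become U+FFFD, which the caller can test for.
    """
    return recover_bytes(value).decode(encoding, "replace")


path_bytes = recover_bytes(environ["PATH_INFO"])
path_text = redecode(environ["PATH_INFO"])
```

An app that knows better than the server can call redecode() with its own encoding; one that does not care can keep using the environ value as it arrives.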
Robert Brewer fumanchu at aminus.org From fumanchu at aminus.org Thu Apr 2 01:07:14 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 1 Apr 2009 16:07:14 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <88e286470904011600j340bbec6m5d307ea552bb20ef@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com><88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com><4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> <88e286470904011600j340bbec6m5d307ea552bb20ef@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > Has anyone actually got an example of code for doing RFC-2047 > decoding. Are there even any systems which make use of that encoding > for web requests anyway. I still haven't really addressed that > decoding requirement and I haven't seen any existing Python web stuff > that tries to. http://www.cherrypy.org/browser/trunk/cherrypy/lib/http.py#L196 Currently, CP apps call that. We can move that to the server if desired. Robert Brewer fumanchu at aminus.org From graham.dumpleton at gmail.com Thu Apr 2 01:11:30 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 2 Apr 2009 10:11:30 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> Message-ID: <88e286470904011611r1cd264ej5d7fa5c3a7377a4c@mail.gmail.com> 2009/4/2 Robert Brewer : > Alan Kennedy wrote: >> Hi Graham, >> >> I think yours is a good solution to the problem. >> >> [Graham] >> > In other words, leave all the existing CGI variables to come through >> > as latin-1 decode >> >> As latin-1 or rfc-2047 decoded, to unicode. 
>> >> > and do anything new in 'wsgi' variable namespace, >> >> So the server provides >> >> "wsgi.server_decoded_SCRIPT_NAME" == u"whatever" >> "wsgi.server_decoded_PATH_INFO" == u"whatever" >> "wsgi.server_decode_charset" == u"utf-8" > > I think everyone at the sprint today acquiesced to having > SCRIPT_NAME/PATH_INFO/QUERY_STRING be set in the environ as unicode. The > server can decide (probably subject to configuration). I've implemented > this in the python3 branch of CherryPy and it seems to work brilliantly. > Assuming the server *is* configurable, deployers should be able to > choose Latin-1 if they need to recover the original bytes, without > having to support a separate set of encoded-byte entries. Seems to me that you can't have it be configurable and it must always be latin-1 interpretation. The problem is where you are composing multiple WSGI applications. If they each have different expectations or requirements as to how it is handled, aren't you going to have a problem. Or am I missing something in the way you are explaining it? Graham From alan at xhaus.com Thu Apr 2 01:15:34 2009 From: alan at xhaus.com (Alan Kennedy) Date: Wed, 1 Apr 2009 18:15:34 -0500 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <86217.1238608796@parc.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> Message-ID: <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> Hi Bill, [Bill] > I think the controlling reference here is RFC 3875. I think the controlling references are RFC 2616, RFC 2396 and RFC 3987. RFC 2616, the HTTP 1.1 spec, punts on the question of character encoding for the request URI. RFC 2396, the URI spec, says """ It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification. """ RFC 3987 is that spec, for Internationalized Resource Identifiers. 
It says """ An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). """ and """ 1.2. Applicability IRIs are designed to be compatible with recommendations for new URI schemes [RFC2718]. The compatibility is provided by specifying a well-defined and deterministic mapping from the IRI character sequence to the functionally equivalent URI character sequence. Practical use of IRIs (or IRI references) in place of URIs (or URI references) depends on the following conditions being met: """ followed by """ c. The URI corresponding to the IRI in question has to encode original characters into octets using UTF-8. For new URI schemes, this is recommended in [RFC2718]. It can apply to a whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN syntax [RFC2141]). It can apply to a specific part of a URI, such as the fragment identifier (e.g., [XPointer]). It can apply to a specific URI or part(s) thereof. For details, please see section 6.4. """ I think the question is "are people using IRIs in the wild"? If so, then we must decide how do we best deal with the problems of recognising iso-8859-1+rfc2047 versus utf-8, or whatever server-configured encoding the user has chosen. Alan. From fumanchu at aminus.org Thu Apr 2 01:22:03 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 1 Apr 2009 16:22:03 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <88e286470904011611r1cd264ej5d7fa5c3a7377a4c@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> <88e286470904011611r1cd264ej5d7fa5c3a7377a4c@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > 2009/4/2 Robert Brewer : > > Alan Kennedy wrote: > >> Hi Graham, > >> > >> I think yours is a good solution to the problem.
> >> > >> [Graham] > >> > In other words, leave all the existing CGI variables to come > through > >> > as latin-1 decode > >> > >> As latin-1 or rfc-2047 decoded, to unicode. > >> > >> > and do anything new in 'wsgi' variable namespace, > >> > >> So the server provides > >> > >> "wsgi.server_decoded_SCRIPT_NAME" == u"whatever" > >> "wsgi.server_decoded_PATH_INFO" == u"whatever" > >> "wsgi.server_decode_charset" == u"utf-8" > > > > I think everyone at the sprint today acquiesced to having > > SCRIPT_NAME/PATH_INFO/QUERY_STRING be set in the environ as unicode. > The > > server can decide (probably subject to configuration). I've > implemented > > this in the python3 branch of CherryPy and it seems to work > brilliantly. > > Assuming the server *is* configurable, deployers should be able to > > choose Latin-1 if they need to recover the original bytes, without > > having to support a separate set of encoded-byte entries. > > Seems to me that you can't have it be configurable and it must always > be latin-1 interpretation. The problem is where you are composing > multiple WSGI applications. If they each have different expectations > or requirements as to how it is handled, aren't you going to have a > problem. Or am I missing something in the way you are explaining it? I would not expect multiple middlewares to want to decode the same URI differently. But I would assume you'd run into problems when multiple URI's in the same site had different encodings. Mark Ramm gave the use case of exposing Unix filenames-as-bytes in URL's--the encoding is unknown but a human may know better. Allowing/forcing the human to stick that information in the app or in the server is the same work, IMO. A server could be configurable to the point of using different encodings for different URI's via regex matching or sections or some other means. I'd be happy with a spec that said, "servers MUST always decode these 3 entries, but SHOULD allow the encoding used to be configurable." 
I'd be equally happy with a spec that said, "servers MUST always decode these 3 as Latin-1" and explain why. Both have their manageable pros and cons. But delaying the decoding to the app by setting those 3 entries as bytes has more cons than pros. Robert Brewer fumanchu at aminus.org From graham.dumpleton at gmail.com Thu Apr 2 01:42:23 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 2 Apr 2009 10:42:23 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> <88e286470904011611r1cd264ej5d7fa5c3a7377a4c@mail.gmail.com> Message-ID: <88e286470904011642v138385a8tfe9889197fe69a3b@mail.gmail.com> 2009/4/2 Robert Brewer : > Graham Dumpleton wrote: >> 2009/4/2 Robert Brewer : >> > Alan Kennedy wrote: >> >> Hi Graham, >> >> >> >> I think yours is a good solution to the problem. >> >> >> >> [Graham] >> >> > In other words, leave all the existing CGI variables to come >> through >> >> > as latin-1 decode >> >> >> >> As latin-1 or rfc-2047 decoded, to unicode. >> >> >> >> > and do anything new in 'wsgi' variable namespace, >> >> >> >> So the server provides >> >> >> >> "wsgi.server_decoded_SCRIPT_NAME" == u"whatever" >> >> "wsgi.server_decoded_PATH_INFO" == u"whatever" >> >> "wsgi.server_decode_charset" == u"utf-8" >> > >> > I think everyone at the sprint today acquiesced to having >> > SCRIPT_NAME/PATH_INFO/QUERY_STRING be set in the environ as unicode. >> The >> > server can decide (probably subject to configuration). I've >> implemented >> > this in the python3 branch of CherryPy and it seems to work >> brilliantly. >> > Assuming the server *is* configurable, deployers should be able to >> > choose Latin-1 if they need to recover the original bytes, without >> > having to support a separate set of encoded-byte entries. 
>> >> Seems to me that you can't have it be configurable and it must always >> be latin-1 interpretation. The problem is where you are composing >> multiple WSGI applications. If they each have different expectations >> or requirements as to how it is handled, aren't you going to have a >> problem. Or am I missing something in the way you are explaining it? > > I would not expect multiple middlewares to want to decode the same URI > differently. I was not thinking about multiple middlewares, but multiple distinct WSGI applications (end consumer, not middleware) composited together by something like Paste cascade, Pylons configuration or even something like a routes based dispatcher. In the case of something like cascade they aren't necessarily on different URLs. For the latter they would be; even so, I just want to make sure that having different URLs with different encodings isn't going to be an issue in respect of mapping middleware. So long as code/config files are always UTF-8 encoded and capable of representing any possible decodings of URL, then probably okay. Graham From alan at xhaus.com Thu Apr 2 01:43:32 2009 From: alan at xhaus.com (Alan Kennedy) Date: Wed, 1 Apr 2009 18:43:32 -0500 Subject: [Web-SIG] WSGI Open Space @ PyCon. In-Reply-To: References: <4a951aa00903271330w48055728i582263fcf67687e5@mail.gmail.com> Message-ID: <4a951aa00904011643u54e667dy745f94b8a26191dc@mail.gmail.com> [Noah] > +1 on the iterator, although I might just like the idea and might be missing > something important. It seems like there are a lot of powerful things being > developed with generators in mind, and there are some nifty things you can > do with them like the contextlib example: > http://docs.python.org/library/contextlib.html#contextlib.closing Indeed, like coroutines.
http://www.python.org/dev/peps/pep-0342/ [Robert] >> The counter-argument was that >> servers could use non-blocking sockets to allow apps which read() to >> yield in the case of no immediate data rather than block indefinitely. Ah, but the problem with that is that one can't magically suspend methods like that and return control to the scheduler, without using coroutines or stackless. Who does the read() method return control to when there's no data available (i.e. no bytes on the socket)? If wsgi.input is a simple file-like object, then its methods must be coded to recognise, rather than blocking, when the data is not yet available to fulfill the application's expectation. How does it know how to return control to the scheduler, instead of the application? If the application expects to receive all of the data that it asked for with, say, a read(1024) call, it has to be prepared to accept that it may get fewer than 1024 bytes, in an asynchronous situation. What does it return to the application in the case when < 1024 bytes is available? >> If a file-like object were retained, it would help to publish a >> chainable file example to help middleware re-stream files they read any >> part of. I don't think that re-streaming of input should be a part of the spec; it's an application layer thing. We don't expect to re-stream the output of an application: why re-stream the input? If some application needs to examine the entire byte sequence for whatever reasons, that's a special case that can be catered for with itertools, and dedicated middleware. >> Continuing deferred issues >> * Lifecycle methods (start/stop/etc event API driven by the container) I'd really like to get this one nailed: java people and .net people expect this stuff. Alan. From pje at telecommunity.com Thu Apr 2 03:54:38 2009 From: pje at telecommunity.com (P.J. Eby) Date: Wed, 01 Apr 2009 21:54:38 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904011611r1cd264ej5d7fa5c3a7377a4c@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> <4a951aa00904011553g53cb8ca2pbe1f98869c7949cb@mail.gmail.com> <88e286470904011611r1cd264ej5d7fa5c3a7377a4c@mail.gmail.com> Message-ID: <20090402015213.DCADE3A40A7@sparrow.telecommunity.com> At 10:11 AM 4/2/2009 +1100, Graham Dumpleton wrote: >Seems to me that you can't have it be configurable and it must always >be latin-1 interpretation. The problem is where you are composing >multiple WSGI applications. If they each have different expectations >or requirements as to how it is handled, aren't you going to have a >problem. Agreed. Configuration and duplication are both evil in this context. From janssen at parc.com Thu Apr 2 04:00:53 2009 From: janssen at parc.com (Bill Janssen) Date: Wed, 1 Apr 2009 19:00:53 PDT Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> Message-ID: <91243.1238637653@parc.com> Alan Kennedy wrote: > Hi Bill, > > [Bill] > > I think the controlling reference here is RFC 3875. > > I think the controlling references are RFC 2616, RFC 2396 and RFC 3987. I see what you're saying, but it's darn near impossible, as a practical matter, to get any guidance on encoding matters from those. The question is where those names come from, and they come from CGI, and that is (practically speaking) defined these days by RFC 3875, as much as anything. > I think the question is "are people using IRIs in the wild"? If so, > then we must decide how do we best deal with the problems of > recognising iso-8859-1+rfc2047 versus utf-8, or whatever > server-configured encoding the user has chosen.
See http://bugs.python.org/issue3300, where we went around and around that question. The answer seems to be, yes. There are lots of useful fragments in that discussion, for instance: ``For the authority (server name) portion of a URI, RFC 3986 is pretty clear that UTF-8 must be used for non-ASCII values (assuming, for a moment, that IDNA addresses are not Punycode encoded already). For the path portion of URIs, a large-ish proportion of them are, indeed, UTF-8 encoded because that has been the de facto standard in Web browsers for a number of years now. For the query and fragment parts, however, the encoding is determined by context and often depends on the encoding of some page that contains the form from which the data is taken. Thus, a large number of URIs contain non-UTF-8 percent-encoded octets.'' Bill From graham.dumpleton at gmail.com Thu Apr 2 06:01:17 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 2 Apr 2009 15:01:17 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <91243.1238637653@parc.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> Message-ID: <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> 2009/4/2 Bill Janssen : > Alan Kennedy wrote: > >> Hi Bill, >> >> [Bill] >> > I think the controlling reference here is RFC 3875. >> >> I think the controlling references are RFC 2616, RFC 2396 and RFC 3987. > > I see what you're saying, but it's darn near impossible, as a practical > matter, to get any guidance on encoding matters from those. > > The question is where those names come from, and they come from CGI, and > that is (practically speaking) defined these days by RFC 3875, as much as > anything. > >> I think the question is "are people using IRIs in the wild"? 
If so, >> then we must decide how do we best deal with the problems of >> recognising iso-8859-1+rfc2047 versus utf-8, or whatever >> server-configured encoding the user has chosen. > > See http://bugs.python.org/issue3300, where we went around and around > that question. The answer seems to be, yes. > > There are lots of useful fragments in that discussion, for instance: > > ``For the authority (server name) portion of a URI, RFC 3986 is > pretty clear that UTF-8 must be used for non-ASCII values (assuming, for > a moment, that IDNA addresses are not Punycode encoded already). For > the path portion of URIs, a large-ish proportion of them are, indeed, > UTF-8 encoded because that has been the de facto standard in Web browsers > for a number of years now. For the query and fragment parts, however, > the encoding is determined by context and often depends on the encoding > of some page that contains the form from which the data is taken. Thus, > a large number of URIs contain non-UTF-8 percent-encoded octets.'' Reading that bug detail (very long) reminds me of another sticky issue that was brought up before, which is the Referer (request) and Location (response) headers. These being URLs means you have to deal with the issue of encoding in the URL within a header. Is there going to be any simple answer to all of this? :-( Graham From sh at defuze.org Thu Apr 2 08:56:59 2009 From: sh at defuze.org (Sylvain Hellegouarch) Date: Thu, 2 Apr 2009 08:56:59 +0200 (CEST) Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> Message-ID: <54714.193.253.216.132.1238655419.squirrel@mail1.webfaction.com> Hi All, > Is there going to be any simple answer to all of this? :-( > Would there be any interest in asking the HTTP-BIS working group [1] what they think about it? Currently I couldn't find anything in their drafts suggesting they had decided to clarify this issue from a protocol's perspective but they might consider it to be relevant to their goals. - Sylvain [1] http://www.ietf.org/html.charters/httpbis-charter.html -- Sylvain Hellegouarch http://www.defuze.org From alan at xhaus.com Thu Apr 2 13:19:34 2009 From: alan at xhaus.com (Alan Kennedy) Date: Thu, 2 Apr 2009 06:19:34 -0500 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <54714.193.253.216.132.1238655419.squirrel@mail1.webfaction.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <54714.193.253.216.132.1238655419.squirrel@mail1.webfaction.com> Message-ID: <4a951aa00904020419pe98287ds9443f3bb32c03f27@mail.gmail.com> [Sylvain] > Would there be any interest in asking the HTTP-BIS working group [1] what > they think about it? > > Currently I couldn't find anything in their drafts suggesting they had > decided to clarify this issue from a protocol's perspective but they might > consider it to be relevant to their goals. > > - Sylvain > > [1] http://www.ietf.org/html.charters/httpbis-charter.html I checked the current version of their replacement for RFC 2616. It says """ 2.1.3. 
URI Comparison When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs """ Which doesn't work if the two URIs to be compared are in different encodings. I did find this page on the W3C site which at least explains the issues, and does a survey of existing modern browsers for how they encode URIs and IRIs. http://www.w3.org/International/articles/idn-and-iri/ """ Paths The conversion process for parts of the IRI relating to the path is already supported natively in the latest versions of IE7, Firefox, Opera, Safari and Google Chrome. It works in Internet Explorer 6 if the option in Tools>Internet Options>Advanced>Always send URLs as UTF-8 is turned on. This means that links in HTML, or addresses typed into the browser's address bar will be correctly converted in those user agents. It doesn't work out of the box for Firefox 2 (although you may obtain results if the IRI and the resource name are in the same encoding), but technically-aware users can turn on an option to support this (set network.standard-url.encode-utf8 to true in about:config). Whether or not the resource is found on the server, however, is a different question. If the file system is in UTF-8, there should be no problem. If not, and no mechanism is available to convert addresses from UTF-8 to the appropriate encoding, the request will fail. Files are normally exposed as UTF-8 by servers such as IIS and Apache 2 on Windows and Mac OS X. Unix and Linux users can store file names in UTF-8, or use the mod_fileiri module mentioned earlier. Version 1 of the Apache server doesn't yet expose filenames as UTF-8. You can run a basic check whether it works for your client and resource using this simple test. Note that, while the basics may work, there are other somewhat more complicated aspects of IRI support, such as handling of bidirectional text in Arabic or Hebrew, which may need some additional time for full implementation. 
""" Alan. From graham.dumpleton at gmail.com Thu Apr 2 13:33:07 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 2 Apr 2009 22:33:07 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> Message-ID: <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> 2009/4/2 Graham Dumpleton : > Is there going to be any simple answer to all of this? :-( I am slowly working through what I think I at least need to do for Apache/mod_wsgi. I'll give a summary of what I have worked out so far based on the discussions and my own research. Just so I have a list of things to check off, I include an example WSGI environment from a request and make comments about each category of things from it. First off is CGI HTTP variables. HTTP_ACCEPT: 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' HTTP_ACCEPT_ENCODING: 'gzip, deflate' HTTP_ACCEPT_LANGUAGE: 'en-us' HTTP_CONNECTION: 'keep-alive' HTTP_HOST: 'home.dscpl.com.au' HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1' The rule here from WSGI 1.0 amendments page in relation to Python 3.0 is: """When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047)""" Which is fair enough and basically what the RFCs say. At the moment I don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just need to do that. An interesting one here to note is HTTP_HOST. The issue with this one is what would happen for a unicode host name. 
For Apache an IDNA (RFC3490) encoded host name has to be used to identify a site with unicode host name. That is, one uses the IDNA name for ServerName or ServerAlias directives. When one gets a request one would actually see the IDNA name for HTTP_HOST and that only uses latin-1 characters. For example: HTTP_HOST: 'xn--wgbe9chb01aytce.com' These resolve in DNS okay: $ nslookup xn--wgbe9chb01aytce.com Server: 192.168.1.254 Address: 192.168.1.254#53 Non-authoritative answer: Name: xn--wgbe9chb01aytce.com Address: 208.78.242.184 Using HTTP live headers on Firefox can also confirm that that is what would be sent: Host: xn--wgbe9chb01aytce.com My understanding is that if an actual unicode string is given to a browser, it should translate it to the IDNA name before use. Next HTTP header to worry about is HTTP_REFERER. There would be two parts to this, there would be the host name component and then the path component. We already know from above that for unicode host name it should be the IDNA name. For the path component, if the client follows the rules properly, then if the path uses a non latin-1 encoding, then it should be using RFC 2047 to indicate this so shouldn't have to do anything different and use same rule as other HTTP headers. For this header we are actually in a better situation than for the URL in the actual HTTP request line, which isn't so specific about encodings. GATEWAY_INTERFACE: 'CGI/1.1' SERVER_PROTOCOL: 'HTTP/1.1' Standard stuff which is always going to be latin-1, so encode as that. REMOTE_ADDR: '192.168.1.5' REMOTE_PORT: '51378' SERVER_PORT: '80' SERVER_ADDR: '192.168.1.5' Again, latin-1 is okay. SERVER_SOFTWARE: 'Apache/2.2.9 (Unix) mod_ssl/2.2.9 OpenSSL/0.9.7l DAV/2 mod_wsgi/3.0-TRUNK Python/2.5.1' Again, latin-1 is okay as Apache modules internally can only supply normal C strings to add stuff to this. SERVER_NAME: 'home.dscpl.com.au' Same as HTTP_HOST and if a unicode host name would be IDNA encoded, so can use latin-1 okay.
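As a rough illustration of the HTTP_HOST / SERVER_NAME point above, the following sketch uses Python's built-in "idna" codec (which implements the older IDNA 2003 rules of RFC 3490). The host name is an invented example, not one from this thread, and the variable names are purely illustrative.

```python
# Sketch only: relies on Python's built-in "idna" codec (RFC 3490 rules).
# "bücher.example" is a made-up host name used for illustration.
unicode_host = "bücher.example"

# What a client would be expected to put in the Host header: the ACE form.
ace_host = unicode_host.encode("idna")
print(ace_host)

# The ACE form is pure ASCII, so decoding it as latin-1 on the server
# side loses nothing and yields an ordinary str for the environ.
environ_value = ace_host.decode("latin-1")
print(environ_value)

# Only code that wants the human-readable name has to undo the IDNA step.
readable_host = environ_value.encode("ascii").decode("idna")
print(readable_host)
```

Since the ACE form never leaves ASCII, the choice between ASCII, latin-1 and UTF-8 makes no difference for this particular variable.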
SERVER_ADMIN: 'you at example.com' This is set by ServerAdmin directive. Because in Apache configuration is effectively latin-1, probably can't even define a non latin-1 email address. For host part, probably IDNA encoded anyway, so restriction on latin-1 only perhaps pertinent to user part of email address. So, latin-1 should be okay. SERVER_SIGNATURE: '' Depending on Apache configuration can be server name and version information or server admin email address. All latin-1. DOCUMENT_ROOT: '/Library/WebServer/Documents' SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi' These are file system paths, and since the Apache Runtime Library used for Apache 2.X has a define for whether file system supports unicode, can say: #if APR_HAS_UNICODE_FS charset = "UTF-8"; #else charset = "ISO-8859-1"; #endif For Apache 1.3, which doesn't have that define AFAIK, might just have to assume latin-1, but possibly another way of doing it, or Apache 1.3 might have its own define for it. PATH: '/usr/bin:/bin:/usr/sbin:/sbin' Presume I can use APR_HAS_UNICODE_FS check again even though it is a combination of paths. REQUEST_METHOD: 'GET' Presume they will always use latin-1 for these. All that is now left is the following, which we have already been discussing. REQUEST_URI: '/~grahamd/echo.wsgi' SCRIPT_NAME: '/~grahamd/echo.wsgi' PATH_INFO: '' QUERY_STRING: '' At least I am happy that except for these four, that there shouldn't be any issues. I'll keep watching what others come up with in respect of these and see what consensus develops. :-) Graham From alan at xhaus.com Thu Apr 2 14:49:08 2009 From: alan at xhaus.com (Alan Kennedy) Date: Thu, 2 Apr 2009 07:49:08 -0500 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. 
In-Reply-To: <54714.193.253.216.132.1238655419.squirrel@mail1.webfaction.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <54714.193.253.216.132.1238655419.squirrel@mail1.webfaction.com> Message-ID: <4a951aa00904020549q30c3f088ja00b27cc0b9dd4c8@mail.gmail.com> [Sylvain] > Would there be any interest in asking the HTTP-BIS working group [1] what > they think about it? > > Currently I couldn't find anything in their drafts suggesting they had > decided to clarify this issue from a protocol's perspective but they might > consider it to be relevant to their goals. > > - Sylvain > > [1] http://www.ietf.org/html.charters/httpbis-charter.html As mentioned in an earlier post, I think their current spec avoids the issue, by still relying on "octet-by-octet" comparison. But I did come across this discussion on their list, which goes into all of the issues in fine detail. http://www.nabble.com/PROPOSAL%3A-i74%3A-Encoding-for-non-ASCII-headers-tt16274487.html#a16291951 Quote of the thread [Roy Fielding] > We are simply passing through the one and only defined i18n solution > for HTTP/1.1 because it was the only solution available in 1994. > If email clients can (and do) implement it, then so can WWW clients. > > People who want to fix that should start queueing for HTTP/1.2. Alan. From foom at fuhm.net Thu Apr 2 17:45:45 2009 From: foom at fuhm.net (James Y Knight) Date: Thu, 2 Apr 2009 11:45:45 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. 
In-Reply-To: <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> Message-ID: On Apr 2, 2009, at 7:33 AM, Graham Dumpleton wrote: > """When running under Python 3, servers MUST provide CGI HTTP > variables as strings, decoded from the headers using HTTP standard > encodings (i.e. latin-1 + RFC 2047)""" > > Which is fair enough and basically what the RFCs say. At the moment I > don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just > need to do that. I'd really *really* like to recommend that any mention of RFC 2047 is stricken from the WSGI server requirements. I cannot imagine that decoding actually accomplishing anything other than opening security holes (think a filter in an upstream proxy that doesn't know how to do 2047-decoding passing something through that you now decode.) Also, you have to only do the decoding on TEXT words according to the spec, so the WSGI container now needs an HTTP header parser just in order to determine where it should decode RFC2047 words and where not to? I don't think so... James From tseaver at palladion.com Thu Apr 2 19:36:53 2009 From: tseaver at palladion.com (Tres Seaver) Date: Thu, 02 Apr 2009 13:36:53 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. 
In-Reply-To: <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Graham Dumpleton wrote: > 2009/4/2 Graham Dumpleton : >> Is there going to be any simple answer to all of this? :-( > > I am slowly working through what I think I at least need to do for > Apache/mod_wsgi. I'll give a summary of what I have worked out so far > based on the discussions and my own research. > > Just so I have a list of things to check off, I include an example > WSGI environment from a request and make comments about each category > of things from it. > > First off is CGI HTTP variables. > > HTTP_ACCEPT: 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' > HTTP_ACCEPT_ENCODING: 'gzip, deflate' > HTTP_ACCEPT_LANGUAGE: 'en-us' > HTTP_CONNECTION: 'keep-alive' > HTTP_HOST: 'home.dscpl.com.au' > HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; > en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 > Safari/525.27.1' > > The rule here from WSGI 1.0 amendments page in relation to Python 3.0 is: > > """When running under Python 3, servers MUST provide CGI HTTP > variables as strings, decoded from the headers using HTTP standard > encodings (i.e. latin-1 + RFC 2047)""" > > Which is fair enough and basically what the RFCs say. At the moment I > don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just > need to do that. > > An interesting one here to note is HTTP_HOST. The issue with this one > is what would happen for a unicode host name. 
For Apache an IDNA > (RFC3490) encoded host name has to be used to identify a site with > unicode host name. That is, one uses the IDNA name for ServerName or > ServerAlias directives. > > When one gets a request one would actually see the IDNA name for > HTTP_HOST and that only uses latin-1 characters. For example: > > HTTP_HOST: 'xn--wgbe9chb01aytce.com' > > These resolve in DNS okay: > > $ nslookup xn--wgbe9chb01aytce.com > Server: 192.168.1.254 > Address: 192.168.1.254#53 > > Non-authoritative answer: > Name: xn--wgbe9chb01aytce.com > Address: 208.78.242.184 > > Using HTTP live headers on Firefox can also confirm that that is what > would be sent: > > Host: xn--wgbe9chb01aytce.com > > My understanding is that if an actual unicode string is given to a > browser, it should translate it to the IDNA name before use. That is what the RFCs require; besides, un-encoded unicode can't be written onto a socket. > Next HTTP header to worry about is HTTP_REFERER. > > There would be two parts to this, there would be the host name > component and then the path component. > > We already know from above that for unicode host name it should be the > IDNA name. > > For the path component, if the client follows the rules properly, then > if the path uses a non latin-1 encoding, then it should be using RFC > 2047 to indicate this so shouldn't have to do anything different and > use same rule as other HTTP headers. For this header we are actually > in a better situation than for the URL in the actual HTTP request line, which > isn't so specific about encodings. > > GATEWAY_INTERFACE: 'CGI/1.1' > SERVER_PROTOCOL: 'HTTP/1.1' > > Standard stuff which is always going to be latin-1, so encode as that. I think you mean 'decode' here? Unicode strings are encoded to get bytes; bytes are decoded to get unicode strings. Also, I don't know of any reason why those values could be anything but ASCII.
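To make the encode/decode distinction concrete, and to show why a latin-1 decode lets an application recover the original octets, here is a minimal sketch. The helper name recover_bytes and the sample path are assumptions made for illustration; a real server would also have percent-decoded the path before this step.

```python
def recover_bytes(environ_value):
    """Undo a server-side latin-1 decode to get the original octets back.

    Hypothetical helper, not part of any WSGI server or spec.
    """
    return environ_value.encode("latin-1")


# Octets as a UTF-8 client would send them (after percent-decoding).
wire_bytes = "/café".encode("utf-8")

# bytes -> str is *decoding*; latin-1 maps every byte to one code point,
# so this step can never fail and never loses information.
environ_value = wire_bytes.decode("latin-1")
print(repr(environ_value))

# str -> bytes is *encoding*; the application reverses the server's step
# and then applies whichever charset it believes the client used.
original = recover_bytes(environ_value)
print(original.decode("utf-8"))
```

The intermediate str looks like mojibake, which is expected: it is only a lossless carrier for the bytes, not text meant for display.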
> REMOTE_ADDR: '192.168.1.5' > REMOTE_PORT: '51378' > SERVER_PORT: '80' > SERVER_ADDR: '192.168.1.5' > > Again, latin-1 is okay. Likewise, these can't be anything but ASCII. > SERVER_SOFTWARE: 'Apache/2.2.9 (Unix) mod_ssl/2.2.9 OpenSSL/0.9.7l > DAV/2 mod_wsgi/3.0-TRUNK Python/2.5.1' > > Again, latin-1 is okay as Apache modules internally can only supply > normal C strings to add stuff to this. > > SERVER_NAME: 'home.dscpl.com.au' > > Same as HTTP_HOST and if a unicode host name would be IDNA encoded, so > can use latin-1 okay. > > SERVER_ADMIN: 'you at example.com' > > This is set by ServerAdmin directive. Because in Apache configuration > is effectively latin-1, probably can't even define a non latin-1 email > address. For host part, probably IDNA encoded anyway, so restriction > on latin-1 only perhaps pertinent to user part of email address. So, > latin-1 should be okay. > > SERVER_SIGNATURE: '' > > Depending on Apache configuration can be server name and version > information or server admin email address. All latin-1. > > DOCUMENT_ROOT: '/Library/WebServer/Documents' > SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi' > > These are file system paths, and since the Apache Runtime Library used > for Apache 2.X has a define for whether file system supports unicode, > can say: > > #if APR_HAS_UNICODE_FS > charset = "UTF-8"; > #else > charset = "ISO-8859-1"; > #endif I'm not sure that works for arbitrary filesystem configurations: some parts of the tree may be mounted from locations with different encodings. See David Wheeler's analysis for more: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html > For Apache 1.3, which doesn't have that define AFAIK, might just have > to assume latin-1, but possibly another way of doing it, or Apache 1.3 > might have its own define for it. > > PATH: '/usr/bin:/bin:/usr/sbin:/sbin' > > Presume I can use APR_HAS_UNICODE_FS check again even though it is a > combination of paths. 
> > REQUEST_METHOD: 'GET' > > Presume they will always use latin-1 for these. RFC 2616, section 5.1.1 defines only ASCII methods; extension methods are 'tokens', which must also be printable ASCII w/o separators (section 2.2). > All that is now left is the following, which we have already been discussing. > > REQUEST_URI: '/~grahamd/echo.wsgi' > SCRIPT_NAME: '/~grahamd/echo.wsgi' > PATH_INFO: '' > QUERY_STRING: '' > > At least I am happy that except for these four, that there shouldn't > be any issues. > > I'll keep watching what others come up with in respect of these and > see what consensus develops. :-) Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJ1Pe1+gerLs4ltQ4RArt6AJ9GMmvjQd6LfH4MSC1yzNUTO6r51ACg3Ocl 3bOgMrQUlFy+ZSehv8gsSLM= =r4vt -----END PGP SIGNATURE----- From tseaver at palladion.com Thu Apr 2 19:40:38 2009 From: tseaver at palladion.com (Tres Seaver) Date: Thu, 02 Apr 2009 13:40:38 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> Message-ID: <49D4F896.5010302@palladion.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 James Y Knight wrote: > On Apr 2, 2009, at 7:33 AM, Graham Dumpleton wrote: > >> """When running under Python 3, servers MUST provide CGI HTTP >> variables as strings, decoded from the headers using HTTP standard >> encodings (i.e. latin-1 + RFC 2047)""" >> >> Which is fair enough and basically what the RFCs say. At the moment I
At the moment I >> don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just >> need to do that. > > I'd really *really* like to recommend that any mention of RFC 2047 is > stricken from the WSGI server requirements. I cannot imagine that > decoding actually accomplishing anything other than opening security > holes (think a filter in an upstream proxy that doesn't know how to do > 2047-decoding passing something through that you now decode.) > > Also, you have to only do the decoding on TEXT words according to the > spec, so the WSGI container now needs an HTTP header parser just in > order to determine where it should decode RFC2047 words and where not > to? I don't think so... Couldn't the spec mandate that decoding RFC 2047 headers is the responsibility of the non-middleware WSGI server? I agree that middleware and applications shouldn't know or care about that problem. Under Python 2.x, the server would transcode those values to the "common" encoding used for all values in the WSGI environment; under Python 3.x, it would just decode them to unicode. Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJ1PiW+gerLs4ltQ4RAhUmAJ94N6nC+Lh5qPX2Zrz2zAmZgZlnPgCfVZYU Z0xaYW6NwFJ35Xa11HRXuDw= =w/6q -----END PGP SIGNATURE----- From foom at fuhm.net Thu Apr 2 20:09:19 2009 From: foom at fuhm.net (James Y Knight) Date: Thu, 2 Apr 2009 14:09:19 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <49D4F896.5010302@palladion.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> <49D4F896.5010302@palladion.com> Message-ID: <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> On Apr 2, 2009, at 1:40 PM, Tres Seaver wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > James Y Knight wrote: >> On Apr 2, 2009, at 7:33 AM, Graham Dumpleton wrote: >> >>> """When running under Python 3, servers MUST provide CGI HTTP >>> variables as strings, decoded from the headers using HTTP standard >>> encodings (i.e. latin-1 + RFC 2047)""" >>> >>> Which is fair enough and basically what the RFCs say. At the >>> moment I >>> don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so >>> just >>> need to do that. >> >> I'd really *really* like to recommend that any mention of RFC 2047 is >> stricken from the WSGI server requirements. I cannot imagine that >> decoding actually accomplishing anything other than opening security >> holes (think a filter in an upstream proxy that doesn't know how to >> do >> 2047-decoding passing something through that you now decode.) >> >> Also, you have to only do the decoding on TEXT words according to the >> spec, so the WSGI container now needs an HTTP header parser just in >> order to determine where it should decode RFC2047 words and where not >> to? I don't think so... > > Couldn't the spec mandate that decoding RFC 2047 headers is the > responsibility of the non-middleware WSGI server? I agree that > middleware and applications shouldn't know or care about that > problem. > Under Python 2.x, the server would transcode those values to the > "common" encoding used for all values in the WSGI environment; under > Python 3.x, it would just decode them to unicode.
> I think you're saying you agree with exactly the opposite of what I meant. The server/gateway (aka apache mod_wsgi) *must not* be required to handle RFC2047 decoding. Only the application (or a header parsing library that the application uses) can possibly handle this properly. That's why I think it should not be mentioned at all in the WSGI requirements for the server. Furthermore, although they certainly can if they want, I'd recommend that no applications actually bother with doing such decoding, since RFC2047 words in http headers are essentially never used. james From tseaver at palladion.com Thu Apr 2 20:33:08 2009 From: tseaver at palladion.com (Tres Seaver) Date: Thu, 02 Apr 2009 14:33:08 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> <49D4F896.5010302@palladion.com> <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 James Y Knight wrote: > I think you're saying you agree with exactly the opposite of what I > meant. The server/gateway (aka apache mod_wsgi) *must not* be required > to handle RFC2047 decoding. Only the application (or a header parsing > library that the application uses) can possibly handle this properly. I don't understand why: if RFC2047 values are being passed as HTTP headers, then surely the server has enough information to decode them, and to ensure that they are re-encoded into the same encoding as all other WSGI environment variables (under Python 2.x).
Ensuring that the environment variables are uniformly encoded (or decoded to unicode, in Python 3.x) seems like it *must* be the server's responsibility: only the server can know how some values (e.g., those derived from filesystem paths, or its config file) are encoded. Moving that responsibility to the application just means that it won't be met, because the application won't have enough information to do the job. > That's why I think it should not be mentioned at all in the WSGI > requirements for the server. > > Furthermore, although they certainly can if they want, I'd recommend > that no applications actually bother with doing such decoding, since > RFC2047 words in http headers are essentially never used. In that case, it would be moot whether the server or the application does (not) do the decoding. Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJ1QTk+gerLs4ltQ4RApDgAJ4olI0e3Jh1diP9P6se5RR3mfFFIACaA05t n8UK1XWG2ibMTiqXEeGr6mw= =JNXk -----END PGP SIGNATURE----- From foom at fuhm.net Thu Apr 2 21:56:44 2009 From: foom at fuhm.net (James Y Knight) Date: Thu, 2 Apr 2009 15:56:44 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> <49D4F896.5010302@palladion.com> <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> Message-ID: On Apr 2, 2009, at 2:33 PM, Tres Seaver wrote: > I don't understand why: if RFC2047 values are being passed as HTTP > headers, then surely the server has enough information to decode them, > and to ensure that they are re-encoded into the same encoding as all > other WSGI environment variables (under Python 2.x). Just so long as the gateway server has an HTTP header parsing implementation and global knowledge of all HTTP headers, including private ones. Consider: FooBar: =?utf-8?q?some-text?= Should that be decoded with RFC2047 rules? Answer: it depends. Does the grammar for FooBar say that the contents is of type TEXT? Maybe it just *looks* like an encoded-word but is actually just a sequence of tokens and separators which have an entirely different meaning for that header. You simply can't tell without the grammar for the FooBar header. James From tseaver at palladion.com Thu Apr 2 23:08:54 2009 From: tseaver at palladion.com (Tres Seaver) Date: Thu, 02 Apr 2009 17:08:54 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> <49D4F896.5010302@palladion.com> <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 James Y Knight wrote: > On Apr 2, 2009, at 2:33 PM, Tres Seaver wrote: >> I don't understand why: if RFC2047 values are being passed as HTTP >> headers, then surely the server has enough information to decode them, >> and to ensure that they are re-encoded into the same encoding as all >> other WSGI environment variables (under Python 2.x). > > Just so long as the gateway server has an HTTP header parsing > implementation and global knowledge of all HTTP headers, including > private ones. > > Consider: > FooBar: =?utf-8?q?some-text?= > > Should that be decoded with RFC2047 rules? Answer: it depends. Does > the grammar for FooBar say that the contents is of type TEXT? Maybe it > just *looks* like an encoded-word but is actually just a sequence of > tokens and separators which have an entirely different meaning for > that header. You simply can't tell without the grammar for the FooBar > header. A couple of things: - - That header is not even allowed by the HTTP RFC's, AFAIK. "Custom" headers need the 'X-' prefix. - - I could imagine a server option which disabled decoding for a specific subset of custom headers, but can't imagine needing it in any real application. - - Leaving the WSGI environment a hodgepodge of differently-encoded junk makes *every* application have to deal with that stuff. Tres.
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJ1Slm+gerLs4ltQ4RAtkvAJ9SM8P9YmB/D3JleoY/0C7kVMl5MgCbBMCb +YavShebeoJU5Ijjc394LCQ= =BuI3 -----END PGP SIGNATURE----- From graham.dumpleton at gmail.com Fri Apr 3 00:27:21 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 3 Apr 2009 09:27:21 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> <49D4F896.5010302@palladion.com> <982981EC-02C1-4C85-AAC9-3CAA0D3721E9@fuhm.net> Message-ID: <88e286470904021527o2fec14d1sae956379cd2e057a@mail.gmail.com> 2009/4/3 James Y Knight : > > On Apr 2, 2009, at 1:40 PM, Tres Seaver wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> James Y Knight wrote: >>> >>> On Apr 2, 2009, at 7:33 AM, Graham Dumpleton wrote: >>> >>>> """When running under Python 3, servers MUST provide CGI HTTP >>>> variables as strings, decoded from the headers using HTTP standard >>>> encodings (i.e. latin-1 + RFC 2047)""" >>>> >>>> Which is fair enough and basically what the RFCs say. At the moment I >>>> don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just >>>> need to do that. >>> >>> I'd really *really* like to recommend that any mention of RFC 2047 is >>> stricken from the WSGI server requirements. 
I cannot imagine that >>> decoding actually accomplishing anything other than opening security >>> holes (think a filter in an upstream proxy that doesn't know how to do >>> 2047-decoding passing something through that you now decode.) >>> >>> Also, you have to only do the decoding on TEXT words according to the >>> spec, so the WSGI container now needs an HTTP header parser just in >>> order to determine where it should decode RFC2047 words and where not >>> to? I don't think so... >> >> Couldn't the spec mandate that decoding RFC 2047 headers is the >> responsibility of the non-middleware WSGI server? I agree that >> middleware and applications shouldn't know or care about that problem. >> Under Python 2.x, the server would transcode those values to the >> "common" encoding used for all values in the WSGI environment; under >> Python 3.x, it would just decode them to unicode. >> > > I think you're saying you agree with exactly the opposite of what I meant. > The server/gateway (aka apache mod_wsgi) *must not* be required to handle > RFC2047 decoding. Only the application (or a header parsing library that the > application uses) can possibly handle this properly. > > That's why I think it should not be mentioned at all in the WSGI > requirements for the server. > > Furthermore, although they certainly can if they want, I'd recommend that no > applications actually bother with doing such decoding, since RFC2047 words > in http headers are essentially never used. Having the WSGI adapter ignore it would be fine by me, as it then effectively mirrors the current behaviour of Python 2.X. That is, in Python 2.X the WSGI application would have to deal with them anyway. If RFC2047 comes into play in response headers as well, then also the WSGI application's responsibility there given that it should be returning bytes for response headers and so would therefore have had to apply such an encoding if necessary anyway.
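[Editorial sketch, not part of the original message.] To make the "application deals with it" option concrete, here is a minimal sketch under Python 3, assuming the server has already handed the application header values decoded as latin-1 strings. The names TEXT_HEADERS, HTTP_X_DISPLAY_NAME, decode_rfc2047 and text_header are hypothetical and invented for illustration; only email.header.decode_header is a real standard-library call. The application opts in per header, since only it can know which header grammars are TEXT.

```python
from email.errors import HeaderParseError
from email.header import decode_header

# Hypothetical: environ keys whose grammar this application knows to be TEXT.
TEXT_HEADERS = frozenset({"HTTP_X_DISPLAY_NAME"})


def decode_rfc2047(value):
    """Decode RFC 2047 encoded-words in a str; return it unchanged on failure."""
    try:
        parts = []
        for chunk, charset in decode_header(value):
            if isinstance(chunk, bytes):
                # Unlabelled byte chunks are treated as latin-1 (an assumption).
                parts.append(chunk.decode(charset or "latin-1", errors="replace"))
            else:
                # No encoded-word was found; decode_header returned the str as-is.
                parts.append(chunk)
    except (HeaderParseError, LookupError):
        # Malformed encoded-word or unknown charset: leave the value alone.
        return value
    return "".join(parts)


def text_header(environ, key):
    """Return environ[key], decoding encoded-words only for opted-in headers."""
    value = environ.get(key, "")
    return decode_rfc2047(value) if key in TEXT_HEADERS else value
```

With this shape, a value such as "=?utf-8?q?some-text?=" in a header the application has not listed is passed through untouched, which is the behaviour James argues for when the grammar is unknown.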
For WSGI 1.0 and Python 3.0 can therefore possibly maintain the status quo, or as close as possible, with Python 2.X behaviour. If we want to think about changing it, then address it in WSGI 2.0 where more significant changes are being made anyway. Better that than for WSGI 1.0 and Python 2.X and Python 3.0 having different requirements. Graham From graham.dumpleton at gmail.com Fri Apr 3 00:34:13 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 3 Apr 2009 09:34:13 +1100 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> Message-ID: <88e286470904021534h11986b4bj90a6da309a67530e@mail.gmail.com> 2009/4/3 Tres Seaver : >> DOCUMENT_ROOT: '/Library/WebServer/Documents' >> SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi' >> >> These are file system paths, and since the Apache Runtime Library used >> for Apache 2.X has a define for whether file system supports unicode, >> can say: >> >> #if APR_HAS_UNICODE_FS >> charset = "UTF-8"; >> #else >> charset = "ISO-8859-1"; >> #endif > > I'm not sure that works for arbitrary filesystem configurations: some > parts of the tree may be mounted from locations with different > encodings. See David Wheeler's analysis for more: > > http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html Yes, am aware that it isn't that simple. I can make that the default and, just as I have a configuration directive for case sensitivity in file systems: http://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGICaseSensitivity I can add one related to file system encoding. This would be similar to how some other Apache modules allow overriding it.
For example: http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxyftpdircharset Graham From fumanchu at aminus.org Fri Apr 3 03:49:49 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Thu, 2 Apr 2009 18:49:49 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <91243.1238637653@parc.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com><86217.1238608796@parc.com><4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> Message-ID: Bill Janssen wrote: > Alan Kennedy wrote: > > [Bill] > > > I think the controlling reference here is RFC 3875. > > > > I think the controlling references are RFC 2616, RFC 2396 > > and RFC 3987. > > I see what you're saying, but it's darn near impossible, as a practical > matter, to get any guidance on encoding matters from those. > > The question is where those names come from, and they come from CGI, > and that is (practically speaking) defined these days by RFC 3875, > as much as anything. If so, then PEP 333 really should be updated to point at a version of the CGI "spec" that doesn't reference e.g. RFC 1808 for URI's. As it is, one could easily come to the conclusion that, for example, path parameters like /path;a=3 aren't supported (because the CGI draft that PEP 333 mentions disallows them). I'd be much happier referring to 3875, and even happier diverging from strict compliance to what was always a shaky spec. Robert Brewer fumanchu at aminus.org From pje at telecommunity.com Fri Apr 3 17:16:13 2009 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 03 Apr 2009 11:16:13 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. 
In-Reply-To: <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.co m> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470904011351l4952262bt6fbf72ef0557ca16@mail.gmail.com> Message-ID: <20090403151347.B40613A40B0@sparrow.telecommunity.com> At 07:51 AM 4/2/2009 +1100, Graham Dumpleton wrote: >If we are going to carry values in two different formats, Let's try not to do that, either. ;-) From fumanchu at aminus.org Fri Apr 3 20:35:20 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Fri, 3 Apr 2009 11:35:20 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com><86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> Message-ID: Alan Kennedy wrote: > [Bill] > > I think the controlling reference here is RFC 3875. > > I think the controlling references are RFC 2616, RFC 2396 and RFC 3987. > > RFC 2616, the HTTP 1.1 spec, punts on the question of character > encoding for the request URI. > > RFC 2396, the URI spec, says > > """ > It is expected that a systematic treatment of character encoding > within URI will be developed as a future modification of this > specification. > """ > > RFC 3987 is that spec, for Internationalized Resource Identifiers. It > says > > """ > An IRI is a sequence of characters from the Universal Character Set > (Unicode/ISO 10646). > """ > > and > > """ > 1.2. Applicability > > IRIs are designed to be compatible with recommendations for new URI > schemes [RFC2718]. The compatibility is provided by specifying a > well-defined and deterministic mapping from the IRI character > sequence to the functionally equivalent URI character sequence. > Practical use of IRIs (or IRI references) in place of URIs (or URI > references) depends on the following conditions being met: > """ > > followed by > > """ > c. 
The URI corresponding to the IRI in question has to encode > original characters into octets using UTF-8. For new URI > schemes, this is recommended in [RFC2718]. It can apply to a > whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384], > or the URN syntax [RFC2141]). It can apply to a specific part > of a URI, such as the fragment identifier (e.g., [XPointer]). It > can apply to a specific URI or part(s) thereof. For details, > please see section 6.4. > """ > > I think the question is "are people using IRIs in the wild"? If so, > then we must decide how do we best deal with the problems of > recognising iso-8859-1+rfc2047 versus utf-8, or whatever > server-configured encoding the user has chosen.

Agreed. The Request-URI needs to handle IRI's. The headers mostly don't--almost all headers are mostly of type "token", which is US-ASCII. A few are of type "TEXT", which is ISO-8859-1/RFC 2047. The remaining (sub)values are mostly custom byte sequences:

field-name           field-value
----------           -----------
Accept               token
Accept-Charset       token
Accept-Encoding      token
Accept-Language      ALPHA, plus ":", "=", "q" etc
Accept-Ranges        token
Age                  DIGIT
Allow                token
Authorization        token
Cache-Control        token
Connection           token
Content-Encoding     token
Content-Language     ALPHA
Content-Length       DIGIT
Content-Location     absoluteURI | relativeURI
Content-MD5          base64 of 128 bit md5 digest
Content-Range        DIGIT, plus "/" etc
Content-Type         token
Date                 HTTP-date
ETag                 TEXT and CHAR
Expect               token, quoted-string
Expires              HTTP-date
From                 ASCII (see RFC 822)
Host                 host ":" port
If-Match             TEXT and CHAR
If-Modified-Since    HTTP-date
If-None-Match        TEXT and CHAR
If-Range             TEXT and CHAR | HTTP-date
If-Unmodified-Since  HTTP-date
Last-Modified        HTTP-date
Location             absoluteURI
Max-Forwards         DIGIT
Pragma               token, quoted-string
Proxy-Authenticate   token
Proxy-Authorization  token
Range                token
Referer              absoluteURI | relativeURI
Retry-After          HTTP-date | DIGIT
Server               token, TEXT
TE                   token
Trailer              token
Transfer-Encoding    token
Upgrade              token
User-Agent           token, TEXT
Vary                 token
Via                  token, host, port
Warning              quoted-string, HTTP-date, host, port
WWW-Authenticate     token

The Content-Location, Location, and Referer headers are problematic since HTTP borrows those from the URI spec, which deals in characters and not bytes, as you mentioned. Host, and maybe Via, are also special due to possible IDNA-encoding.

Regarding extension headers, I think we should assume that the HTTP/1.1 spec implies all headers should be token (ASCII) or TEXT (ISO-8859-1). From section 4.2:

    field-content  = <the OCTETs making up the field-value
                     and consisting of either *TEXT or combinations
                     of token, separators, and quoted-string>

In addition, the httpbis effort seems to be enforcing this even more strongly [1]:

    message-header = field-name ":" OWS [ field-value ] OWS
    field-name     = token
    field-value    = *( field-content / OWS )
    field-content  = *( WSP / VCHAR / obs-text )

    Historically, HTTP has allowed field-content with text in the ISO-8859-1 [ISO-8859-1] character encoding (allowing other character sets through use of [RFC2047] encoding). In practice, most HTTP header field-values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD constrain their field-values to US-ASCII characters. Recipients SHOULD treat other (obs-text) octets in field-content as opaque data.

So, from where I sit, we have:

1. Many header values which are ASCII.
2. A few header values which are ISO-8859-1 plus RFC 2047.
3. A few header values which are URI's (no specified encoding) or IRI's (UTF-8).

I understand the desire to decode ASAP, and I agree with Guido that we should use a default encoding which the app can override. Looking at the above, ISO-8859-1 is the best encoding I know of for all three header cases. ASCII can be used as a valid subset without transcoding; headers which are ISO-8859-1 are decoded perfectly; URI/IRI headers can be transcoded by the app if needed, but mangled opaquely by middleware. If we make *that* call, then IMO there's no reason not to do the same to SCRIPT_NAME, PATH_INFO, and QUERY_STRING.
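[Editorial sketch, not part of the original message.] The "decode as ISO-8859-1 by default, let the app transcode" idea above works because latin-1 maps every byte value to exactly one code point, so the original octets can always be recovered. A minimal sketch, assuming Python 3; server_decode and transcode are hypothetical helper names, not part of any WSGI server or specification:

```python
def server_decode(raw):
    """What a server could do: decode raw header/URI bytes as ISO-8859-1."""
    return raw.decode("latin-1")


def transcode(value, encoding="utf-8"):
    """What an app could do: recover the octets and decode them as it sees fit."""
    return value.encode("latin-1").decode(encoding)


# A path that the client actually sent as UTF-8 octets (already percent-unquoted).
raw_path = b"/caf\xc3\xa9"
environ = {"PATH_INFO": server_decode(raw_path)}

# Middleware that treats the value as opaque text loses nothing...
assert environ["PATH_INFO"].encode("latin-1") == raw_path
# ...and an application that knows the real encoding can still get at it.
path = transcode(environ["PATH_INFO"], "utf-8")
```

The trade-off is that the intermediate string is not meaningful text when the octets were not really ISO-8859-1; it is only a lossless carrier for them.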
Robert Brewer fumanchu at aminus.org [1] http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p1-messaging-06.txt From fumanchu at aminus.org Fri Apr 3 20:43:32 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Fri, 3 Apr 2009 11:43:32 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com><86217.1238608796@parc.com><4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com><91243.1238637653@parc.com><88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > I am slowly working through what I think I at least need to do for > Apache/mod_wsgi. I'll give a summary of what I have worked out so far > based on the discussions and my own research. > ... > Next HTTP header to worry about is HTTP_REFERER. > > There would be two parts to this, there would be the host name > component and then the path component. > > We already know from above that for unicode host name it should be the > IDNA name. > > For the path component, if the client follows the rules properly, then > if the path uses a non latin-1 encoding, then it should be using RFC > 2047 to indicate this so shouldn't have to do anything different and > use same rule as other HTTP headers. For this header we are actually > in a better situation than for URL in actual HTTP request line which > isn't so specific about encodings. I don't think that's true. Referer must be absoluteURI or relativeURI, neither of which have defined encodings. RFC 2047 only applies to headers of type TEXT, of which there are surprisingly few. Robert Brewer fumanchu at aminus.org From fumanchu at aminus.org Fri Apr 3 20:46:04 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Fri, 3 Apr 2009 11:46:04 -0700 Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com><86217.1238608796@parc.com><4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com><91243.1238637653@parc.com><88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com><88e286470904020433l5da48074i8918bddb6f0d67@mail.gmail.com> Message-ID: James Y Knight wrote: > On Apr 2, 2009, at 7:33 AM, Graham Dumpleton wrote: > > > """When running under Python 3, servers MUST provide CGI HTTP > > variables as strings, decoded from the headers using HTTP standard > > encodings (i.e. latin-1 + RFC 2047)""" > > > > Which is fair enough and basically what the RFCs say. At the moment I > > don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just > > need to do that. > > I'd really *really* like to recommend that any mention of RFC 2047 is > stricken from the WSGI server requirements. I cannot imagine that > decoding actually accomplishing anything other than opening security > holes (think a filter in an upstream proxy that doesn't know how to do > 2047-decoding passing something through that you now decode.) > > Also, you have to only do the decoding on TEXT words according to the > spec, so the WSGI container now needs an HTTP header parser just in > order to determine where it should decode RFC2047 words and where not > to? I don't think so... Something needs to decode RFC2047 words, at least until http-bis is widespread. I'd be OK with making the app do it as needed (since only it might know whether extension headers are token/quoted-string/TEXT). Robert Brewer fumanchu at aminus.org From deron.meranda at gmail.com Fri Apr 3 23:22:11 2009 From: deron.meranda at gmail.com (Deron Meranda) Date: Fri, 3 Apr 2009 17:22:11 -0400 Subject: [Web-SIG] Python 3.0 and WSGI 1.0. 
In-Reply-To: References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> Message-ID: <5c06fa770904031422h2d164081id28651d49230ff3c@mail.gmail.com> > ... The Request-URI needs to handle IRI's. The headers mostly > don't--almost all headers are mostly of type "token", which is US-ASCII. > A few are of type "TEXT", which is ISO-8859-1/RFC 2047. The remaining > (sub)values are mostly custom byte sequences: ... Also don't forget about the still-in-draft Link header that is getting a lot of attention currently (especially as it relates to resource discovery). http://tools.ietf.org/id/draft-nottingham-http-link-header-04.txt It includes IRIs, along with some other information. -- Deron Meranda From randy at rcs-comp.com Sat Apr 4 22:08:27 2009 From: randy at rcs-comp.com (Randy Syring) Date: Sat, 04 Apr 2009 16:08:27 -0400 Subject: [Web-SIG] Reverse Proxy & HTTPS Message-ID: <49D7BE3B.5040709@rcs-comp.com> I have a Python application that I want to run with the CherryPy WSGI Server. My intention is to let the CherryPy server run on a non-standard port (say 9001) and then let IIS (yes, I know what you are thinking, but that is what I have to work with) reverse proxy the website requests to CherryPy. However, I am wondering how I should handle HTTPS. Currently, there are only a few pages in my app that need HTTPS. When running the app natively in IIS, if one of those pages is requested using HTTP, I will issue a HTTP header redirect to the HTTPS page. How should I handle this in a reverse proxy situation? What I mean is, how do I detect in my Python app if the original request to IIS is using SSL? I don't want to have to run SSL on the connection from IIS to CherryPy. I am thinking I could modify the headers to the CherryPy server adding something like "X-is-ssl" and then use middleware on the python side to set wsgi.url_scheme appropriately.
I just don't know the HTTP standard well enough to know how this kind of thing should be handled. Thank you for any help you can provide. -- -------------------------------------- Randy Syring RCS Computers & Web Solutions 502-644-4776 http://www.rcs-comp.com "Whether, then, you eat or drink or whatever you do, do all to the glory of God." 1 Cor 10:31 From cs at zip.com.au Sun Apr 5 01:55:06 2009 From: cs at zip.com.au (Cameron Simpson) Date: Sun, 5 Apr 2009 09:55:06 +1000 Subject: [Web-SIG] Reverse Proxy & HTTPS In-Reply-To: <49D7BE3B.5040709@rcs-comp.com> Message-ID: <20090404235506.GA23458@cskk.homeip.net> On 04Apr2009 16:08, Randy Syring wrote: > I have a Python application that I want to run with the CherryPy WSGI > Server. My intention is to let the CherryPy server run on a non > standard port (say 9001) and then let IIS (yes, I know what you are > thinking, but that is what I have to work with) reverse proxy the > website requests to CherryPy. > > However, I am wondering how I should handle HTTPS. Currently, there are > only a few pages in my app that need HTTPS. When running the app > natively in IIS, if one of those pages is requested using HTTP, I will > issue a HTTP header redirect to the HTTPS page. How should I handle > this in a reverse proxy situation? What I mean is, how do I detect in > my Python app if the original request to IIS is using SSL? I don't want > to have to run SSL on the connection from IIS to CherryPy. > > I am thinking I could modify the headers to the CherryPy server adding > something like "X-is-ssl" and then use middleware on the python side to > set wsgi.url_scheme appropriately. I just don't know the HTTP standard > well enough to know how this kind of thing should be handled. How tightly knit is the IIS i.e. do you have control over it? Maybe this rewrite thing should be set up in IIS instead, it seems the more obvious place for such control except that the rewrite config would no longer be "part of the app". 
At least the IIS server should know if it's http or https. Or are you wanting to make your CherryPy app robust against http misuse? Disclaimer: I know close to nothing about IIS; this is just how I'd be approaching it with an Apache reverse proxy front end. Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ From sh at defuze.org Mon Apr 6 14:11:30 2009 From: sh at defuze.org (Sylvain Hellegouarch) Date: Mon, 6 Apr 2009 14:11:30 +0200 (CEST) Subject: [Web-SIG] Python 3.0 and WSGI 1.0. In-Reply-To: <4a951aa00904020419pe98287ds9443f3bb32c03f27@mail.gmail.com> References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com> <91243.1238637653@parc.com> <88e286470904012101q6a4b42fbwc1694594361cc3ea@mail.gmail.com> <54714.193.253.216.132.1238655419.squirrel@mail1.webfaction.com> <4a951aa00904020419pe98287ds9443f3bb32c03f27@mail.gmail.com> Message-ID: <62303.193.253.216.132.1239019890.squirrel@mail1.webfaction.com> Probably of interest in regards to this discussion: http://lists.w3.org/Archives/Public/ietf-http-wg/2009AprJun/0057.html http://trac.tools.ietf.org/wg/httpbis/trac/ticket/63 This applies to headers but probably shows that RFC 2047 is gradually being ruled out of HTTP. - Sylvain -- Sylvain Hellegouarch http://www.defuze.org From randy at rcs-comp.com Mon Apr 6 18:24:42 2009 From: randy at rcs-comp.com (Randy Syring) Date: Mon, 06 Apr 2009 12:24:42 -0400 Subject: [Web-SIG] Reverse Proxy & HTTPS In-Reply-To: <20090404235506.GA23458@cskk.homeip.net> References: <20090404235506.GA23458@cskk.homeip.net> Message-ID: <49DA2CCA.3030402@rcs-comp.com> Cameron Simpson wrote: > On 04Apr2009 16:08, Randy Syring wrote: > > How tightly knit is the IIS i.e. do you have control over it? Maybe this > rewrite thing should be set up in IIS instead, it seems the more obvious > place for such control except that the rewrite config would no longer > be "part of the app".
At least the IIS server should know if it's http > or https. Or are you wanting to make your CherryPy app robust against > http misuse? > > Disclaimer: I know close to nothing about IIS; this is just how I'd be > approaching it with an Apache reverse proxy from end. > > Cheers, > Cameron, Thanks for your reply. Let me start out by saying that I don't think this is an IIS issue, its just that IIS is the front-end web server that is proxying the HTTP requests through to the CherryPy server. If I was to choose to run a similar setup on a Linux box with Apache, I still think I would have the same question (feel free to correct me if I am wrong). I would like my application to have control over the HTTPS<->HTTP redirects and would rather not force that logic into the forward facing web server if at all possible. That just seems like an extra configuration step that wouldn't necessarily be needed if I could figure out how to pass SSL status from the forward facing web server to the backend proxy (i.e. CherryPy and my app). So, do you (or anyone else) know of a good way to to this? Or, does everyone just assume that it is all or nothing for SSL when you are proxying to a backend? Thank you. -------------------------------------- Randy Syring RCS Computers & Web Solutions 502-644-4776 http://www.rcs-comp.com "Whether, then, you eat or drink or whatever you do, do all to the glory of God." 
1 Cor 10:31

From pstradomski at gmail.com Mon Apr 6 18:32:18 2009
From: pstradomski at gmail.com (Paweł Stradomski)
Date: Mon, 6 Apr 2009 18:32:18 +0200
Subject: [Web-SIG] Reverse Proxy & HTTPS
In-Reply-To: <49DA2CCA.3030402@rcs-comp.com>
References: <20090404235506.GA23458@cskk.homeip.net>
 <49DA2CCA.3030402@rcs-comp.com>
Message-ID: <200904061832.18551.pstradomski@gmail.com>

In a message dated Monday, 6 April 2009, Randy Syring wrote:
> I would like my application to have control over the HTTPS<->HTTP
> redirects and would rather not force that logic into the forward facing
> web server if at all possible. That just seems like an extra
> configuration step that wouldn't necessarily be needed if I could figure
> out how to pass SSL status from the forward facing web server to the
> backend proxy (i.e. CherryPy and my app).
>
> So, do you (or anyone else) know of a good way to do this? Or does
> everyone just assume that it is all or nothing for SSL when you are
> proxying to a backend?
>

Check the IIS manual; it should be possible to set some nonstandard
header when the connection goes through SSL, and then check this header
in your application. Maybe that header is already there: write a simple
controller that prints all the headers from the request and check how it
looks with and without SSL (but verify with the IIS manual anyway).

--
Paweł Stradomski

From graham.dumpleton at gmail.com Tue Apr 7 00:35:05 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 7 Apr 2009 08:35:05 +1000
Subject: [Web-SIG] Reverse Proxy & HTTPS
In-Reply-To: <200904061832.18551.pstradomski@gmail.com>
References: <20090404235506.GA23458@cskk.homeip.net>
 <49DA2CCA.3030402@rcs-comp.com>
 <200904061832.18551.pstradomski@gmail.com>
Message-ID: <88e286470904061535h24ceb4f1yd28ca26567349a78@mail.gmail.com>

Using nginx as front end to Apache/mod_wsgi as an example:

On the nginx side you would use:

    proxy_set_header X-Url-Scheme $scheme;

and on the Apache/mod_wsgi side, with Django 1.0 as an example, in the
WSGI script file we would have:

    import os, sys
    sys.path.append('/usr/local/django')

    os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

    import django.core.handlers.wsgi

    _application = django.core.handlers.wsgi.WSGIHandler()

    def application(environ, start_response):
        # Use the scheme reported by the front end in X-Url-Scheme,
        # falling back to plain http when the header is absent.
        environ['wsgi.url_scheme'] = environ.get('HTTP_X_URL_SCHEME', 'http')
        return _application(environ, start_response)

It is the equivalent of this on the IIS side, as others have mentioned,
that you need.

Graham

From randy at rcs-comp.com Tue Apr 7 01:01:27 2009
From: randy at rcs-comp.com (Randy Syring)
Date: Mon, 06 Apr 2009 19:01:27 -0400
Subject: [Web-SIG] Reverse Proxy & HTTPS
In-Reply-To: <88e286470904061535h24ceb4f1yd28ca26567349a78@mail.gmail.com>
References: <20090404235506.GA23458@cskk.homeip.net>
 <49DA2CCA.3030402@rcs-comp.com>
 <200904061832.18551.pstradomski@gmail.com>
 <88e286470904061535h24ceb4f1yd28ca26567349a78@mail.gmail.com>
Message-ID: <49DA89C7.2080809@rcs-comp.com>

Graham,

Excellent, thank you! That confirms for me that the concept is correct;
now all I have to do is work on an IIS implementation. FUN!

--------------------------------------
Randy Syring
RCS Computers & Web Solutions
502-644-4776
http://www.rcs-comp.com

"Whether, then, you eat or drink or
whatever you do, do all to the glory
of God."
1 Cor 10:31

From ianb at colorstudy.com Tue Apr 7 01:32:20 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 6 Apr 2009 18:32:20 -0500
Subject: [Web-SIG] Reverse Proxy & HTTPS
In-Reply-To: <49DA89C7.2080809@rcs-comp.com>
References: <20090404235506.GA23458@cskk.homeip.net>
 <49DA2CCA.3030402@rcs-comp.com>
 <200904061832.18551.pstradomski@gmail.com>
 <88e286470904061535h24ceb4f1yd28ca26567349a78@mail.gmail.com>
 <49DA89C7.2080809@rcs-comp.com>
Message-ID:

A last note: paste.deploy.config.PrefixMiddleware does some fixup for
cases like this, including looking at X-Forwarded-Scheme and
X-Forwarded-Proto for the protocol (both names, because there's nothing
approaching consensus on what to name these headers).

2009/4/6 Randy Syring:
> Graham,
>
> Excellent, thank you! That confirms for me the concept is correct, now
> all I have to do is work on an IIS implementation. FUN!

--
Ian Bicking | http://blog.ianbicking.org

From arw1961 at yahoo.com Tue Apr 7 15:13:39 2009
From: arw1961 at yahoo.com (Aaron Watters)
Date: Tue, 7 Apr 2009 06:13:39 -0700 (PDT)
Subject: [Web-SIG] Please look at WHIFF -- WSGI/HTTP INTEGRATED FILESYSTEM
 FRAMES
Message-ID: <882586.14227.qm@web32003.mail.mud.yahoo.com>

Hi folks,

I tried this announcement on some easy going lists yesterday and no one
has taken me to the woodshed yet, so I thought I'd have a go at a tougher
crowd.

I'm releasing a WSGI component suite called WHIFF and I'd just love it if
you folks would have a look and comment/suggest/criticize/complain. If
you'd like to try it out -- even better. Please go to

http://whiff.sourceforge.net

or use one of the links in the announcement below.

Thanks -- Aaron Watters

=== THIS .SIG IS INTENTIONALLY LEFT BLANK ===

WHIFF -- WSGI/HTTP INTEGRATED FILESYSTEM FRAMES

WHIFF is an infrastructure for easily building complex Python/WSGI Web
applications by combining smaller and simpler WSGI components organized
within file system trees.

To DOWNLOAD WHIFF go to the WHIFF project information page at
http://sourceforge.net/projects/whiff and follow the download
instructions.

To GET THE LATEST WHIFF clone the WHIFF Mercurial repository located at
http://aaron.oirt.rutgers.edu/cgi-bin/whiffRepo.cgi.

To READ ABOUT WHIFF view the WHIFF documentation at
http://aaron.oirt.rutgers.edu/myapp/docs/W.intro.

To PLAY WITH WHIFF try the demos listed in the demos page at
http://aaron.oirt.rutgers.edu/myapp/docs/W1300.testAndDemo.

Why WHIFF?
==========

WHIFF (WSGI HTTP Integrated Filesystem Frames) is intended to make it
easier to create, deploy, and maintain large and complex Python based
WSGI Web applications. I created WHIFF to address complexity issues I
encounter when creating and fixing sophisticated Web applications which
include complex database interactions and dynamic features such as AJAX
(Asynchronous JavaScript and XML).

The primary tools which reduce complexity are an infrastructure for
managing web application name spaces, a configuration template language
for wiring named components into an application, and an application
programming interface for accessing named components from Python and
JavaScript modules.

All supporting conventions and tools offered by WHIFF are optional. WHIFF
is designed to work well with other modules conformant to the WSGI (Web
Server Gateway Interface) standard. Developers and designers are free to
use those WHIFF tools that work for them and ignore or replace the
others.

WHIFF does not provide a "packaged cake mix" for baking a web
application. Instead WHIFF is designed to provide a set of ingredients
which can be easily combined to make web applications (with no need to
refine your own sugar or mill your own wheat).

I hope you like it. -- Aaron Watters

From brian at briansmith.org Wed Apr 8 05:30:55 2009
From: brian at briansmith.org (Brian Smith)
Date: Tue, 7 Apr 2009 22:30:55 -0500
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
Message-ID: <00cc01c9b7fa$6dd82a80$49887f80$@org>

Here is the change that removes the use of RFC 2047 from HTTP in HTTPbis.

-----Original Message-----
From: ietf-http-wg-request at w3.org
[mailto:ietf-http-wg-request at w3.org] On Behalf Of Mark Nottingham
Sent: Monday, April 06, 2009 5:00
To: HTTP Working Group
Subject: Closing #63: RFC2047 encoded words

The editors believe that issue #63 has been addressed by the changes in
the -06 drafts. Specifically, RFC2047 encoding is no longer suggested as
the default encoding for non-ASCII characters; rather, it is left up to
specific header definitions to specify.

From fumanchu at aminus.org Wed Apr 8 18:57:43 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Wed, 8 Apr 2009 09:57:43 -0700
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
In-Reply-To: <00cc01c9b7fa$6dd82a80$49887f80$@org>
References: <00cc01c9b7fa$6dd82a80$49887f80$@org>
Message-ID:

Brian Smith wrote:
> Here is the change that removes the use of RFC 2047 from HTTP in
> HTTPbis.

Yes, but parsers need to continue decoding them for many years to come.
IMO WSGI origin servers should do this so we can write the decoding
logic once and forget about it (assuming middleware and apps far
outnumber origin servers).

Robert Brewer
fumanchu at aminus.org

From foom at fuhm.net Wed Apr 8 20:14:10 2009
From: foom at fuhm.net (James Y Knight)
Date: Wed, 8 Apr 2009 14:14:10 -0400
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
In-Reply-To:
References: <00cc01c9b7fa$6dd82a80$49887f80$@org>
Message-ID: <8389CCA8-8ABD-49A0-AEB8-11F26083DBA5@fuhm.net>

On Apr 8, 2009, at 12:57 PM, Robert Brewer wrote:
> Yes, but parsers need to continue decoding them for many years to
> come.
> IMO WSGI origin servers should do this so we can write the decoding
> logic once and forget about it (assuming middleware and apps far
> outnumber origin servers).

Decoding RFC 2047 quoted words is rather trivial compared to correctly
parsing all the HTTP headers. Plus, as I said before, you can't even
*do* the RFC2047 decoding without parsing the headers at the same time
to figure out which pieces need to be decoded! And furthermore, nobody
needs to "continue" decoding them for years to come, *because nobody
decodes them now*!

WSGI is intentionally exposing a fairly low-level view of the world. So
my opinion is that the headers in the dict should be byte strings and
that anyone who wants decoded headers also probably really wants (or
ought to want!) parsed headers, and thus should be using an http header
parsing library. That can expose values as unicode strings if it wants
to.

If you want to start a discussion about having a standard parsed-header
object in WSGI, that's another thing, but saying that WSGI servers should
*partially* decode the headers seems rather silly to me.

James

From brian at briansmith.org Wed Apr 8 20:20:28 2009
From: brian at briansmith.org (Brian Smith)
Date: Wed, 8 Apr 2009 13:20:28 -0500
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
Message-ID: <000801c9b876$b7423130$25c69390$@org>

Robert Brewer wrote:
> Brian Smith wrote:
> > Here is the change that removes the use of RFC 2047 from HTTP in
> > HTTPbis.
>
> Yes, but parsers need to continue decoding them for many years to come.
> IMO WSGI origin servers should do this so we can write the decoding
> logic once and forget about it (assuming middleware and apps far
> outnumber origin servers).

No, it really is better for WSGI implementations to completely avoid RFC
2047. None of the HTTP specifications ever specified how RFC 2047 was to
be used in HTTP. RFC 2616 vaguely suggested the use of RFC 2047 but it
was never integrated into any part of the grammar.

In the long discussions on this topic in the HTTP working group, nobody
ever presented a real-life example where RFC 2047 encoding has actually
been used.
The hypothetical examples that were presented in the discussion were
found to violate RFC 2047 and/or other parts of the HTTP specification.
Nobody ever presented an example (even hypothetical) using RFC 2047
encoding that the working group agreed was valid.

From ianb at colorstudy.com Wed Apr 8 21:01:39 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Wed, 8 Apr 2009 14:01:39 -0500
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
In-Reply-To: <8389CCA8-8ABD-49A0-AEB8-11F26083DBA5@fuhm.net>
References: <00cc01c9b7fa$6dd82a80$49887f80$@org>
 <8389CCA8-8ABD-49A0-AEB8-11F26083DBA5@fuhm.net>
Message-ID:

On Wed, Apr 8, 2009 at 1:14 PM, James Y Knight wrote:
> If you want to start a discussion about having a standard parsed-header
> object in WSGI, that's another thing,

Off topic to this discussion, but that's what WebOb is. It also largely
handles the encoding issues, abstracts away the awkwardness of the WSGI
call signature, and also does header parsing.

--
Ian Bicking | http://blog.ianbicking.org

From alan at xhaus.com Thu Apr 9 01:58:54 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Thu, 9 Apr 2009 00:58:54 +0100
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
In-Reply-To: <8389CCA8-8ABD-49A0-AEB8-11F26083DBA5@fuhm.net>
References: <00cc01c9b7fa$6dd82a80$49887f80$@org>
 <8389CCA8-8ABD-49A0-AEB8-11F26083DBA5@fuhm.net>
Message-ID: <4a951aa00904081658v66892850wce13e5a8093b38c6@mail.gmail.com>

[James]
> If you want to start a discussion about having a standard parsed-header
> object in WSGI, that's another thing, but saying that WSGI servers
> should *partially* decode the headers seems rather silly to me.

Hi James,

It's a shame that your proposal to add the twisted header parsing
library to the standard library didn't catch on years ago.

http://mail.python.org/pipermail/web-sig/2006-February/002119.html

Alan.

From alan at xhaus.com Thu Apr 9 01:59:11 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Thu, 9 Apr 2009 00:59:11 +0100
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
In-Reply-To: <00cc01c9b7fa$6dd82a80$49887f80$@org>
References: <00cc01c9b7fa$6dd82a80$49887f80$@org>
Message-ID: <4a951aa00904081659k31e6464co7480474fd62d30ab@mail.gmail.com>

[Brian]
> Here is the change that removes the use of RFC 2047 from HTTP in
> HTTPbis.

Grand so; all we need to do is to wait for everyone to stop using
HTTP/1.1, start using HTTP/bis, and our problems are at an end! ;-)

Alan.

From brian at briansmith.org Thu Apr 9 04:36:22 2009
From: brian at briansmith.org (Brian Smith)
Date: Wed, 8 Apr 2009 21:36:22 -0500
Subject: [Web-SIG] FW: Closing #63: RFC2047 encoded words
In-Reply-To: <4a951aa00904081659k31e6464co7480474fd62d30ab@mail.gmail.com>
References: <00cc01c9b7fa$6dd82a80$49887f80$@org>
 <4a951aa00904081659k31e6464co7480474fd62d30ab@mail.gmail.com>
Message-ID: <001e01c9b8bc$01bdfb00$0539f100$@org>

Alan Kennedy wrote:
> [Brian]
> > Here is the change that removes the use of RFC 2047 from HTTP in
> > HTTPbis.
>
> Grand so; all we need to do is to wait for everyone to stop using
> HTTP/1.1, start using HTTP/bis, and our problems are at an end!

HTTPbis *is* (will be) HTTP/1.1. It doesn't define a new version of the
protocol. RFC 2616 has many mistakes that make it a poor description of
HTTP/1.1 and the purpose of HTTPbis is to fix those mistakes. That is a
little bit of an over-simplification.

Try to create a RFC2616-compliant message that uses RFC 2047 encoding.
It can't be done because RFC 2047 was never integrated into the RFC 2616
grammar. That is why HTTPbis removed the vague reference to RFC 2047
from the prose. If RFC 2616 provided a way of using RFC 2047 in HTTP
messages then HTTPbis would still allow it but recommend that
implementations SHOULD NOT use it (similar to how line-folding is
deprecated but still allowed in HTTPbis).

- Brian

From pfein at pobox.com Fri Apr 10 23:12:51 2009
From: pfein at pobox.com (Pete)
Date: Fri, 10 Apr 2009 16:12:51 -0500
Subject: [Web-SIG] RESTful Python email list?
Message-ID:

This came up at the REST BoF at Pycon...

Any interest in a dedicated email list for REST + python, a la the
restful-json group [0]? The group would discuss strategies for REST
architecture built with and within Python. WSGI 1.0 vs. 2.0 vs. 2e6 is
out of scope. ;-)

--Pete

[0] - http://groups.google.com/group/restful-json

From alan at xhaus.com Sat Apr 11 15:05:16 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Sat, 11 Apr 2009 14:05:16 +0100
Subject: [Web-SIG] RESTful Python email list?
In-Reply-To:
References:
Message-ID: <4a951aa00904110605o625554d9x61ed39420e523825@mail.gmail.com>

[Pete]
> Any interest in a dedicated email list for REST + python, a la the
> restful-json group [0]? The group would discuss strategies for REST
> architecture built with and within Python. WSGI 1.0 vs. 2.0 vs. 2e6 is
> out of scope. ;-)

Just a thought: is there any reason why RESTful python discussions
cannot take place on the restful-json group referred to?

Alan.

From jim at zope.com Sat Apr 11 16:01:45 2009
From: jim at zope.com (Jim Fulton)
Date: Sat, 11 Apr 2009 10:01:45 -0400
Subject: [Web-SIG] RESTful Python email list?
In-Reply-To:
References:
Message-ID:

On Apr 10, 2009, at 5:12 PM, Pete wrote:
> This came up at the REST BoF at Pycon...
>
> Any interest in a dedicated email list for REST + python, a la the
> restful-json group [0]? The group would discuss strategies for REST
> architecture built with and within Python. WSGI 1.0 vs. 2.0 vs. 2e6
> is out of scope. ;-)

-1

I'd be happy to see the discussions here.

Jim

--
Jim Fulton
Zope Corporation

From milesck at umich.edu Sun Apr 12 02:48:59 2009
From: milesck at umich.edu (Miles Kaufmann)
Date: Sat, 11 Apr 2009 20:48:59 -0400
Subject: [Web-SIG] Python 3: Form data encoding issues in cgi and urllib
 modules
Message-ID: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com>

Hi everyone,

I read through the recent archives, and I've seen some discussion on
similar topics, but not this exact topic recently, so if the solution to
these issues has already been decided, please point me to the relevant
messages. (Also, if this isn't the most appropriate list, please let me
know!)

The first issue is that there doesn't seem to be a way to parse
x-www-form-urlencoded query strings in a character set other than UTF-8,
for example:

    'premier=un&deuxi%E8me=deux' # latin-1

The urllib.parse.unquote* functions take encoding and errors parameters,
but none of the higher-level functions do. The solution to me seems to
be that the functions that build on top of them (urllib.parse.parse*,
cgi.parse*, and the cgi.FieldStorage constructor) should grow encoding
and errors parameters that they pass through to the lower-level
functions.

The second issue is that the FieldStorage classes work with text input
streams. However, with multipart/form-data posts, posted files aren't
necessarily in the same encoding as form fields, or may be binary and
not text at all. I would suggest that FieldStorage should be changed to
take a binary input stream. For multipart forms, it should only attempt
to decode a part with the passed-in FieldStorage encoding if the part's
content type is text/plain and the content-disposition does not specify
a filename; otherwise, field.file would be a binary file, and
field.value should be bytes or non-existent.
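
A minimal sketch of the two proposals above, for illustration only. The
helper names parse_query and part_value are hypothetical (they are not
part of urllib or cgi); the only library call assumed is
urllib.parse.unquote with its encoding and errors parameters, which the
message already mentions.

```python
from urllib.parse import unquote


def parse_query(qs, encoding="utf-8", errors="replace"):
    """Split an x-www-form-urlencoded string into (name, value) pairs.

    Hypothetical helper: it passes the caller's charset through to
    unquote(), which is what the higher-level parsers are asked to do.
    """
    pairs = []
    for field in qs.split("&"):
        if not field:
            continue
        name, _, value = field.partition("=")
        # "+" stands for a space in form encoding; translate it first.
        name = unquote(name.replace("+", " "), encoding=encoding, errors=errors)
        value = unquote(value.replace("+", " "), encoding=encoding, errors=errors)
        pairs.append((name, value))
    return pairs


def part_value(payload, content_type="text/plain", filename=None,
               encoding="utf-8", errors="replace"):
    """Decode a multipart part only if it is a plain-text, non-file field.

    Hypothetical helper mirroring the rule proposed for FieldStorage:
    anything with a filename or a non-text/plain type stays as bytes.
    """
    if content_type == "text/plain" and filename is None:
        return payload.decode(encoding, errors)
    return payload


print(parse_query("premier=un&deuxi%E8me=deux", encoding="latin-1"))
```

With encoding="latin-1" the second field name decodes to "deuxième";
with the default UTF-8 the lone %E8 byte is not valid and would be
replaced rather than decoded.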

Here is an example form submission that is currently difficult to handle
with the cgi module, posted from a page with a charset of UTF-8 and two
attached files; this is similar to how a real form submission from
Safari or Firefox would look:

    post_input = b"""---123
    Content-Disposition: form-data; name="utf8text"

    \xc2\xa1ol\xc3\xa9!
    ---123
    Content-Disposition: form-data; name="file1"; filename="latin1.txt"
    Content-Type: text/plain

    Oh l\xe0 l\xe0!
    ---123
    Content-Disposition: form-data; name="file2"; filename="binary"
    Content-Type: application/octet-stream

    \x80\x81\x82\x83\x84\x85\x86\x87\xad\xf0
    ---123--
    """

    environ = {'CONTENT_LENGTH': str(len(post_input)),
               'CONTENT_TYPE': 'multipart/form-data; boundary=-123',
               'REQUEST_METHOD': 'POST'}

It's possible that the email.mime and http packages might also need some
changes made, but I haven't looked into those as much. Also,
cgi.parse_multipart seems to be broken currently, since it uses
http.client.parse_headers which expects a bytes stream.

If there's agreement on these points, I think it would be important to
get these changes (or perhaps alternate fixes) into Python 3.1; I know
that some of the changes are backwards incompatible with 3.0, but I
think that the encoding issues in the current cgi module make it very
difficult to work with. I'm willing to take responsibility for
submitting bug reports and patches, but could probably use a more
experienced mentor to let me know if I'm doing it wrong.

If you don't think that these changes are reasonable, I'm interested to
hear your alternate suggestions. I strongly believe that the current
behavior is broken and needs to be changed for 3.1.

Thanks for your consideration,
Miles Kaufmann

From milesck at umich.edu Sun Apr 12 03:41:51 2009
From: milesck at umich.edu (Miles Kaufmann)
Date: Sat, 11 Apr 2009 21:41:51 -0400
Subject: [Web-SIG] Python 3: Form data encoding issues in cgi and urllib
 modules
In-Reply-To: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com>
References: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com>
Message-ID: <5ec9495f0904111841k4075c17cl6bd5ef1dbc17595a@mail.gmail.com>

On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
> ...
> It's possible that the email.mime and http packages might also need
> some changes made, but I haven't looked into those as much.
> ...

Apparently there's been some discussion on the python-dev and email-sig
lists in the past couple of days since I last checked, about the email
package and strings and bytes. So it might be the case that the cgi
module will build on top of those decisions. But I want to make sure
that the cgi module isn't left behind, and I think that having
FieldStorage being built from string streams instead of byte streams is
a mistake that should be rectified ASAP.

On Fri, Apr 10, 2009 at 12:35 PM, Bill Janssen wrote [1]:
> Barry Warsaw wrote:
>> In that case, we really need the
>> bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build
>> things on top of that.
>
> Yep.

-Miles Kaufmann

[1] http://mail.python.org/pipermail/email-sig/2009-April/000438.html

From ogbujic at ccf.org Mon Apr 13 16:40:27 2009
From: ogbujic at ccf.org (Chimezie Ogbuji)
Date: Mon, 13 Apr 2009 10:40:27 -0400
Subject: [Web-SIG] Closing long-running WSGI requests (possible?)
Message-ID:

Hello. I have a problem with a WSGI-based SPARQL server that I have been
unable to resolve for some time. I was told this is the best place to
ask :). I'm building a SPARQL [1] server that is deployed as a WSGI/Paste
server.
SPARQL queries are handled by the server and evaluated against a MySQL
database using mysql-python/MySQLdb to manage the connection.

My goal is to be able to allow clients to close the connection in order
to kill queries that have been dispatched (in order to 'abort' them).
Unfortunately, when the client kills the connection, the application is
not signaled in any way. So, the result is that (for long-running
queries), the MySQL query continues to run even after the connection is
closed (by clicking cancel in the browser for instance).

I would expect that when the connection is closed at the client side,
this should trigger a chain reaction of garbage collection (deletion of
the application object, and all the objects attributed to it including
the DB connection, etc.) that bottoms out in the db connection closing
and MySQLdb killing the query as a side effect of calling __del__ on the
cursor and database connection. However, this is not what is happening,
and it appears that once the result is served back to the client, the
server and the client are completely 'disconnected' for that particular
request.

Am I going about this the wrong way? Does WSGI simply not have anything
to say about such a situation? If the problem isn't WSGI, is there
another WSGI implementation that is known to behave as expected (i.e.,
closing the connection dispatches the deletion of the objects involved
in the request handling)?

I was told to look into keep-alive, but the specification doesn't seem
to suggest that this would help me, as it has more to do with re-using
connections for subsequent requests than with specifying that the server
maintains a connection between the request and the objects involved in
handling the request at the server.

Any help would be greatly appreciated.

Thanks

[1] http://www.w3.org/TR/rdf-sparql-query/

From christian at dowski.com Mon Apr 13 16:53:10 2009
From: christian at dowski.com (Christian Wyglendowski)
Date: Mon, 13 Apr 2009 10:53:10 -0400
Subject: [Web-SIG] Closing long-running WSGI requests (possible?)
In-Reply-To:
References:
Message-ID:

On Mon, Apr 13, 2009 at 10:40 AM, Chimezie Ogbuji wrote:
> Hello. I have a problem with a WSGI-based SPARQL server that I have been
> unable to resolve for some time. I was told this is the best place to ask
> :). I'm building a SPARQL [1] server that is deployed as WSGI/Paste
> server. SPARQL queries are handled by the server and evaluated against a
> MySQL database using mysql-python/MySQLdb to manage the connection.
>
> My goal is to be able to allow clients to close the connection in order to
> kill queries that have been dispatched (in order to 'abort' them).

This should be doable from what I understand.
From PEP 333:

"If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an error. (This is to support resource release by the application. This protocol is intended to complement PEP 325's generator support, and other common iterables with close() methods.)" [1]

So it sounds like you could add a close() method to whatever iterable your application returns and have it do the required resource release there.

HTH,

Christian
http://www.dowski.com

[1] http://www.python.org/dev/peps/pep-0333/#specification-details

From ionel.mc at gmail.com Mon Apr 13 18:01:09 2009
From: ionel.mc at gmail.com (Ionel Maries Cristian)
Date: Mon, 13 Apr 2009 19:01:09 +0300
Subject: [Web-SIG] Closing long-running WSGI requests (possible?)
In-Reply-To:
References:
Message-ID:

That implies one would have extremely reliable TCP connections, with clients gracefully shutting down the connection and the server being notified of that.

Most of the time that doesn't happen, and the solution is to continuously send keepalive packets (some small string or whatever). I'm assuming you run a batch of queries and can interleave yielding some data while you run that batch.

For example, if your client disconnects and the server tries to send some data, the send would fail and trigger closing the app iterable.

In contrast, a server that just runs some backend processing without moving any data around doesn't have any way to know whether the connection is still valid.

Then again, even if the client properly shuts down the connection, the server won't do anything about it if it doesn't try to do anything with the socket, due to the synchronous nature (I'm assuming) of the whole server/app.

-- ionel

On Mon, Apr 13, 2009 at 17:53, Christian Wyglendowski wrote:
> On Mon, Apr 13, 2009 at 10:40 AM, Chimezie Ogbuji wrote:
> > Hello.
I have a problem with a WSGI-based SPARQL server that I have been > > unable to resolve for some time. I was told this is the best place to > ask > > :). I'm building a SPARQL [1] server that is deployed as WSGI/Paste > > server. SPARQL queries are handled by the server and evaluated against a > > MySQL database using mysql-python/MySQLdb to manage the connection. > > > > My goal is to be able to allow clients to close the connection in order > to > > kill queries that have been dispatched (in order to 'abort' them). > > This should be doable from what I understand. From PEP 333: > > "If the iterable returned by the application has a close() method, the > server or gateway must call that method upon completion of the current > request, whether the request was completed normally, or terminated > early due to an error. (This is to support resource release by the > application. This protocol is intended to complement PEP 325's > generator support, and other common iterables with close() methods." > [1] > > So it sounds like you could add a close method on whatever iterable > that your application returns and have it do the required resource > release there. > > HTH, > > Christian > http://www.dowski.com > > [1] http://www.python.org/dev/peps/pep-0333/#specification-details > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: > http://mail.python.org/mailman/options/web-sig/ionel.mc%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arw1961 at yahoo.com Mon Apr 13 20:13:05 2009 From: arw1961 at yahoo.com (Aaron Watters) Date: Mon, 13 Apr 2009 11:13:05 -0700 (PDT) Subject: [Web-SIG] Closing long-running WSGI requests (possible?) Message-ID: <734289.37918.qm@web32004.mail.mud.yahoo.com> I agree with Ionel I personally wouldn't rely on "kill wsgi request". 
I'd run the update in a subprocess and kill the subprocess using a signal when the user requests (on unix, of course). I'd also check a log written by the subprocess to see whether it completed or not. If you "kill the wsgi request" you have the problem of not being quite sure whether the kill arrived in time, among other possible difficulties, some mentioned by Ionel. -- Aaron Watters http://aaron.oirt.rutgers.edu/myapp/docs/W0500.quickstart (apologies to Christian, who got this twice, I forgot to "reply all") --- On Mon, 4/13/09, Ionel Maries Cristian wrote: > From: Ionel Maries Cristian > Subject: Re: [Web-SIG] Closing long-running WSGI requests (possible?) > To: "Christian Wyglendowski" > Cc: "Chimezie Ogbuji" , web-sig at python.org > Date: Monday, April 13, 2009, 12:01 PM > That implies one would have extremely > reliable tcp connections, and clients > graciously shutdown the connection and the server is > notified of that. > > Most of the time that doesn't happen and the solution > is to continuously send > > keepalive packets (some small string or whatever) - I'm > assuming you run > a batch a set of queries and you can interleave yielding > some data while > you run that batch. > > For example if your client disconnects and the servers > tries to send some data > > it would fail - and trigger closing the app iterable. > > In contrast a server that just runs some backend processing > without moving > any data around doesn't have any way to know if the > connection is still valid. > > > Then again, even if the client properly shutdown the > connection the server > won't do anything about it if it doesn't try to do > anything with the socket due > to the synchronous nature (I'm assuming) of the whole > server/app. > > > -- ionel > > > > > On Mon, Apr 13, 2009 at 17:53, > Christian Wyglendowski > wrote: > > On Mon, Apr 13, 2009 at 10:40 AM, Chimezie > Ogbuji > wrote: > > > Hello. 
I have a problem with a WSGI-based SPARQL > server that I have been > > > unable to resolve for some time. I was told this is > the best place to ask > > > :). I'm building a SPARQL [1] server that is > deployed as WSGI/Paste > > > server. SPARQL queries are handled by the server and > evaluated against a > > > MySQL database using mysql-python/MySQLdb to manage > the connection. > > > > > > My goal is to be able to allow clients to close the > connection in order to > > > kill queries that have been dispatched (in order to > 'abort' them). > > > > This should be doable from what I understand. From > PEP 333: > > > > "If the iterable returned by the application has a > close() method, the > > server or gateway must call that method upon completion of > the current > > request, whether the request was completed normally, or > terminated > > early due to an error. (This is to support resource release > by the > > application. This protocol is intended to complement PEP > 325's > > generator support, and other common iterables with close() > methods." > > [1] > > > > So it sounds like you could add a close method on whatever > iterable > > that your application returns and have it do the required > resource > > release there. 
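
A minimal sketch of the close() hook described in the PEP 333 passage quoted above. The names ClosingIterable, FakeConnection and make_app are hypothetical, and FakeConnection only stands in for a real database connection; no MySQLdb API is assumed. The server calls close() only once it finishes or abandons the request, so, as Ionel notes, a client disconnect may go unnoticed until the server next tries to write to the socket.

```python
class ClosingIterable:
    """Wrap an iterable of response chunks and release a resource in close()."""

    def __init__(self, chunks, resource):
        self._chunks = iter(chunks)
        self._resource = resource

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._chunks)

    def close(self):
        # Per PEP 333, the server or gateway must call this when the request
        # completes, whether normally or early due to an error.
        self._resource.close()


class FakeConnection:
    """Hypothetical stand-in for a database connection (an assumption)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def make_app(connection_factory):
    """Build a WSGI application that opens one connection per request."""

    def application(environ, start_response):
        conn = connection_factory()
        start_response("200 OK", [("Content-Type", "text/plain")])
        # Placeholder rows; a real application would stream query results.
        return ClosingIterable([b"row 1\n", b"row 2\n"], conn)

    return application
```

A server that follows the specification would iterate over the returned object and then call its close() method, at which point the connection is released even if iteration stopped early.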
> > > > HTH, > > > > Christian > > http://www.dowski.com > > > > [1] http://www.python.org/dev/peps/pep-0333/#specification-details > > _______________________________________________ > > Web-SIG mailing list > > Web-SIG at python.org > > Web SIG: http://www.python.org/sigs/web-sig > > Unsubscribe: http://mail.python.org/mailman/options/web-sig/ionel.mc%40gmail.com > > > > > -----Inline Attachment Follows----- > > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/arw1961%40yahoo.com > From graham.dumpleton at gmail.com Mon Apr 13 22:58:22 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Tue, 14 Apr 2009 06:58:22 +1000 Subject: [Web-SIG] Closing long-running WSGI requests (possible?) In-Reply-To: References: Message-ID: <88e286470904131358t1b9c8ab5we9fed25e656ae414@mail.gmail.com> No, cannot really be done. This has been discussed a couple of times on the mod_wsgi list. One such discussion is at: http://groups.google.com/group/modwsgi/browse_frm/thread/8ebd9aca9d317ac9 In general the same issues apply to all WSGI implementations. Graham 2009/4/14 Chimezie Ogbuji : > Hello. ?I have a problem with a WSGI-based SPARQL server that I have been > unable to resolve for some time. ?I was told this is the best place to ask > :). ?I'm building a SPARQL [1] server that is deployed as ?WSGI/Paste > server. ?SPARQL queries are handled by the server and evaluated against a > MySQL database using mysql-python/MySQLdb to manage the connection. > > My goal is to be able to allow clients to close the connection in order to > kill queries that have been dispatched (in order to 'abort' them). > Unfortunately, when the client kills the connection, the application is not > signaled in any way. 
So, the result is that (for long-running queries), the
> MySQL query continues to run even after the connection is closed (by
> clicking cancel in the browser for instance).
>
> I would expect that when the connection is closed at the client side, this
> should trigger a chain reaction of garbage collection (deletion of the
> application object, and all the objects attributed to it including the DB
> connection, etc.) that bottoms out in the db connection closing and MySQLdb
> killing the query as a side effect of calling __del__ on the cursor and
> database connection. However, this is not what is happening and it appears
> that once the result is served back to the client, the server and the
> client are completely 'disconnected' for that particular request.
>
> Am I going about this the wrong way? Does WSGI simply not have anything to
> say about such a situation? If the problem isn't
> WSGI, is there another WSGI implementation that is known to behave as
> expected (i.e., closing the connection dispatches the deletion of the
> objects involved in the request handling)?
>
> I was told to look into keep-alive, but the specification doesn't seem to
> suggest that this would help me as it has more to do with re-using
> connections for subsequent requests rather than specifying that the server
> maintains a connection between the request and the objects involved in
> handling the request at the server.
>
> Any help would be greatly appreciated.
>
> Thanks
>
> [1] http://www.w3.org/TR/rdf-sparql-query/

From manlio_perillo at libero.it Mon Apr 13 23:58:48 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Mon, 13 Apr 2009 23:58:48 +0200
Subject: [Web-SIG] Closing long-running WSGI requests (possible?)
In-Reply-To:
References:
Message-ID: <49E3B598.2000803@libero.it>

Chimezie Ogbuji wrote:
> Hello. I have a problem with a WSGI-based SPARQL server that I have been
> unable to resolve for some time. I was told this is the best place to ask
> :). I'm building a SPARQL [1] server that is deployed as WSGI/Paste
> server. SPARQL queries are handled by the server and evaluated against a
> MySQL database using mysql-python/MySQLdb to manage the connection.
>
> My goal is to be able to allow clients to close the connection in order to
> kill queries that have been dispatched (in order to 'abort' them).
> Unfortunately, when the client kills the connection, the application is not
> signaled in any way.
So, the result is that (for long-running queries), the
> MySQL query continues to run even after the connection is closed (by
> clicking cancel in the browser for instance).
>
> [...]

What you want to do is not possible.

A more viable solution is to use JavaScript. Add a custom "abort" button to the web page and associate a function with its "click" event. You should also associate a function with the "unload" event (where you can check whether there are active queries).

In the JavaScript function you can issue an XMLHttpRequest, using a unique identifier.

Note that if you use PostgreSQL, you can use:
http://www.postgresql.org/docs/8.3/interactive/protocol-flow.html#AEN73870

When you create a connection to PostgreSQL, the server will send you the backend process ID and a unique key. You can use this data to send a cancellation request.

All you need to do is to pass the process ID and the unique key to the client (with some encryption so that the client can use the data only once).

Unfortunately, libpq does not offer a flexible interface to this feature. The PGcancel structure is opaque, so you need some hacking.

Manlio Perillo

From davidgshi at yahoo.co.uk Tue Apr 14 12:29:37 2009
From: davidgshi at yahoo.co.uk (David Shi)
Date: Tue, 14 Apr 2009 10:29:37 +0000 (GMT)
Subject: [Web-SIG] RESTful Python email list?
Message-ID: <420162.76101.qm@web26306.mail.ukl.yahoo.com>

I am using Python and promoting the use of Python. I am now interested in finding good demos on generating tokens dynamically and using JavaScript to call RESTful services with the token embedded.

Regards.

David

--- On Sat, 11/4/09, Jim Fulton wrote:

From: Jim Fulton
Subject: Re: [Web-SIG] RESTful Python email list?
To: "Pete"
Cc: web-sig at python.org
Date: Saturday, 11 April, 2009, 3:01 PM

On Apr 10, 2009, at 5:12 PM, Pete wrote:

> This came up at the REST BoF at Pycon...
>
> Any interest in a dedicated email list for REST + python, a la the restful-json group [0]? The group would discuss strategies for REST architecture built with and within Python. WSGI 1.0 vs. 2.0 vs. 2e6 is out of scope. ;-)

-1

I'd be happy to see the discussions here.

Jim

--
Jim Fulton
Zope Corporation

From davidgshi at yahoo.co.uk Wed Apr 15 15:38:54 2009
From: davidgshi at yahoo.co.uk (David Shi)
Date: Wed, 15 Apr 2009 13:38:54 +0000 (GMT)
Subject: [Web-SIG] Python-generating tokens dynamically at runtime
Message-ID: <35827.24291.qm@web26301.mail.ukl.yahoo.com>

I am using Python and promoting the use of Python. I am now interested in finding good demos on generating tokens dynamically and using JavaScript to call RESTful services with the token embedded.

Regards.

David

From milesck at umich.edu Wed Apr 15 23:16:08 2009
From: milesck at umich.edu (Miles Kaufmann)
Date: Wed, 15 Apr 2009 17:16:08 -0400
Subject: [Web-SIG] Python 3: Form data encoding issues in cgi and urllib modules
In-Reply-To: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com>
References: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com>
Message-ID: <5ec9495f0904151416l3f5705a9ufd695c02da3cfebf@mail.gmail.com>

On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
> The first issue is that there doesn't seem to be a way to parse
> x-www-form-urlencoded query strings in a character set other than
> UTF-8, for example:
>
> 'premier=un&deuxi%E8me=deux' # latin-1
>
> The urllib.parse.unquote* functions take encoding and errors
> parameters, but none of the higher-level ones.
The solution to me
> seems to be that functions that build on top of
> it--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
> constructor--should grow encoding and errors parameters that they pass
> through to the lower-level functions.
>
> The second issue is that the FieldStorage classes work with text input
> streams. However, with multipart/form-data posts, posted files aren't
> necessarily in the same encoding as form fields, or may be binary and
> not text at all. I would suggest that FieldStorage should be changed
> to take a binary input stream.
>
> [...]

I'm not quite sure how to interpret the lack of response I've gotten on this topic. Is it just that there's little interest in the cgi module? Should I raise this issue on the python-dev list, or just open a bug report and start submitting patches?

There's been a lot of discussion recently about bytes vs. str in email headers and WSGI environ variables, but I haven't been able to find a substantive discussion on this specific topic. Here are some of the related quotes I've come across.

Martin v. Löwis wrote [1]:
> In a CGI application, you shouldn't be using sys.stdin or print().
> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
> if you use TextIO, there likely will be bugs (e.g. if you have
> attachments of type application/octet-stream).

bobince wrote [2]:
> Evan Fosmark wrote:
>> bobince wrote:
>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>> that hasn't been fixed properly for the new string model. It should
>>> be converting the incoming byte stream to characters before
>>> passing them to urllib.
>>>
>>> Did I mention Python 3.0's libraries (especially web-related
>>> ones) still being rather shonky? :-)
>>
>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>> wsgiref. I hope they get fixed soon. :(
>
> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
> seems to want ownership of the issue. Very disappointing.

There's also this bug report [3], but it doesn't directly propose the changes that I have.

So: does anyone agree, or disagree, that cgi.FieldStorage should be changed to take byte streams, and many of the cgi and urllib.parse functions should become encoding-aware, preferably in time for Python 3.1? The byte-stream change will break compatibility with Python 3.0, but I strongly feel that treating POST data as text is wrong and should not continue to be supported.

-Miles Kaufmann

[1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
[2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
[3]: http://bugs.python.org/issue4953

From graham.dumpleton at gmail.com Wed Apr 15 23:23:59 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Thu, 16 Apr 2009 07:23:59 +1000
Subject: [Web-SIG] Python 3: Form data encoding issues in cgi and urllib modules
In-Reply-To: <5ec9495f0904151416l3f5705a9ufd695c02da3cfebf@mail.gmail.com>
References: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com> <5ec9495f0904151416l3f5705a9ufd695c02da3cfebf@mail.gmail.com>
Message-ID: <88e286470904151423h663102bdk49eaef5c258ae33f@mail.gmail.com>

2009/4/16 Miles Kaufmann :
> On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
>> The first issue is that there doesn't seem to be a way to parse
>> x-www-form-urlencoded query strings in a character set other than
>> UTF-8, for example:
>>
>> 'premier=un&deuxi%E8me=deux' # latin-1
>>
>> The urllib.parse.unquote* functions take encoding and errors
>> parameters, but none of the higher-level ones. The solution to me
>> seems to be that functions that build on top of
>> it--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
>> constructor--should grow encoding and errors parameters that they pass
>> through to the lower-level functions.
>>
>> The second issue is that the FieldStorage classes work with text input
>> streams. However, with multipart/form-data posts, posted files aren't
>> necessarily in the same encoding as form fields, or may be binary and
>> not text at all. I would suggest that FieldStorage should be changed
>> to take a binary input stream.
>>
>> [...]
>
> I'm not quite sure how to interpret the lack of response I've gotten
> on this topic. Is it just that there's little interest in the cgi
> module? Should I raise this issue on the python-dev list, or just
> open a bug report and start submitting patches?
>
> There's been a lot of discussion recently about bytes vs. str in email
> headers and WSGI environ variables, but I haven't been able to find a
> substantive discussion on this specific topic. Here are some of the
> related quotes I've come across.
>
> Martin v. Löwis wrote [1]:
>> In a CGI application, you shouldn't be using sys.stdin or print().
>> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
>> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
>> if you use TextIO, there likely will be bugs (e.g. if you have
>> attachments of type application/octet-stream).
>
> bobince wrote [2]:
>> Evan Fosmark wrote:
>>> bobince wrote:
>>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>>> that hasn't been fixed properly for the new string model. It should
>>>> be converting the incoming byte stream to characters before
>>>> passing them to urllib.
>>>>
>>>> Did I mention Python 3.0's libraries (especially web-related
>>>> ones) still being rather shonky? :-)
>>>
>>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>>> wsgiref. I hope they get fixed soon. :(
>>
>> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
>> seems to want ownership of the issue. Very disappointing.
>
> There's also this bug report [3], but it doesn't directly propose the
> changes that I have.
>
> So: does anyone agree, or disagree, that cgi.FieldStorage should be
> changed to take byte streams, and many of the cgi and urllib.parse
> functions should become encoding-aware, preferably in time for Python
> 3.1? The byte-stream change will break compatibility with Python
> 3.0, but I strongly feel that treating POST data as text is wrong and
> should not continue to be supported.
>
> -Miles Kaufmann
>
> [1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
> [2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
> [3]: http://bugs.python.org/issue4953

Have you read:

  http://bugs.python.org/issue3300

This was referenced in a prior post here and is likely relevant. A lot of the discussion for that was happening on the developers list for Python 3.0.

Not sure why someone was taking issue with the WEB-SIG list over cgi FieldStorage issues, as I don't recollect us having any substantive discussion about it and any problems it has.

Graham

From milesck at umich.edu Thu Apr 16 00:26:47 2009
From: milesck at umich.edu (Miles Kaufmann)
Date: Wed, 15 Apr 2009 18:26:47 -0400
Subject: [Web-SIG] Python 3: Form data encoding issues in cgi and urllib modules
In-Reply-To: <88e286470904151423h663102bdk49eaef5c258ae33f@mail.gmail.com>
References: <5ec9495f0904111748p49ad255bib898d41e05e57d3d@mail.gmail.com> <5ec9495f0904151416l3f5705a9ufd695c02da3cfebf@mail.gmail.com> <88e286470904151423h663102bdk49eaef5c258ae33f@mail.gmail.com>
Message-ID: <5ec9495f0904151526l35aaeb1cl57bddf4b82ccb4ff@mail.gmail.com>

On Wed, Apr 15, 2009 at 5:23 PM, Graham Dumpleton wrote:
> 2009/4/16 Miles Kaufmann :
>> So: does anyone agree, or disagree, that cgi.FieldStorage should be
>> changed to take byte streams, and many of the cgi and urllib.parse
>> functions should become encoding-aware, preferably in time for Python
>> 3.1?
The byte-stream change will break compatibility with Python
>> 3.0, but I strongly feel that treating POST data as text is wrong and
>> should not continue to be supported.
>
> Have you read:
>
>   http://bugs.python.org/issue3300
>
> This was referenced in a prior post here and is likely relevant. A lot
> of the discussion for that was happening on developers list for Python
> 3.0.

I hadn't. Thanks for the link! That was a long read, so apologies if I missed anything, but that discussion seems to pertain almost entirely to the urllib.parse.[un]quote* functions; there was only one point where it was mentioned that there would be issues with non-UTF-8 data for higher-level functions [1], and nothing followed from that.

I don't think it should be a controversial move to add encoding and errors parameters to the following functions:

* urllib.parse.parse_qs
* urllib.parse.parse_qsl
* urllib.parse.urlencode

which, I feel, would be in line with the outcome of the discussion you referenced, shouldn't break any existing code, and would make it possible to parse the "quite prevalent" [2] instances of non-UTF-8 query strings like the following:

'premier=un&deuxi%E8me=deux' # latin-1

The parameters would also need to be added to cgi.parse, cgi.parse_multipart, and cgi.FieldStorage, if they were in fact changed to expect a bytes file input, as I suggest.

> Not sure why someone was taking issue with WEB-SIG list over cgi
> FieldStorage issues as I don't recollect us having any substantive
> discussion about it and any problems it has.

Exactly; that person's issue was that there hasn't been substantive discussion. Which is what I'm trying to create now.
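
To make the proposal concrete, here is a rough sketch of what passing encoding and errors through to the lower-level function could look like. The helper name parse_qsl_with_encoding is hypothetical and is not part of urllib.parse; the sketch relies only on urllib.parse.unquote_plus, which already accepts encoding and errors, and it ignores details such as blank-value handling and strict parsing.

```python
from urllib.parse import unquote_plus


def parse_qsl_with_encoding(qs, encoding="utf-8", errors="replace"):
    """Hypothetical helper: split a query string into (name, value) pairs,
    decoding percent-escapes with the caller's encoding and error policy."""
    pairs = []
    for field in qs.split("&"):
        if not field:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = field.partition("=")
        pairs.append((
            unquote_plus(name, encoding=encoding, errors=errors),
            unquote_plus(value, encoding=encoding, errors=errors),
        ))
    return pairs


# The latin-1 example from this thread decodes cleanly once the encoding
# is supplied by the caller.
print(parse_qsl_with_encoding("premier=un&deuxi%E8me=deux", encoding="latin-1"))
# → [('premier', 'un'), ('deuxième', 'deux')]
```

With the default UTF-8 and errors="replace", the same input would yield a replacement character in place of the byte 0xE8, which is the behaviour the proposal is trying to make avoidable.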
:)

-Miles Kaufmann

[1]: http://bugs.python.org/msg70970
[2]: http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html

From graham.dumpleton at gmail.com Thu Apr 16 09:12:11 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Thu, 16 Apr 2009 17:12:11 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <86217.1238608796@parc.com> <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com>
Message-ID: <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>

2009/4/4 Robert Brewer :
> Alan Kennedy wrote:
>> [Bill]
>> > I think the controlling reference here is RFC 3875.
>>
>> I think the controlling references are RFC 2616, RFC 2396 and RFC 3987.
>>
>> RFC 2616, the HTTP 1.1 spec, punts on the question of character
>> encoding for the request URI.
>>
>> RFC 2396, the URI spec, says
>>
>> """
>>    It is expected that a systematic treatment of character encoding
>>    within URI will be developed as a future modification of this
>>    specification.
>> """
>>
>> RFC 3987 is that spec, for Internationalized Resource Identifiers. It
>> says
>>
>> """
>> An IRI is a sequence of characters from the Universal Character Set
>> (Unicode/ISO 10646).
>> """
>>
>> and
>>
>> """
>> 1.2.  Applicability
>>
>>    IRIs are designed to be compatible with recommendations for new URI
>>    schemes [RFC2718].  The compatibility is provided by specifying a
>>    well-defined and deterministic mapping from the IRI character
>>    sequence to the functionally equivalent URI character sequence.
>>    Practical use of IRIs (or IRI references) in place of URIs (or URI
>>    references) depends on the following conditions being met:
>> """
>>
>> followed by
>>
>> """
>>    c.  The URI corresponding to the IRI in question has to encode
>>        original characters into octets using UTF-8.  For new URI
>>        schemes, this is recommended in [RFC2718].  It can apply to a
>>        whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
>>        or the URN syntax [RFC2141]).  It can apply to a specific part of
>>        a URI, such as the fragment identifier (e.g., [XPointer]).  It
>>        can apply to a specific URI or part(s) thereof.  For details,
>>        please see section 6.4.
>> """
>>
>> I think the question is "are people using IRIs in the wild"? If so,
>> then we must decide how do we best deal with the problems of
>> recognising iso-8859-1+rfc2047 versus utf-8, or whatever
>> server-configured encoding the user has chosen.
>
> Agreed. The Request-URI needs to handle IRI's. The headers mostly
> don't--almost all headers are mostly of type "token", which is US-ASCII.
> A few are of type "TEXT", which is ISO-8859-1/RFC 2047. The remaining
> (sub)values are mostly custom byte sequences:
>
> field-name           field-value
> ----------           -----------
> Accept               token
> Accept-Charset       token
> Accept-Encoding      token
> Accept-Language      ALPHA, plus ":", "=", "q" etc
> Accept-Ranges        token
> Age                  DIGIT
> Allow                token
> Authorization        token
> Cache-Control        token
> Connection           token
> Content-Encoding     token
> Content-Language     ALPHA
> Content-Length       DIGIT
> Content-Location     absoluteURI | relativeURI
> Content-MD5          base64 of 128 bit md5 digest
> Content-Range        DIGIT, plus "/" etc
> Content-Type         token
> Date                 HTTP-date
> ETag                 TEXT and CHAR
> Expect               token, quoted-string
> Expires              HTTP-date
> From                 ASCII (see RFC 822)
> Host                 host ":" port
> If-Match             TEXT and CHAR
> If-Modified-Since    HTTP-date
> If-None-Match        TEXT and CHAR
> If-Range             TEXT and CHAR | HTTP-date
> If-Unmodified-Since  HTTP-date
> Last-Modified        HTTP-date
> Location             absoluteURI
> Max-Forwards         DIGIT
> Pragma               token, quoted-string
> Proxy-Authenticate   token
> Proxy-Authorization  token
> Range                token
> Referer              absoluteURI | relativeURI
> Retry-After          HTTP-date | DIGIT
> Server               token, TEXT
> TE                   token
> Trailer              token
> Transfer-Encoding    token
> Upgrade              token
> User-Agent           token, TEXT
> Vary                 token
> Via                  token, host, port
> Warning              quoted-string, HTTP-date, host, port
> WWW-Authenticate     token
>
>
> The Content-Location, Location, and Referer headers are problematic
> since HTTP borrows those from the URI spec, which deals in characters
> and not bytes, as you mentioned. Host, and maybe Via, are also special
> due to possible IDNA-encoding.
>
> Regarding extension headers, I think we should assume that the HTTP/1.1
> spec implies all headers should be token (ASCII) or TEXT (ISO-8859-1).
> From section 4.2:
>
>    field-content  = <the OCTETs making up the field-value
>                     and consisting of either *TEXT or combinations
>                     of token, separators, and quoted-string>
>
> In addition, the httpbis effort seems to be enforcing this even more
> strongly [1]:
>
>     message-header = field-name ":" OWS [ field-value ] OWS
>     field-name     = token
>     field-value    = *( field-content / OWS )
>     field-content  = *( WSP / VCHAR / obs-text )
>
>   Historically, HTTP has allowed field-content with text in the ISO-
>   8859-1 [ISO-8859-1] character encoding (allowing other character sets
>   through use of [RFC2047] encoding).  In practice, most HTTP header
>   field-values use only a subset of the US-ASCII charset [USASCII].
>   Newly defined header fields SHOULD constrain their field-values to
>   US-ASCII characters.  Recipients SHOULD treat other (obs-text) octets
>   in field-content as opaque data.
>
> So, from where I sit, we have:
>
>  1. Many header values which are ASCII.
>  2. A few header values which are ISO-8859-1 plus RFC 2047.
>  3. A few header values which are URI's (no specified encoding) or IRI's
> (UTF-8).
>
> I understand the desire to decode ASAP, and I agree with Guido that we
> should use a default encoding which the app can override. Looking at the
> above, ISO-8859-1 is the best encoding I know of for all three header
> cases. ASCII can be used as a valid subset without transcoding; headers
> which are ISO-8859-1 are decoded perfectly; URI/IRI headers can be
> transcoded by the app if needed, but mangled opaquely by middleware.
>
> If we make *that* call, then IMO there's no reason not to do the same to
> SCRIPT_NAME, PATH_INFO, and QUERY_STRING.

I am not sure we ended up with a final answer on all of this, but I don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support, any longer. As such, I am implementing things as per:

http://www.wsgi.org/wsgi/Amendments_1.0

with the exception that I will not be attempting to do decoding per RFC 2047. Any CGI variables not related to HTTP headers will also be handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING. This should be equivalent to what wsgiref does in Python 3.X and basically keeps the status quo.

If anyone has any last things to say on all of this, please speak up now.

Graham

From fumanchu at aminus.org Thu Apr 16 18:33:58 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Thu, 16 Apr 2009 09:33:58 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
References: <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
Message-ID: <1239899638.19337.5.camel@haku>

On Thu, 2009-04-16 at 00:12 -0700, Graham Dumpleton wrote:
> > So, from where I sit, we have:
> >
> > 1. Many header values which are ASCII.
> > 2. A few header values which are ISO-8859-1 plus RFC 2047.
> > 3.
A few header values which are URI's (no specified encoding) or IRI's
> > (UTF-8).
> >
> > I understand the desire to decode ASAP, and I agree with Guido that we
> > should use a default encoding which the app can override. Looking at the
> > above, ISO-8859-1 is the best encoding I know of for all three header
> > cases. ASCII can be used as a valid subset without transcoding; headers
> > which are ISO-8859-1 are decoded perfectly; URI/IRI headers can be
> > transcoded by the app if needed, but mangled opaquely by middleware.
> >
> > If we make *that* call, then IMO there's no reason not to do the
> > same to SCRIPT_NAME, PATH_INFO, and QUERY_STRING.
>
> I am not sure we ended up with a final answer on all of this, but I
> don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
> any longer. As such, am implementing things as per:
>
> http://www.wsgi.org/wsgi/Amendments_1.0
>
> with exception that will not be attempting to do decoding per RFC
> 2047. Any CGI variables not related to HTTP headers will also be
> handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
> This should be equivalent with what wsgiref does in Python 3.X and
> basically keeps the status quo.
>
> If anyone has any last things to say on all of this, please speak up
> now.
>

That sounds fine to me, Graham, and is what I'll be implementing in my
python3 branch for CherryPy barring any unforeseen impediments.


Robert Brewer
fumanchu at aminus.org

From foom at fuhm.net  Thu Apr 16 21:31:18 2009
From: foom at fuhm.net (James Y Knight)
Date: Thu, 16 Apr 2009 15:31:18 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <86217.1238608796@parc.com>
    <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com>
    <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
Message-ID: <052B16D8-D0F6-4D85-9A39-9BA9F2F544EA@fuhm.net>

On Apr 16, 2009, at 3:12 AM, Graham Dumpleton wrote:
> I am not sure we ended up with a final answer on all of this, but I
> don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
> any longer. As such, am implementing things as per:
>
> http://www.wsgi.org/wsgi/Amendments_1.0
>
> with exception that will not be attempting to do decoding per RFC
> 2047. Any CGI variables not related to HTTP headers will also be
> handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
> This should be equivalent with what wsgiref does in Python 3.X and
> basically keeps the status quo.
>
> If anyone has any last things to say on all of this, please speak up
> now.

IMO it would make more sense to have the headers be bytes instead of
strings decoded/encoded with latin-1, but it's not a huge deal...

James

From graham.dumpleton at gmail.com  Fri Apr 17 01:03:27 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 17 Apr 2009 09:03:27 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <052B16D8-D0F6-4D85-9A39-9BA9F2F544EA@fuhm.net>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <86217.1238608796@parc.com>
    <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com>
    <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
    <052B16D8-D0F6-4D85-9A39-9BA9F2F544EA@fuhm.net>
Message-ID: <88e286470904161603w459683f0n310334e7a101ff6b@mail.gmail.com>

2009/4/17 James Y Knight:
> On Apr 16, 2009, at 3:12 AM, Graham Dumpleton wrote:
>>
>> I am not sure we ended up with a final answer on all of this, but I
>> don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
>> any longer. As such, am implementing things as per:
>>
>>  http://www.wsgi.org/wsgi/Amendments_1.0
>>
>> with exception that will not be attempting to do decoding per RFC
>> 2047. Any CGI variables not related to HTTP headers will also be
>> handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
>> This should be equivalent with what wsgiref does in Python 3.X and
>> basically keeps the status quo.
>>
>> If anyone has any last things to say on all of this, please speak up now.
>
>
> IMO it would make more sense to have the headers be bytes instead of strings
> decoded/encoded with latin-1, but it's not a huge deal...

It is a huge deal in as much as we don't use any sort of formal voting
process here and for better or worse, rely on consensus. If there is
anyone who has countering views and we don't as a group come up with
some formal statement about how things should be done, then it makes
it very hard for the likes of Robert and myself who need to implement
the thing. So, we need to deal with the different views people have
and balance them up and make a decision. Until I feel there is some
sort of official decision one way or another, I can't release any
code.
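
[Illustration, not part of the original message.] The latin-1 handling
discussed in this thread relies on latin-1 mapping every byte value to one
code point, so an application can get the original bytes back and decode
them with whatever encoding it expects. A minimal sketch, assuming the
server has already percent-decoded the path and decoded it as latin-1; the
helper name `recover_path` and the sample paths are hypothetical:

```python
def recover_path(environ, encoding="utf-8"):
    """Re-decode a PATH_INFO value that the server decoded as latin-1."""
    # Encoding back to latin-1 is lossless: it yields the bytes from the wire.
    raw = environ.get("PATH_INFO", "").encode("latin-1")
    # The application picks the real encoding; undecodable bytes are replaced.
    return raw.decode(encoding, errors="replace")


# Simulate a server that received UTF-8 bytes and exposed them as latin-1 text.
utf8_environ = {"PATH_INFO": "/caf\u00e9".encode("utf-8").decode("latin-1")}

# Simulate a client that sent the path in cp1251 instead.
cp1251_text = "/\u043f\u0440\u0438\u0432\u0435\u0442"
cp1251_environ = {"PATH_INFO": cp1251_text.encode("cp1251").decode("latin-1")}

print(recover_path(utf8_environ))
print(recover_path(cp1251_environ, "cp1251"))
```

With bytes in the environ, as James suggests, the first step of the helper
would be unnecessary; the second step is needed either way.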
Graham

From maluke at gmail.com  Fri Apr 17 01:28:23 2009
From: maluke at gmail.com (Sergey Schetinin)
Date: Fri, 17 Apr 2009 02:28:23 +0300
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904161603w459683f0n310334e7a101ff6b@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <86217.1238608796@parc.com>
    <4a951aa00904011615w58651c62ucadd5da07f4a6005@mail.gmail.com>
    <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
    <052B16D8-D0F6-4D85-9A39-9BA9F2F544EA@fuhm.net>
    <88e286470904161603w459683f0n310334e7a101ff6b@mail.gmail.com>
Message-ID: <116315680904161628x73535c8dxde951fc33225f3ea@mail.gmail.com>

On Fri, Apr 17, 2009 at 02:03, Graham Dumpleton wrote:
> 2009/4/17 James Y Knight:
>> On Apr 16, 2009, at 3:12 AM, Graham Dumpleton wrote:
>>>
>>> I am not sure we ended up with a final answer on all of this, but I
>>> don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
>>> any longer. As such, am implementing things as per:
>>>
>>>  http://www.wsgi.org/wsgi/Amendments_1.0
>>>
>>> with exception that will not be attempting to do decoding per RFC
>>> 2047. Any CGI variables not related to HTTP headers will also be
>>> handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
>>> This should be equivalent with what wsgiref does in Python 3.X and
>>> basically keeps the status quo.
>>>
>>> If anyone has any last things to say on all of this, please speak up now.
>>
>>
>> IMO it would make more sense to have the headers be bytes instead of strings
>> decoded/encoded with latin-1, but it's not a huge deal...
>
> It is a huge deal in as much as we don't use any sort of formal voting
> process here and for better or worse, rely on consensus. If there is
> anyone who has countering views and we don't as a group come up with
> some formal statement about how things should be done, then it makes
> it very hard for the likes of Robert and myself who need to implement
> the thing.
So, we need to deal with the different views people have
> and balance them up and make a decision. Until I feel there is some
> sort of official decision one way or another, I can't release any
> code.

+1 to Amendments.

I work with WSGI quite a lot and have a server implementation as well
(experimental trellis-based server w/ async app extensions). While I
don't plan to use the 3.x branch anytime soon, all the amendments make
perfect sense to me.

I did encounter user-agents that send the HTTP path encoded in cp1251,
for example, but I don't think that it's a good idea to keep environ
values as bytes and expect WSGI apps to sort out the mess. The U-A
that sent the broken path seemed to be some sort of spider, so it's
not like one would be losing visitors due to this.

From graham.dumpleton at gmail.com  Fri Apr 17 01:37:57 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 17 Apr 2009 09:37:57 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <1239899638.19337.5.camel@haku>
References: <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
    <1239899638.19337.5.camel@haku>
Message-ID: <88e286470904161637p4e7ba4aj6228117d29439ac5@mail.gmail.com>

2009/4/17 Robert Brewer:
> On Thu, 2009-04-16 at 00:12 -0700, Graham Dumpleton wrote:
>> > So, from where I sit, we have:
>> >
>> >  1. Many header values which are ASCII.
>> >  2. A few header values which are ISO-8859-1 plus RFC 2047.
>> >  3. A few header values which are URI's (no specified encoding) or IRI's
>> > (UTF-8).
>> >
>> > I understand the desire to decode ASAP, and I agree with Guido that we
>> > should use a default encoding which the app can override. Looking at the
>> > above, ISO-8859-1 is the best encoding I know of for all three header
>> > cases. ASCII can be used as a valid subset without transcoding; headers
>> > which are ISO-8859-1 are decoded perfectly; URI/IRI headers can be
>> > transcoded by the app if needed, but mangled opaquely by middleware.
>> >
>> > If we make *that* call, then IMO there's no reason not to do the
>> > same to SCRIPT_NAME, PATH_INFO, and QUERY_STRING.
>>
>> I am not sure we ended up with a final answer on all of this, but I
>> don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
>> any longer. As such, am implementing things as per:
>>
>>   http://www.wsgi.org/wsgi/Amendments_1.0
>>
>> with exception that will not be attempting to do decoding per RFC
>> 2047. Any CGI variables not related to HTTP headers will also be
>> handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
>> This should be equivalent with what wsgiref does in Python 3.X and
>> basically keeps the status quo.
>>
>> If anyone has any last things to say on all of this, please speak up
>> now.
>>
> That sounds fine to me, Graham, and is what I'll be implementing in my
> python3 branch for CherryPy barring any unforeseen impediments.

Are you moving to use of empty string as end of input sentinel for
wsgi.input for case where code does actually read more than
CONTENT_LENGTH?

Graham

From fumanchu at aminus.org  Fri Apr 17 02:06:55 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Thu, 16 Apr 2009 17:06:55 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470904161637p4e7ba4aj6228117d29439ac5@mail.gmail.com>
References: <88e286470904160012g7c748d8bke584a5325fbdc03@mail.gmail.com>
    <1239899638.19337.5.camel@haku>
    <88e286470904161637p4e7ba4aj6228117d29439ac5@mail.gmail.com>
Message-ID: <1239926815.19337.13.camel@haku>

On Fri, 2009-04-17 at 09:37 +1000, Graham Dumpleton wrote:
> >> I am not sure we ended up with a final answer on all of this, but I
> >> don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
> >> any longer. As such, am implementing things as per:
> >>
> >> http://www.wsgi.org/wsgi/Amendments_1.0
> >>
> >> with exception that will not be attempting to do decoding per RFC
> >> 2047.
Any CGI variables not related to HTTP headers will also be
> >> handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
> >> This should be equivalent with what wsgiref does in Python 3.X and
> >> basically keeps the status quo.
> >>
> > That sounds fine to me, Graham, and is what I'll be implementing in my
> > python3 branch for CherryPy barring any unforeseen impediments.
>
> Are you moving to use of empty string as end of input sentinel for
> wsgi.input for case where code does actually read more than
> CONTENT_LENGTH?

Sure; I think that's reasonable. It's supposed to be 'file-like'.


Robert Brewer
fumanchu at aminus.org

From randy at rcs-comp.com  Mon Apr 27 04:32:20 2009
From: randy at rcs-comp.com (Randy Syring)
Date: Sun, 26 Apr 2009 22:32:20 -0400
Subject: [Web-SIG] Use 200 or 400 Status Code When...
Message-ID: <49F51934.90903@rcs-comp.com>

I have a page that accepts URL arguments like:

/student/

The id must be an integer or the URL doesn't match and the user is given
a 404. But what should I do if the id is given, is an integer, but a
student with that id does not exist? I already output a message telling
the user that they requested an invalid student. However, should that
document have a 200 or 400 (or some other) status code?

Thanks.

--
--------------------------------------
Randy Syring
RCS Computers & Web Solutions
502-644-4776
http://www.rcs-comp.com

"Whether, then, you eat or drink or
whatever you do, do all to the glory
of God." 1 Cor 10:31

From t.broyer at gmail.com  Mon Apr 27 10:38:15 2009
From: t.broyer at gmail.com (Thomas Broyer)
Date: Mon, 27 Apr 2009 10:38:15 +0200
Subject: [Web-SIG] Use 200 or 400 Status Code When...
In-Reply-To: <49F51934.90903@rcs-comp.com>
References: <49F51934.90903@rcs-comp.com>
Message-ID:

On Mon, Apr 27, 2009 at 4:32 AM, Randy Syring wrote:
> I have a page that accepts URL arguments like:
>
> /student/
>
> The id must be an integer or the URL doesn't match and the user is given a
> 404.
But what should I do if the id is given, is an integer, but a student
> with that id does not exist? I already output a message telling the user
> that they requested an invalid student. However, should that document have
> a 200 or 400 (or some other) status code?

Obviously a 404 too, as the URL identifies something that doesn't exist.

(in the case of an invalid id, i.e. not a number, you could use 410
status code too)

--
Thomas Broyer

From randy at rcs-comp.com  Mon Apr 27 23:10:34 2009
From: randy at rcs-comp.com (Randy Syring)
Date: Mon, 27 Apr 2009 17:10:34 -0400
Subject: [Web-SIG] Use 200 or 400 Status Code When...
In-Reply-To:
References: <49F51934.90903@rcs-comp.com>
Message-ID: <49F61F4A.8000602@rcs-comp.com>

Thomas,

Unfortunately, it wasn't obvious to me that a 404 was appropriate in
this situation. But, now that you mention it, I think you are right.

Thank you for your input.

Thomas Broyer wrote:
> On Mon, Apr 27, 2009 at 4:32 AM, Randy Syring wrote:
>
>> I have a page that accepts URL arguments like:
>>
>> /student/
>>
>> The id must be an integer or the URL doesn't match and the user is given a
>> 404. But what should I do if the id is given, is an integer, but a student
>> with that id does not exist? I already output a message telling the user
>> that they requested an invalid student. However, should that document have
>> a 200 or 400 (or some other) status code?
>>
>
> Obviously a 404 too, as the URL identifies something that doesn't exist.
>
> (in the case of an invalid id, i.e. not a number, you could use 410
> status code too)
>
>
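
[Illustration, not part of the original messages.] The answer in this
thread (send the explanatory page, but with a 404 status) can be sketched
as a small WSGI application. This is only a sketch under stated
assumptions: the `STUDENTS` dictionary, the `/student/<integer>` URL
layout, and the `call` helper are hypothetical and stand in for whatever
the real application uses.

```python
import re

# Hypothetical data store; a real application would query a database.
STUDENTS = {1: "Ada", 2: "Grace"}

# Assumed URL layout: /student/ followed by an integer id.
STUDENT_URL = re.compile(r"^/student/([0-9]+)$")


def app(environ, start_response):
    """Return 200 for a known student, 404 for anything else."""
    match = STUDENT_URL.match(environ.get("PATH_INFO", ""))
    student = STUDENTS.get(int(match.group(1))) if match else None
    if student is None:
        # Still send a helpful message, but with a 404 status.
        status = "404 Not Found"
        body = b"No such student.\n"
    else:
        status = "200 OK"
        body = ("Student: %s\n" % student).encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call(path):
    """Invoke the app for one path and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], body


print(call("/student/1")[0])
print(call("/student/99")[0])
```

The same 404 branch covers both the URL that does not match and the
integer id with no matching student, which is the behaviour Thomas
describes.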
From randy at rcs-comp.com  Mon Apr 27 23:19:18 2009
From: randy at rcs-comp.com (Randy Syring)
Date: Mon, 27 Apr 2009 17:19:18 -0400
Subject: [Web-SIG] empty action attribute with forms in Google Chrome
Message-ID: <49F62156.5080006@rcs-comp.com>

For the last four years, I have always used an empty action attribute on
my form to make it post back to the current URL. I almost always
validate my HTML and this has never come up as a violation. Furthermore,
I have read various people on the web advocating this practice.

Recently, however, I went to use Google Chrome to look at some of my web
apps and I noticed that none of my forms work. I use a tag and empty
form attributes. Whenever I submit a form in Chrome, it gets posted to
the root URL (i.e. what I have in my tag). Am I violating the spec or is
this something Google Chrome got wrong? What I have works in IE, FF, and
Opera.

Thanks.

From t.broyer at gmail.com  Tue Apr 28 10:01:04 2009
From: t.broyer at gmail.com (Thomas Broyer)
Date: Tue, 28 Apr 2009 10:01:04 +0200
Subject: [Web-SIG] empty action attribute with forms in Google Chrome
In-Reply-To: <49F62156.5080006@rcs-comp.com>
References: <49F62156.5080006@rcs-comp.com>
Message-ID:

On Mon, Apr 27, 2009 at 11:19 PM, Randy Syring wrote:
> For the last four years, I have always used an empty action attribute on my
> form to make it post back to the current URL. I almost always validate my
> HTML and this has never come up as a violation. Furthermore, I have read
> various people on the web advocating this practice.
>
> Recently, however, I went to use Google Chrome to look at some of my web
> apps and I noticed that none of my forms work. I use a tag and
> empty form attributes.
Whenever I submit a form in Chrome, it gets posted
> to the root URL (i.e. what I have in my tag). Am I violating the
> spec or is this something Google Chrome got wrong?

You are violating the spec (or, actually, this is a bit of a blurry
thing in the spec re. a "same document reference").

> What I have works in IE, FF, and Opera.

Yes, because they're violating the spec too. HTML5 defines the form
submission to violate the RFC 3986 to make it work like IE, FF and
Opera:
http://www.w3.org/TR/html5/forms.html#form-submission-algorithm (step 9)
The comments there (an HTML comment, look at the source of the page) says:


(I'm not sure web-sig is the appropriate list for these questions, as
they're unrelated to Python; maybe http://www.whatwg.org/mailing-list
or http://forums.whatwg.org/ )

--
Thomas Broyer

From randy at rcs-comp.com  Tue Apr 28 16:38:54 2009
From: randy at rcs-comp.com (Randy Syring)
Date: Tue, 28 Apr 2009 10:38:54 -0400
Subject: [Web-SIG] empty action attribute with forms in Google Chrome
In-Reply-To:
References: <49F62156.5080006@rcs-comp.com>
Message-ID: <49F714FE.4050904@rcs-comp.com>

Thomas,

Thanks for your info. Looks like I need to change my SOP.

And you are right, I should find a different list for these questions.
I am using a python web app, but these questions are generic enough to
go somewhere else.

Thanks for the kind word and your advice.

Thomas Broyer wrote:
> On Mon, Apr 27, 2009 at 11:19 PM, Randy Syring wrote:
>
>> For the last four years, I have always used an empty action attribute on my
>> form to make it post back to the current URL. I almost always validate my
>> HTML and this has never come up as a violation. Furthermore, I have read
>> various people on the web advocating this practice.
>>
>> Recently, however, I went to use Google Chrome to look at some of my web
>> apps and I noticed that none of my forms work. I use a tag and
>> empty form attributes. Whenever I submit a form in Chrome, it gets posted
>> to the root URL (i.e. what I have in my tag). Am I violating the
>> spec or is this something Google Chrome got wrong?
>>
>
> You are violating the spec (or, actually, this is a bit of a blurry thing
> in the spec re. a "same document reference").
>
>
>> What I have works in IE, FF, and Opera.
>>
>
> Yes, because they're violating the spec too. HTML5 defines the form
> submission to violate the RFC 3986 to make it work like IE, FF and
> Opera:
> http://www.w3.org/TR/html5/forms.html#form-submission-algorithm (step 9)
> The comments there (an HTML comment, look at the source of the page) says:
>
>
> (I'm not sure web-sig is the appropriate list for these questions, as
> they're unrelated to Python; maybe http://www.whatwg.org/mailing-list
> or http://forums.whatwg.org/ )
>
>
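
[Illustration, not part of the original messages.] The two behaviours
contrasted in this thread can be sketched with `urllib.parse.urljoin`,
which returns the base unchanged for an empty reference. This is a sketch
under assumptions: the tag name is missing from the archived message, so
the example assumes it was a `<base href="...">` declaration; the function
`form_target`, its `legacy` flag, and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def form_target(document_url, base_href, action, legacy=True):
    """Return the URL a form would be submitted to.

    legacy=True models the behaviour the thread attributes to IE, FF and
    Opera: an empty action posts back to the document's own URL.
    legacy=False resolves the action against the base URL, which is the
    behaviour reported for Chrome.
    """
    if action == "" and legacy:
        return document_url
    # Without a base declaration, the document URL is the base.
    return urljoin(base_href or document_url, action)


doc = "http://example.com/app/edit?id=3"
base = "http://example.com/"

print(form_target(doc, base, ""))                # posts back to the page
print(form_target(doc, base, "", legacy=False))  # posts to the base URL
```

Under these assumptions, giving the form an explicit action, rather than
an empty one, avoids depending on either behaviour.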