From graham.dumpleton at gmail.com Wed Jan 9 05:27:06 2008
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 9 Jan 2008 15:27:06 +1100
Subject: [Web-SIG] WSGI Content-Length issues.
Message-ID: <88e286470801082027u112d908bxf4185e583a13feac@mail.gmail.com>

Can the group mind provide some clarification on the following please.

1. The WSGI specification does not require that a WSGI adapter provide
an EOF indicator if an attempt is made to read more data from
wsgi.input than defined by request Content-Length. Is a WSGI adapter
nonetheless required to explicitly discard any request content which
wasn't consumed, or is it the WSGI application's responsibility to
ensure that all request content up to the length specified is always
consumed?

I have seen some reports to suggest that some WSGI adapter/servers do
not discard unread content up to Content-Length, resulting in the
problem that if Keep-Alive was enabled the server may incorrectly
try to interpret the remaining content as the header of the next
request on that same socket connection. Some spam bots on the net
which POST to arbitrary URLs are quite good at triggering this
scenario where WSGI applications don't consume request content when
they weren't expecting it.

If the WSGI specification isn't clear on the responsibilities of a
WSGI adapter to discard any request content that wasn't consumed, then
any WSGI application that wants to work on all hosting mechanisms
would have to ensure it always consumes request content, even when
none is expected for a URL.

2. If a WSGI application sets a Content-Length in a response and then
returns response content of a greater length, should the WSGI adapter
attempt to discard any additional output beyond the length set by the
application or just pass it through? What obligations do WSGI
middleware have in this respect?
If the answer is that the WSGI adapter shouldn't care and should just
pass everything through, then would it be seen as at least prudent
that the WSGI adapter log a warning message that the returned response
content differs in length from the specified Content-Length? The same
applies where a WSGI application finished successfully but didn't
return as much output as it said it was going to.

3. Similarly, where a WSGI adapter supports wsgi.file_wrapper and the
Content-Length header was set in the response, should the WSGI adapter
send only at most that amount of data? This question applies whether
or not the WSGI adapter is able to optimise the sending of the
response because of the presence of fileno() or other platform
specific feature which would facilitate such optimisations.

4. Where a WSGI adapter supports wsgi.file_wrapper and the
Content-Length header was NOT set in the response, where optimisations
are being performed and the WSGI adapter can (or must in order to send
it) calculate the length of the output, can the WSGI adapter add its
own Content-Length header indicating the actual amount of response
content sent?

Graham

From brian at briansmith.org Wed Jan 9 08:33:56 2008
From: brian at briansmith.org (Brian Smith)
Date: Tue, 8 Jan 2008 23:33:56 -0800
Subject: [Web-SIG] WSGI Content-Length issues.
In-Reply-To: <88e286470801082027u112d908bxf4185e583a13feac@mail.gmail.com>
References: <88e286470801082027u112d908bxf4185e583a13feac@mail.gmail.com>
Message-ID: <000401c85291$fef4c9d0$2401a8c0@T60>

Graham Dumpleton wrote:
> Can the group mind provide some clarification on the following please.
>
> 1. The WSGI specification does not require that a WSGI
> adapter provide an EOF indicator if an attempt is made to
> read more data from wsgi.input than defined by request
> Content-Length.

This is not a problem when the Content-Length header is provided in
the request, because the application should never read more than
Content-Length bytes.
RFC 2616 says "The presence of a message-body in a request is signaled
by the inclusion of a Content-Length or Transfer-Encoding header field
in the request's message-headers." If those headers are missing, then
the application has to assume there is no message body, and the WSGI
gateway is free to dispose of any message body it can detect.

I do agree that the handling of chunked request bodies is not ideal;
the current wording implies that the gateway must buffer the entire
chunked request body until it can calculate the Content-Length, before
calling the application object. This pretty much defeats the purpose
of chunked encoding. On the other hand, it is a pretty minor issue
because chunked request bodies are very rare.

> Is though a WSGI adapter required to
> explicitly discard any request content which wasn't consumed
> or is the WSGI applications responsibility to ensure that all
> request content up to the length specified is always consumed?

Given the existing body of applications that ignore extraneous message
bodies, it only makes sense to put the burden on the gateway. In
particular, a request entity is allowed syntactically on a GET
request, but any such entity must not affect the semantics of the
request--that is, an application should always ignore it. And, I've
never seen any WSGI applications that attempt to consume request
entities on a GET request. It is pretty common to ignore the request
entities on PUT and POST requests too (e.g. for conditional requests).

> I have seen some reports to suggest that some WSGI
> adapter/servers do not discard unread content up to
> Content-Length, resulting in the problem that if Keep-Alive
> was enabled that the server may incorrectly try and interpret
> the remaining content as the header of the next request on
> that same socket connection.

If the WSGI gateway cannot detect the end of one request and the start
of the next one, regardless of what the application does, then it is
faulty.
That is the primary reason that RFC 2616 requires Content-Length or
Transfer-Encoding headers on messages with entity bodies. The WSGI
spec could be more explicit, but I don't think anybody is going to
stand up and say "I refuse to parse requests correctly because PEP 333
doesn't explicitly require me to." I think we just need to report
these bugs to the gateway authors and let (help) them fix them.

> 2. If a WSGI application sets a Content-Length in a response
> and then returns response content of a greater length, should
> the WSGI adapter attempt to discard any additional output
> beyond the length set by the application or just pass it
> through? What obligations do WSGI middleware have in this respect?
>
> If the answer is that the WSGI adapter shouldn't care and
> should just pass everything through, then would it be seen as
> at least prudent that the WSGI adapter log a warning message
> that the returned response content differs in length to the
> specified Content-Length? Same applies where a WSGI
> application finished successfully but didn't return as much
> output as it said it was going to.

If the application wants well-defined behavior, then it should always
ensure that it sends a response body that is exactly Content-Length
bytes long. That is because all the front-end web servers, proxy
servers, and client applications that process the response depend on
the response being compliant with RFC 2616. When the Content-Length
header is wrong, the results are unpredictable, regardless of what the
WSGI gateway tries to do.

When you have to choose between being compliant with RFC 2616 or being
compliant with PEP 333, always choose RFC 2616. Consequently, the
server is free to do whatever it wants when the Content-Length is
wrong: it can truncate overly long entities, or drop the connection
entirely. Such results are likely to occur somewhere along the way to
the client anyway. The application shouldn't expect a successful or
even consistent result.
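
[Editorial sketch] The two gateway-side behaviours discussed so far
(discarding unread request content before reusing a Keep-Alive
connection, and truncating a response that overruns its declared
Content-Length) might look roughly like the following. This is only a
minimal illustration: the helper names drain_input and clamp_body are
hypothetical, PEP 333 does not define them, and a real gateway would
also have to handle timeouts, chunked bodies and logging.

```python
import io


def drain_input(stream, remaining, chunk_size=8192):
    """Read and discard up to `remaining` unread request-body bytes.

    Hypothetical gateway-side helper: `remaining` is the request
    Content-Length minus whatever the application already consumed.
    Returns the number of bytes actually discarded.
    """
    discarded = 0
    while discarded < remaining:
        chunk = stream.read(min(chunk_size, remaining - discarded))
        if not chunk:  # the client stopped sending early
            break
        discarded += len(chunk)
    return discarded


def clamp_body(chunks, content_length):
    """Yield response chunks, never exceeding `content_length` bytes in total."""
    sent = 0
    for chunk in chunks:
        if sent >= content_length:
            break  # anything past the declared length is dropped
        chunk = chunk[:content_length - sent]
        sent += len(chunk)
        if chunk:
            yield chunk


# Usage: an 11-byte body the application never read, followed by the
# start of the next request on the same connection.
connection = io.BytesIO(b"unread-bodyGET /next HTTP/1.1\r\n")
drain_input(connection, 11)
next_request_start = connection.read(3)  # now positioned at the next request

body = b"".join(clamp_body([b"hello ", b"world!!!"], 11))
```

Dropping the connection instead of truncating would be equally
defensible under the reasoning above; the sketch only shows one of the
options.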
(Note that when I say "the Content-Length is wrong" I am not referring
to the case where the application does not include a Content-Length
header at all.)

> 3. Similarly, where a WSGI adapter supports wsgi.file_wrapper
> and the Content-Length header was set in the response, should
> the WSGI adapter send only at most that amount of data? This
> question applies whether or not the WSGI adapter is able to
> optimise the sending of the response because of the presence
> of fileno() or other platform specific feature which would
> facilitate such optimisations.

The specification is clear about this: "The semantics [...] should be
the same as if the application had returned iter(filelike.read, '').
In other words, transmission should begin at the current position
within the "file" at the time that transmission begins, and continue
until the end is reached." However, I think this is truly an error in
the specification--the gateway should not be required to send more
than Content-Length bytes if the application set the Content-Length
header.

Really, this is just a special case of the situation described above,
where the application is trying to send a larger (or smaller) body
than it claimed in the Content-Length header. Again, when you have to
choose between being compliant with RFC 2616 or being compliant with
PEP 333, always choose RFC 2616.

> 4. Where a WSGI adapter supports wsgi.file_wrapper and the
> Content-Length header was NOT set in the response, where
> optimisations are being performed and the WSGI adapter can
> (or must in order to send it) calculate the length of the
> output, can the WSGI adapter add its own Content-Length
> header indicating the actual amount of response content sent.
PEP 333 already clearly states that the WSGI gateway can add a
Content-Length header whenever it wants to, if the application didn't
supply one: "[...T]he server or gateway may be able to either generate
a Content-Length header, or at least avoid the need to close the
client connection."

I do think that it is a good idea to include these clarifications in
(an addendum to) the WSGI spec, as these are all issues that are often
overlooked in implementations.

- Brian

From chris at simplistix.co.uk Mon Jan 14 18:15:44 2008
From: chris at simplistix.co.uk (Chris Withers)
Date: Mon, 14 Jan 2008 17:15:44 +0000
Subject: [Web-SIG] serving (potentially large) files through wsgi?
In-Reply-To: <20071221163128.9FBB93A40A4@sparrow.telecommunity.com>
References: <475D2C39.2000409@simplistix.co.uk>
 <20071217195209.67DE83A40A4@sparrow.telecommunity.com>
 <4766D6D5.6010403@libero.it>
 <20071217210700.45C543A40A4@sparrow.telecommunity.com>
 <47683279.9010008@libero.it>
 <476855F2.3080007@colorstudy.com>
 <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com>
 <476905DD.5060100@simplistix.co.uk>
 <4769262B.3050106@libero.it>
 <476ADADA.10809@simplistix.co.uk>
 <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com>
 <000001c8436a$b54a02e0$0701a8c0@Junk>
 <20071221011247.EA0413A40AC@sparrow.telecommunity.com>
 <476BC7B4.3020603@simplistix.co.uk>
 <20071221163128.9FBB93A40A4@sparrow.telecommunity.com>
Message-ID: <478B98C0.4060404@simplistix.co.uk>

Phillip J. Eby wrote:
> At 02:03 PM 12/21/2007 +0000, Chris Withers wrote:
>> I think I'm missing something: what in the logging package makes you
>> log by which module issued the message?
>
> That's the conventional usage: modules that use logging usually use a
> static logger based on module name. Take a look at the distutils, for
> example.

Yeah, but I don't see anything in the logging package that enforces
this convention...
> It's not common for modules that do logging, to take logger objects as > part of their API, and if they did, it would almost certainly suck. Why would they need to? The logging module has its own registry of loggers. getLogger('x.y.z') only creates a logger if it doesn't already exist... cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From pje at telecommunity.com Mon Jan 14 18:31:04 2008 From: pje at telecommunity.com (Phillip J. Eby) Date: Mon, 14 Jan 2008 12:31:04 -0500 Subject: [Web-SIG] serving (potentially large) files through wsgi? In-Reply-To: <478B98C0.4060404@simplistix.co.uk> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> Message-ID: <20080114173108.DF9623A4077@sparrow.telecommunity.com> At 05:15 PM 1/14/2008 +0000, Chris Withers wrote: >Phillip J. Eby wrote: >>At 02:03 PM 12/21/2007 +0000, Chris Withers wrote: >>>I think I'm missing something: what in the logging package makes >>>you log by which module issued the message? >>That's the conventional usage: modules that use logging usually use >>a static logger based on module name. Take a look at the >>distutils, for example. > >Yeah, but I don't see anything in the logging package that enforces >this convention... 
> >>It's not common for modules that do logging, to take logger objects >>as part of their API, and if they did, it would almost certainly suck. > >Why would they need to? The logging module has its own registry of loggers. > >getLogger('x.y.z') only creates a logger if it doesn't already exist... You're only shifting the issue from taking loggers as arguments, to logger *names* as arguments. This doesn't change the problem in the least -- it just adds the overhead of doing a lookup. From chris at simplistix.co.uk Tue Jan 15 15:05:46 2008 From: chris at simplistix.co.uk (Chris Withers) Date: Tue, 15 Jan 2008 14:05:46 +0000 Subject: [Web-SIG] serving (potentially large) files through wsgi? In-Reply-To: <20080114173108.DF9623A4077@sparrow.telecommunity.com> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> Message-ID: <478CBDBA.2000109@simplistix.co.uk> Phillip J. Eby wrote: >> Why would they need to? The logging module has its own registry of >> loggers. >> >> getLogger('x.y.z') only creates a logger if it doesn't already exist... > > You're only shifting the issue from taking loggers as arguments, to > logger *names* as arguments. Huh? How so? Just compute the logger name where you need it, or use a page path or however you want to slice'n'dice your loggers... 
Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From pje at telecommunity.com Tue Jan 15 16:16:29 2008 From: pje at telecommunity.com (Phillip J. Eby) Date: Tue, 15 Jan 2008 10:16:29 -0500 Subject: [Web-SIG] serving (potentially large) files through wsgi? In-Reply-To: <478CBDBA.2000109@simplistix.co.uk> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> <478CBDBA.2000109@simplistix.co.uk> Message-ID: <20080115151630.371483A40AE@sparrow.telecommunity.com> At 02:05 PM 1/15/2008 +0000, Chris Withers wrote: >Phillip J. Eby wrote: >>>Why would they need to? The logging module has its own registry of loggers. >>> >>>getLogger('x.y.z') only creates a logger if it doesn't already exist... >>You're only shifting the issue from taking loggers as arguments, to >>logger *names* as arguments. > >Huh? How so? Just compute the logger name where you need it, or use >a page path or however you want to slice'n'dice your loggers... And how is, say, an SQL connection object supposed to know what the "page path" is? That was the whole point of this thread: that without passing logger objects around, or having some other dynamic context, there's no way for libraries to direct their log information to the right place. 
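
[Editorial sketch] The "dynamic context" alternative mentioned above
can be illustrated as follows. The sketch assumes a modern Python with
the contextvars module, which did not exist when this thread was
written; the names current_path, RequestPathFilter and
path_context_middleware are hypothetical. The point is only that a
library such as an SQL connection can keep logging to its own static
logger while a filter attaches the current request path to each record.

```python
import contextvars
import logging

# Hypothetical name: holds the path of the request currently being handled.
current_path = contextvars.ContextVar("current_path", default="-")


class RequestPathFilter(logging.Filter):
    """Copy the current request path onto every log record."""

    def filter(self, record):
        record.page_path = current_path.get()
        return True


def path_context_middleware(app):
    """WSGI middleware that publishes PATH_INFO while the app callable runs."""
    def wrapper(environ, start_response):
        token = current_path.set(environ.get("PATH_INFO", "-"))
        try:
            return app(environ, start_response)
        finally:
            # Restore the previous value once the application has returned.
            current_path.reset(token)
    return wrapper
```

A handler with RequestPathFilter attached can then use
"%(page_path)s" in its format string to route or label records. One
limitation of this sketch: the context is reset when the application
callable returns, so log calls made later, while the server iterates a
lazily generated response body, would see the default value.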
From chris at simplistix.co.uk Wed Jan 16 16:05:26 2008 From: chris at simplistix.co.uk (Chris Withers) Date: Wed, 16 Jan 2008 15:05:26 +0000 Subject: [Web-SIG] loggers and wsgi In-Reply-To: <20080115151630.371483A40AE@sparrow.telecommunity.com> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> <478CBDBA.2000109@simplistix.co.uk> <20080115151630.371483A40AE@sparrow.telecommunity.com> Message-ID: <478E1D36.5010600@simplistix.co.uk> Phillip J. Eby wrote: > And how is, say, an SQL connection object supposed to know what the > "page path" is? That was the whole point of this thread: that without > passing logger objects around, or having some other dynamic context, > there's no way for libraries to direct their log information to the > right place. Well, this feels pretty odd to me, but I guess each to their own. Regardless, the problems of wsgi.errors not having any clue about log levels make it an unappealing prospect to use. Still, there's no problem with a wsgi application doing its own logging to its own log files, right? 
cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From manlio_perillo at libero.it Wed Jan 16 16:21:50 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Wed, 16 Jan 2008 16:21:50 +0100 Subject: [Web-SIG] loggers and wsgi In-Reply-To: <478E1D36.5010600@simplistix.co.uk> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> <478CBDBA.2000109@simplistix.co.uk> <20080115151630.371483A40AE@sparrow.telecommunity.com> <478E1D36.5010600@simplistix.co.uk> Message-ID: <478E210E.4010308@libero.it> Chris Withers ha scritto: > Phillip J. Eby wrote: >> And how is, say, an SQL connection object supposed to know what the >> "page path" is? That was the whole point of this thread: that without >> passing logger objects around, or having some other dynamic context, >> there's no way for libraries to direct their log information to the >> right place. > > Well, this feels pretty odd to me, but I guess each to their own. > > Regardless, the problems of wsgi.errors not having any clue about log > levels make it an unappealing prospect to use. wsgi.errors maybe should have an optional method: .msg(level, *args) where args is a list of strings or .msg(*args, **kwargs) where the keys in kwargs are implementation defined. 
> Still, there's no problem > with a wsgi application doing its own logging to its own log files, right? > There is an interoperability problem with external tools like logrotate, since some WSGI implementation are unable to catch signals. > cheers, > > Chris > Manlio Perillo From chris at simplistix.co.uk Thu Jan 17 13:12:49 2008 From: chris at simplistix.co.uk (Chris Withers) Date: Thu, 17 Jan 2008 12:12:49 +0000 Subject: [Web-SIG] loggers and wsgi In-Reply-To: <478E210E.4010308@libero.it> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> <478CBDBA.2000109@simplistix.co.uk> <20080115151630.371483A40AE@sparrow.telecommunity.com> <478E1D36.5010600@simplistix.co.uk> <478E210E.4010308@libero.it> Message-ID: <478F4641.20103@simplistix.co.uk> Manlio Perillo wrote: > > wsgi.errors maybe should have an optional method: > .msg(level, *args) > > where args is a list of strings > > or > .msg(*args, **kwargs) > > where the keys in kwargs are implementation defined. I don't really see how this helps. If it's optional, then ever wsgi app will need a bunch of if/then/else to decide if this method can be called and what to do instead. Likewise, having implementation defined parameters means the application developer has to tie the app to a list of compatible servers and cater for each one. 
Surely a much better idea would be to give wsgi.errors a logger attribute which behaved like a standard python logger? (or, in fact, just make wsgi.error a python logger object...) The only problem here is that the level specified won't necessarilly match up to the server's idea of levels, but this is a mapping that can either be done intelligently in the server implementation or, worst case, by the person putting the components together in the server configuration files. >> Still, there's no problem with a wsgi application doing its own >> logging to its own log files, right? >> > There is an interoperability problem with external tools like logrotate, > since some WSGI implementation are unable to catch signals. That's why logrotate has copy-truncate ;-) cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From manlio_perillo at libero.it Thu Jan 17 13:33:36 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 17 Jan 2008 13:33:36 +0100 Subject: [Web-SIG] loggers and wsgi In-Reply-To: <478F4641.20103@simplistix.co.uk> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> <478CBDBA.2000109@simplistix.co.uk> <20080115151630.371483A40AE@sparrow.telecommunity.com> <478E1D36.5010600@simplistix.co.uk> 
<478E210E.4010308@libero.it> <478F4641.20103@simplistix.co.uk> Message-ID: <478F4B20.5060707@libero.it> Chris Withers ha scritto: > Manlio Perillo wrote: >> >> wsgi.errors maybe should have an optional method: >> .msg(level, *args) >> >> where args is a list of strings >> >> or >> .msg(*args, **kwargs) >> >> where the keys in kwargs are implementation defined. > > I don't really see how this helps. If it's optional, then ever wsgi app > will need a bunch of if/then/else to decide if this method can be called > and what to do instead. > This is not a problem. The job can be done by a middleware. My idea is to add a message like interface to wsgi.input, in addition to the stream interface. > Likewise, having implementation defined parameters means the application > developer has to tie the app to a list of compatible servers and cater > for each one. > Again, not a real problem, IMHO. This is the only solution for better support several environments. > Surely a much better idea would be to give wsgi.errors a logger > attribute which behaved like a standard python logger? > (or, in fact, just make wsgi.error a python logger object...) > No, I think this is wrong. This can be done, of course, by a middleware. > The only problem here is that the level specified won't necessarilly > match up to the server's idea of levels, but this is a mapping that can > either be done intelligently in the server implementation or, worst > case, by the person putting the components together in the server > configuration files. > >>> Still, there's no problem with a wsgi application doing its own >>> logging to its own log files, right? >>> >> There is an interoperability problem with external tools like >> logrotate, since some WSGI implementation are unable to catch signals. > > That's why logrotate has copy-truncate ;-) > This is only a work around. I think that, where possible, WSGI must allow better integration with the "server environment". 
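
[Editorial sketch] The "this can be done by a middleware" position
above could be realised along these lines: a middleware wraps the
per-request wsgi.errors stream in a standard logging.Logger and hands
it to the application through the environ. The environ key "x.logger"
and the logger name are hypothetical; nothing in PEP 333 defines them,
and mapping Python log levels onto the server's own levels is left out
entirely.

```python
import logging


def errors_logger_middleware(app, name="wsgi.app"):
    """Expose a per-request logging.Logger that writes to wsgi.errors.

    The environ key 'x.logger' is an assumption made for this sketch.
    """
    def wrapper(environ, start_response):
        handler = logging.StreamHandler(environ["wsgi.errors"])
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        # Instantiated directly so it is not registered globally and
        # stays private to this request.
        logger = logging.Logger(name)
        logger.addHandler(handler)
        environ["x.logger"] = logger
        return app(environ, start_response)
    return wrapper
```

An application would then call environ["x.logger"].warning(...) and
fall back to writing to wsgi.errors directly when the key is absent,
which is the kind of conditional code the objection quoted above is
worried about.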
By the way, there is still the problem with a stream/message object not bound to a single request; this is required by applications that needs to log, as an example, a database connection pool. > cheers, > > Chris > Manlio Perillo From manlio_perillo at libero.it Thu Jan 17 20:34:09 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 17 Jan 2008 20:34:09 +0100 Subject: [Web-SIG] about wsgiref.headers.Headers Message-ID: <478FADB1.7040802@libero.it> Hi. What is the rationale for Headers._headers being private? Thanks Manlio Perillo From pje at telecommunity.com Thu Jan 17 21:32:32 2008 From: pje at telecommunity.com (Phillip J. Eby) Date: Thu, 17 Jan 2008 15:32:32 -0500 Subject: [Web-SIG] about wsgiref.headers.Headers In-Reply-To: <478FADB1.7040802@libero.it> References: <478FADB1.7040802@libero.it> Message-ID: <20080117203233.80FBC3A4077@sparrow.telecommunity.com> At 08:34 PM 1/17/2008 +0100, Manlio Perillo wrote: >Hi. > >What is the rationale for Headers._headers being private? The code was mostly a copy-and-paste job from email.Message, which did the same. At one point, it might actually have been a subclass of email.Message, and so it was required. From manlio_perillo at libero.it Thu Jan 17 22:58:36 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 17 Jan 2008 22:58:36 +0100 Subject: [Web-SIG] about wsgiref.headers.Headers In-Reply-To: <20080117203233.80FBC3A4077@sparrow.telecommunity.com> References: <478FADB1.7040802@libero.it> <20080117203233.80FBC3A4077@sparrow.telecommunity.com> Message-ID: <478FCF8C.3080602@libero.it> Phillip J. Eby ha scritto: > At 08:34 PM 1/17/2008 +0100, Manlio Perillo wrote: >> Hi. >> >> What is the rationale for Headers._headers being private? > > The code was mostly a copy-and-paste job from email.Message, which did > the same. At one point, it might actually have been a subclass of > email.Message, and so it was required. 
> Having the `_headers` private is ok for the email.Message, but I think it is wrong for wsgiref, since the Headers class is just a wrapper for the headers list. Manlio Perillo From chris at simplistix.co.uk Fri Jan 18 11:36:53 2008 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 18 Jan 2008 10:36:53 +0000 Subject: [Web-SIG] loggers and wsgi In-Reply-To: <478F4B20.5060707@libero.it> References: <475D2C39.2000409@simplistix.co.uk> <20071217195209.67DE83A40A4@sparrow.telecommunity.com> <4766D6D5.6010403@libero.it> <20071217210700.45C543A40A4@sparrow.telecommunity.com> <47683279.9010008@libero.it> <476855F2.3080007@colorstudy.com> <88e286470712181736g24ba8b73i16dcfe1b3b256de6@mail.gmail.com> <476905DD.5060100@simplistix.co.uk> <4769262B.3050106@libero.it> <476ADADA.10809@simplistix.co.uk> <88e286470712201600k549c4c9dnc6248e404c302f52@mail.gmail.com> <000001c8436a$b54a02e0$0701a8c0@Junk> <20071221011247.EA0413A40AC@sparrow.telecommunity.com> <476BC7B4.3020603@simplistix.co.uk> <20071221163128.9FBB93A40A4@sparrow.telecommunity.com> <478B98C0.4060404@simplistix.co.uk> <20080114173108.DF9623A4077@sparrow.telecommunity.com> <478CBDBA.2000109@simplistix.co.uk> <20080115151630.371483A40AE@sparrow.telecommunity.com> <478E1D36.5010600@simplistix.co.uk> <478E210E.4010308@libero.it> <478F4641.20103@simplistix.co.uk> <478F4B20.5060707@libero.it> Message-ID: <47908145.1050408@simplistix.co.uk> Manlio Perillo wrote: >> I don't really see how this helps. If it's optional, then ever wsgi >> app will need a bunch of if/then/else to decide if this method can be >> called and what to do instead. >> > This is not a problem. The job can be done by a middleware. ...which everyone will then have to use... > My idea is to add a message like interface to wsgi.input, in addition to > the stream interface. That does sound like a plan, although I'd prefer *just* the message like interface, and the more it smells like the standard python logging framework the better. 
>> Likewise, having implementation defined parameters means the >> application developer has to tie the app to a list of compatible >> servers and cater for each one. >> > Again, not a real problem, IMHO. > This is the only solution for better support several environments. I'll respectfully disagree with you there ;-) >> Surely a much better idea would be to give wsgi.errors a logger >> attribute which behaved like a standard python logger? >> (or, in fact, just make wsgi.error a python logger object...) > > No, I think this is wrong. Why? (the important point is that it behaves like a python logger object, not too fussed about how it's implemented...) >> That's why logrotate has copy-truncate ;-) > > This is only a work around. > I think that, where possible, WSGI must allow better integration with > the "server environment". Again, I'll respectfully disagree with you there on both counts... > By the way, there is still the problem with a stream/message object not > bound to a single request; this is required by applications that needs > to log, as an example, a database connection pool. Not sure what you mean... Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From ben at groovie.org Sat Jan 19 03:02:38 2008 From: ben at groovie.org (Ben Bangert) Date: Fri, 18 Jan 2008 18:02:38 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) Message-ID: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> I unfortunately couldn't find anything in the WSGI spec to indicate whether or not I could expect environ variables relating to the URL to be URL decoded when I get them or whether they reflect the raw URL that was sent to the browser. This recently became an issue, when a user noticed that the %2B URL encoding for a + sign, had turned into a space when it hit their app. Sure enough, Paste was doing URL un-quoting, then Routes, and the double URL un-quote resulted in the + being a space. 
Is there some definitive word on whether a WSGI application should expect to have it URL un-quoted or not? Cheers, Ben From fumanchu at aminus.org Sat Jan 19 04:07:36 2008 From: fumanchu at aminus.org (Robert Brewer) Date: Fri, 18 Jan 2008 19:07:36 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> Message-ID: Ben Bangert wrote: > I unfortunately couldn't find anything in the WSGI spec to indicate > whether or not I could expect environ variables relating to the URL to > be URL decoded when I get them or whether they reflect the raw URL > that was sent to the browser. > > This recently became an issue, when a user noticed that the %2B URL > encoding for a + sign, had turned into a space when it hit their app. > Sure enough, Paste was doing URL un-quoting, then Routes, and the > double URL un-quote resulted in the + being a space. > > Is there some definitive word on whether a WSGI application should > expect to have it URL un-quoted or not? The last time I asked that question here [1], Phillip kindly pointed out to me that that's defined by the CGI spec. I could go through the agony of distributed English-obfuscated BNF analysis again, but I'll just note that I changed CP's wsgiserver to do decoding that very day. So I think the answer is "yes".
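A minimal sketch of the double un-quoting described above, assuming (hypothetically) that the first layer applies plain percent-decoding and the second applies form-style decoding that also maps "+" to a space. The thread does not say which functions Paste and Routes actually called, and the example path is made up. It uses Python 3's urllib.parse rather than the 2008-era urllib module.

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical raw request path containing an encoded "+" sign.
raw_path = "/search/a%2Bb"

# First layer (e.g. the server): plain percent-decoding.
decoded_once = unquote(raw_path)

# Second layer (e.g. a router): form-style decoding, which treats "+" as a space.
decoded_twice = unquote_plus(decoded_once)

print(decoded_once)   # the "+" the user intended
print(decoded_twice)  # the "+" has silently become a space
```

Decoding exactly once, in a single well-defined layer, avoids the problem whichever layer that turns out to be.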
Robert Brewer fumanchu at aminus.org [1] http://mail.python.org/pipermail/web-sig/2006-August/002230.html From lbruno at 100blossoms.com Sat Jan 19 15:38:09 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Sat, 19 Jan 2008 14:38:09 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> Message-ID: <47920B51.3050201@100blossoms.com> Hello y'all, delurking, I'm using a /-delimited path, %-encoding each literal '/' appearing in the path segments. I was not amused to see egg:Paste#http urldecoding the whole PATH_INFO. Ben Bangert wrote: > This recently became an issue, when a user noticed that the %2B URL > encoding for a + sign, had turned into a space when it hit their app. A swift monkey-patch to paste.httpserver.py:WSGIHandlerMixin.wsgi_setup() later, and ORIGINAL_PATH_INFO is part of the WSGI spec in my world. The following URL now Does The Right Thing: http://127.0.0.1:5000/catalog/NEC/Computers/Laptops/LN500%2F9DW/ Robert Brewer wrote: > I changed CP's wsgiserver to do decoding that very day. So I think the > answer is "yes". IMHO "yes" is the wrong answer; I am also very unsure about what is the right answer. I have to walk [urldecode(segment) for segment in ORIGINAL_PATH_INFO.split('/')]; this doesn't look like the Right Answer to me anyway. -- Luís Bruno From fumanchu at aminus.org Sat Jan 19 20:13:36 2008 From: fumanchu at aminus.org (Robert Brewer) Date: Sat, 19 Jan 2008 11:13:36 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47920B51.3050201@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> Message-ID: Luis Bruno wrote: > I'm using a /-delimited path, %-encoding each literal '/' appearing in > the path segments. I was not amused to see egg:Paste#http urldecoding > the whole PATH_INFO.
All HTTP URI are /-delimited, and any '/' appearing in a single segment that is not intended to participate in the hierarchy semantics must be %-encoded before transmitting it over HTTP. I think that's what you're saying above, but I don't understand why decoding on the server or gateway is a problem. Perhaps you could expand on that: when you say "I'm using", where is that? Inside a WSGI application? > Ben Bangert wrote: > > This recently became an issue, when a user noticed that the %2B URL > > encoding for a + sign, had turned into a space when it hit their app. > > A swift monkey-patch to paste.httpserver.py:WSGIHandlerMixin.wsgi_setup() > later, and ORIGINAL_PATH_INFO is part of the WSGI spec in my world. > The following URL now Does The Right Thing: > > http://127.0.0.1:5000/catalog/NEC/Computers/Laptops/LN500%2F9DW/ Platonic Capital Letters won't get you very far with this crowd. You have to explain why you think the application should receive %XX encoded URI's instead of decoded ones. What's the benefit? I only see a con: every piece of middleware that cares has to repeat the decoding of PATH_INFO and SCRIPT_NAME, wasting CPU and memory. > Robert Brewer wrote: > > I changed CP's wsgiserver to do decoding that very day. > > So I think the answer is "yes". > > IMHO "yes" is the wrong answer Why? > I am also very unsure about what is the right answer. According to [1], the right answer is "yes": The PATH_INFO metavariable specifies a path to be interpreted by the CGI script. It identifies the resource or sub-resource to be returned by the CGI script, and it is derived from the portion of the URI path following the script name but preceding any query data. The syntax and semantics are similar to a decoded HTTP URL 'path' token (defined in RFC 2396 [4]), with the exception that a PATH_INFO of "/" represents a single void path segment. 
Robert Brewer fumanchu at aminus.org [1] http://cgi-spec.golux.com/draft-coar-cgi-v11-03-clean.html#6.1.6 From lbruno at 100blossoms.com Mon Jan 21 12:06:27 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Mon, 21 Jan 2008 11:06:27 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> Message-ID: <47947CB3.5060203@100blossoms.com> I'll top post my "solution"; scare quoted because I'm still not sure this is the smartest idea: environ['wsgiorg.path-segments'] = ['catalog', 'NEC', 'Computers', 'Laptop', 'LN500/9DW'] Robert Brewer wrote: > All HTTP URI are /-delimited, and any '/' appearing in a single segment > that is not intended to participate in the hierarchy semantics must be > %-encoded before transmitting it over HTTP. I wholeheartedly agree. And your explanation is clearer than mine. >> IMHO [changing CP's wsgiserver to do decoding] is the wrong answer > Why? > Because then I'm stuck monkey patching every WSGI server (and/or stuck using my own URL dispatcher) so that I don't lose the information that one of the forward slashes is NOT a path delimiter. You said that %-encoding is meant for slashes not participating in hierarchy semantics, if I read you correctly; so I think you'll agree with me on this. > You have to explain why you think the application should receive %XX encoded > URI's instead of decoded ones. What's the benefit? I only see a con: > every piece of middleware that cares has to repeat the decoding of > PATH_INFO and SCRIPT_NAME, wasting CPU and memory. > I was aware of this trade off, which is why I'm still not sure the application should receive the %-encoded URIs. My app was forced to split the URL on the '/' delimiters. If I can get the framework to do that job while dispatching, so much the better. Hence the solution I top posted. 
My problem rises when I output a link created from suitably %-encoding these path segments: '/'.join(['NEC', 'Computers', 'Laptop', 'LN500/9DW']) And after the user clicks that link, the framework gives me (and Routes has no way to avoid this when Paste is the one who's doing the whole path decoding): ['NEC', 'Computers', 'Laptop', 'LN500', '9DW'] Think dispatching to a ``callable(*segments, **urlvariables)``. I think we'll agree this is not what the app writer intended. And I'm out of luck if the WSGI server/dispatcher is the one doing the urldecoding. > According to [1], the right answer is "yes": > I'll see your CGI draft and raise you the URI spec[2]. When you've read the last sentence, you'll see how unoriginal the top posted solution was: > 2.4.2. When to Escape and Unescape > > A URI is always in an "escaped" form, since escaping or unescaping a > completed URI might change its semantics. Normally, the only time > escape encodings can safely be made is when the URI is being created > from its component parts; each component may have its own set of > characters that are reserved, so only the mechanism responsible for > generating or interpreting that component can determine whether or > not escaping a character will change its semantics. Likewise, a URI > must be separated into its components before the escaped characters > within those components can be safely decoded. [1] http://cgi-spec.golux.com/draft-coar-cgi-v11-03-clean.html#6.1.6 [2] . There is a CGI Informational RFC somewhere, which I've read diagonally coming here to grumble. 
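A minimal sketch of the information loss described above, and of the "split first, then decode" order that RFC 2396 section 2.4.2 asks for. The segment list comes from the message; the variable names are illustrative, and the code uses Python 3's urllib.parse rather than anything Paste or Routes actually runs.

```python
from urllib.parse import quote, unquote

segments = ["NEC", "Computers", "Laptop", "LN500/9DW"]

# Building the link: escape each segment on its own, so the literal "/"
# inside a segment becomes %2F, then join with the real delimiters.
link_path = "/".join(quote(segment, safe="") for segment in segments)

# Decoding the whole path before splitting loses the distinction.
decoded_then_split = unquote(link_path).split("/")

# Splitting on the delimiters first, then decoding each component, keeps it.
split_then_decoded = [unquote(part) for part in link_path.split("/")]

print(link_path)
print(decoded_then_split)
print(split_then_decoded)
```

The second order is only available to code that still sees the escaped path, which is why the thread keeps returning to where the decoding happens.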
-- Luís Bruno From ianb at colorstudy.com Mon Jan 21 02:30:20 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Sun, 20 Jan 2008 19:30:20 -0600 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47920B51.3050201@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> Message-ID: <4793F5AC.3030905@colorstudy.com> Luis Bruno wrote: > Hello y'all, delurking, > > I'm using a /-delimited path, %-encoding each literal '/' appearing in > the path segments. I was not amused to see egg:Paste#http urldecoding > the whole PATH_INFO. Unfortunately this is in the WSGI spec, so it's not Paste#http so much as WSGI that demands this. I think in the CGI implementations this is kind of handled by REQUEST_URI containing the quoted value. But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and having the information in duplicate places can lead to errors and unclear situations if they don't match up properly. > Ben Bangert wrote: >> This recently became an issue, when a user noticed that the %2B URL >> encoding for a + sign, had turned into a space when it hit their app. > A swift monkey-patch to > paste.httpserver.py:WSGIHandlerMixin.wsgi_setup() later, and > ORIGINAL_PATH_INFO is part of the WSGI spec in my world. The following > URL now Does The Right Thing: > > http://127.0.0.1:5000/catalog/NEC/Computers/Laptops/LN500%2F9DW/ It would be the Right Thing, except for not being WSGI. I made note of this issue on the WSGI 2.0 ideas page, but I don't think anyone (including myself) has proposed any good resolution. Diverging from CGI and leaving PATH_INFO/SCRIPT_NAME quoted would work. But it's liable to lead to bugs as it's a fairly subtle thing and for most applications the semantics won't change and people won't realize their code is broken for some corner case. I suppose we could remove SCRIPT_NAME and PATH_INFO entirely and replace them with new keys.
Ian From fumanchu at aminus.org Mon Jan 21 21:01:27 2008 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 21 Jan 2008 12:01:27 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47947CB3.5060203@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com> <47947CB3.5060203@100blossoms.com> Message-ID: Luis Bruno wrote: > Robert Brewer wrote: > > > IMHO [changing CP's wsgiserver to do decoding] is the wrong answer > > Why? > > > Because then I'm stuck monkey patching every WSGI server (and/or stuck > using my own URL dispatcher) so that I don't lose the information that > one of the forward slashes is NOT a path delimiter. You said that > %-encoding is meant for slashes not participating in hierarchy > semantics, if I read you correctly; so I think you'll agree with me on > this. Ah. Now I see. We've had a test case for this since Nov 2005 [1]. FWIW, CherryPy took the option of special-casing forward slashes; those are the only characters which are *not* decoded--they are left as %2F characters in SCRIPT_NAME and PATH_INFO [2]:

    # Unquote the path+params (e.g. "/this%20path" -> "this path").
    # http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
    #
    # But note that "...a URI must be separated into its components
    # before the escaped characters within those components can be
    # safely decoded." http://www.ietf.org/rfc/rfc2396.txt, sec 2.4.2
    atoms = [unquote(x) for x in quoted_slash.split(path)]
    path = "%2F".join(atoms)
    environ["PATH_INFO"] = path

...and CherryPy then decodes these on the WSGI-app-side, only after the dispatching step (to produce "virtual path" atoms) [3]:

    if func:
        # Decode any leftover %2F in the virtual_path atoms.
        vpath = [x.replace("%2F", "/") for x in vpath]
        request.handler = LateParamPageHandler(func, *vpath)
    else:
        request.handler = cherrypy.NotFound()

You're absolutely right; it would be nice to standardize a solution to this.
Of course, I'm going to propose we standardize *our* solution. ;) > I'll see your CGI draft and raise you the URI spec. Heh. Quoted in the code comments above. Robert Brewer fumanchu at aminus.org [1] cf http://www.cherrypy.org/ticket/393 [2] http://www.cherrypy.org/browser/trunk/cherrypy/wsgiserver/__init__.py#L314 [3] http://www.cherrypy.org/browser/trunk/cherrypy/_cpdispatch.py#L71 From lbruno at 100blossoms.com Tue Jan 22 12:25:50 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Tue, 22 Jan 2008 11:25:50 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <4793F5AC.3030905@colorstudy.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> Message-ID: <4795D2BE.3000805@100blossoms.com> Ian Bicking wrote: > But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and > having the information in duplicate places can lead to errors and > unclear situations if they don't match up properly. True, and you can apply the same reasoning to my suggestion too. Apart from the duplication of information, there's how or where to do the actual decoding. Not everyone is dispatching to a CherryPy-style tree of objects, so putting a %-decoded list of path segments in a environ key doesn't work -- I knew it was a bad idea! I'm going with CherryPy's on this: don't decode "%2F". Should other characters be kept encoded? Also, this crystallizes my thoughts on the matter: %-decoding is the applications' job. Or frameworks'. *Not* the servers'. > Luis Bruno wrote: >> I was not amused to see egg:Paste#http urldecoding the whole PATH_INFO. > Unfortunately this is in the WSGI spec, so it's not Paste#http so much > as WSGI that demands this. Cite? I skimmed PEP 333 before grumbling and I've just re-read it; didn't find it, unless you're referring to the code in "URL Reconstruction" section.
If you're referring[*] to the CGI 1.1 draft linked in "environ Variables", I think it supports my position that unquoting(PATH_INFO) was not the correct thing to do. [*] I'm not sure how to spell that. > I made note of this issue on the WSGI 2.0 ideas page Didn't find it here: . Should I look elsewhere? > [/Laptops/LN500%2F9DW/ ] would be the Right Thing, except for not > being WSGI. Looks to me like a good candidate for an amendment. What's the next step? -- Luís Bruno From sven at berkvens.net Tue Jan 22 12:47:47 2008 From: sven at berkvens.net (Sven Berkvens-Matthijsse) Date: Tue, 22 Jan 2008 12:47:47 +0100 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <4795D2BE.3000805@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> Message-ID: <20080122114747.GA31507@berkvens.net> Luís Bruno wrote: > Ian Bicking wrote: > > But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and > > having the information in duplicate places can lead to errors and > > unclear situations if they don't match up properly. > > True, and you can apply the same reasoning to my suggestion too. > > Apart from the duplication of information, there's how or where to > do the actual decoding. Not everyone is dispatching to a > CherryPy-style tree of objects, so putting a %-decoded list of path > segments in a environ key doesn't work -- I knew it was a bad idea! > I'm going with CherryPy's on this: don't decode "%2F". Should other > characters be kept encoded? Yes, in my opinion all encoded characters should remain encoded. Otherwise, a path like /whatever/some%252Fthing/blah/ would become (after decoding): /whatever/some%2Fthing/blah/ which is certainly not what you'd want and/or expect. > Also, this crystallizes my thoughts on the matter: %-decoding is the > applications' job. Or frameworks'. *Not* the servers'.
I absolutely agree on this. The application is the only entity that knows how to interpret the (remainder of the) URI properly. > -- > Luís Bruno -- het internet begint bij ilse tel: 040 219 32 00 Sven Berkvens-Matthijsse fax: 040 219 32 99 sven at ilse.net url: http://ilse.nl/ From lbruno at 100blossoms.com Tue Jan 22 17:29:09 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Tue, 22 Jan 2008 16:29:09 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <9A6BE623-5CC5-4F82-AAEE-D5C40BA9F54B@fuhm.net> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <20080122114747.GA31507@berkvens.net> <9A6BE623-5CC5-4F82-AAEE-D5C40BA9F54B@fuhm.net> Message-ID: <479619D5.6020104@100blossoms.com> James Y Knight wrote: > FWIW, I think the right thing for a server to do is to reject any URLs > going to a wsgi (or cgi) script with a %2F in it. I believe this is > what apache's CGI host does. You'd reject the following URL? http://localhost:5000/catalog/NEC/Laptops/LN500%2F9DW/ BTW, I make a beautiful breadcrumb trail out of that: Home > Catalog > NEC > Laptops > *LN500/9DW* > BTW, for extra fun, you should be considering ";" too. True. The urlparse/urlsplit docs mention ';' but I don't understand where/how it's used.
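On the ';' question, a small sketch of how the standard library treats it, written against Python 3's urllib.parse (the successor of the urlparse module mentioned above). The URLs are made up for illustration. urlparse() separates ";parameters" from the last path segment only, while urlsplit() leaves any ';' inside the path.

```python
from urllib.parse import urlparse, urlsplit

url = "http://host/catalog/item;v=1?q=2"

parsed = urlparse(url)   # six-part result: includes a separate "params" field
split = urlsplit(url)    # five-part result: no "params" field

print(parsed.path, parsed.params, parsed.query)
print(split.path, split.query)

# Parameters on a non-final segment are not separated out by urlparse().
inner = urlparse("http://host/a;x=1/b")
print(inner.path, repr(inner.params))
```

So a ';' is one more character that has structural meaning only while the path is still escaped, which appears to be the point of the "extra fun" remark.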
From foom at fuhm.net Tue Jan 22 17:02:22 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 22 Jan 2008 11:02:22 -0500 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <20080122114747.GA31507@berkvens.net> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <20080122114747.GA31507@berkvens.net> Message-ID: <9A6BE623-5CC5-4F82-AAEE-D5C40BA9F54B@fuhm.net> On Jan 22, 2008, at 6:47 AM, Sven Berkvens-Matthijsse wrote: > Luís Bruno wrote: >> Ian Bicking wrote: >>> But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and >>> having the information in duplicate places can lead to errors and >>> unclear situations if they don't match up properly. >> >> True, and you can apply the same reasoning to my suggestion too. >> >> Apart from the duplication of information, there's how or where to >> do the actual decoding. Not everyone is dispatching to a >> CherryPy-style tree of objects, so putting a %-decoded list of path >> segments in a environ key doesn't work -- I knew it was a bad idea! >> I'm going with CherryPy's on this: don't decode "%2F". Should other >> characters be kept encoded? > > Yes, in my opinion all encoded characters should remain encoded. > Otherwise, a path like /whatever/some%252Fthing/blah/ would become > (after decoding): /whatever/some%2Fthing/blah/ which is certainly not > what you'd want and/or expect. Your opinion is irrelevant, this is specified by the CGI spec. Yes, agreed, it's not the best spec ever, but there's nothing you can do about that. FWIW, I think the right thing for a server to do is to reject any URLs going to a wsgi (or cgi) script with a %2F in it. I believe this is what apache's CGI host does. BTW, for extra fun, you should be considering ";" too.
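The %252F example quoted above can be checked against a "decode everything except %2F" strategy of the kind CherryPy's snippet earlier in the thread uses. This is a sketch only: the helper name and the case-insensitive pattern are assumptions, not CherryPy's actual code, and it uses Python 3's urllib.parse.

```python
import re
from urllib.parse import unquote

# Assumed pattern: an encoded slash in either letter case.
QUOTED_SLASH = re.compile(r"%2F", re.IGNORECASE)

def decode_keep_slash(path):
    """Percent-decode a path but leave encoded slashes as the text %2F."""
    atoms = [unquote(atom) for atom in QUOTED_SLASH.split(path)]
    return "%2F".join(atoms)

# One request carries an encoded slash, the other an encoded literal "%2F".
from_encoded_slash = decode_keep_slash("/whatever/some%2Fthing/blah/")
from_encoded_percent = decode_keep_slash("/whatever/some%252Fthing/blah/")

print(from_encoded_slash)
print(from_encoded_percent)
```

Both requests produce the same string, so code reading the result cannot tell an undecoded slash from a literal "%2F" in the data.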
James From brian at briansmith.org Tue Jan 22 17:44:43 2008 From: brian at briansmith.org (Brian Smith) Date: Tue, 22 Jan 2008 08:44:43 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <4795D2BE.3000805@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com><4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> Message-ID: <001801c85d16$1a3abd60$0501a8c0@T60> Luis Bruno wrote: > Ian Bicking wrote: > > But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and > > having the information in duplicate places can lead to errors and > > unclear situations if they don't match up properly. I don't understand this argument. WSGI gateways just need to parse the request URL correctly, and then everything *will* match up correctly, AFAICT. Providing an undecoded REQUEST_URI that an application can parse on its own is much better than what CherryPy is doing, and it is useful for other reasons as well. > I'm going with CherryPy's on this: don't decode "%2F". CherryPy is not implementing the WSGI 1.0 specification correctly. And, CherryPy's behavior here is harmful, because applications have no way of knowing whether "%2F" is an un-decoded slash, or a literal "%2F". > > Luis Bruno wrote: > >> I was not amused to see egg:Paste#http urldecoding the > >> whole PATH_INFO. > > Unfortunately this is in the WSGI spec, so it's not > > Paste#http so much as WSGI that demands this. > > I skimmed PEP 333 before grumbling and I've just re-read it; > didn't find it, unless you're referring to the code in "URL > Reconstruction" section. > If you're referring[*] to the CGI 1.1 draft linked in "environ > Variables", I think it supports my position that unquoting(PATH_INFO) > was not the correct thing to do. 
PEP 333 defers the definition of PATH_INFO to the CGI specification: "The environ dictionary is required to contain these CGI environment variables, as defined by the Common Gateway Interface specification [2]". That version of the CGI specification clearly expects PATH_INFO to be decoded. Section 3.2 says "'enc-path-info' is a URL-encoded version of PATH_INFO". The implication is that PATH_INFO is *not* URL-encoded. Section 6.1.6 is more explicit, saying: "The syntax and semantics are similar to a decoded HTTP URL 'path' token (defined in RFC 2396 [4]), with the exception that a PATH_INFO of "/" represents a single void path segment." Furthermore, the URL reconstruction section and the CGI WSGI gateway both also imply that PATH_INFO has already been decoded. > > [/Laptops/LN500%2F9DW/ ] would be the Right Thing, except for not > > being WSGI. > Looks to me like a good candidate for an amendment. > > What's the next step? Something so fundamental as this cannot be changed with a simple amendment to the existing specification. Such a change would break currently-conforming gateways and applications. An amendment that recommends, but does not require, REQUEST_URI is a much better option. - Brian From lbruno at 100blossoms.com Tue Jan 22 19:02:19 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Tue, 22 Jan 2008 18:02:19 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <001801c85d16$1a3abd60$0501a8c0@T60> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com><4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> Message-ID: <47962FAB.3060108@100blossoms.com> Brian Smith wrote: > An amendment that recommends, but does not require, REQUEST_URI is a > much better option. Thereby forcing me to shop around for a WSGI server that actually puts the recommendation into practice? Because I want to keep my %-encoded characters?
Which I encoded for, you know, escaping them from the usual processing? Smells of mistake. This sub-thread starts with me putting an ORIGINAL_PATH_INFO into the environ, which the dispatch code doesn't touch. This forces me to strip the app mount points, reinventing Paste#urlmap. Should REQUEST_URI be touched by dispatch code? If so, PATH_INFO has no use. If not, the duplication Ian Bicking mentioned comes into play. > That version of the CGI specification clearly expects PATH_INFO to be decoded. I agree; I think you should refer to the top of page 14 in RFC 3875, instead of to the 1999 draft. The draft didn't outright forbid multiple path-segments like the RFC does, but was ambiguous enough (your quote): > Section 6.1.6 is more explicit, saying: "The syntax and semantics are > similar to a decoded HTTP URL 'path' token (defined in RFC 2396 [4]) > Don't forget to read the %-decoding rules in RFC 2396's section 2.4.2 if you're going to quote "decoded HTTP URL 'path' token". Fortunately, the URI spec doesn't repeat the mistake of forbidding %-encoding characters. It does mention that each path-segment should be separately %-decoded, going against the CGI spec which actually forbids multiple segments *in PATH_INFO*. That smells of mistake. Faced with the choice between those specs, I'd prefer not to lose information for mindless compliance with CGI. 
-- Luís Bruno From brian at briansmith.org Tue Jan 22 19:34:24 2008 From: brian at briansmith.org (Brian Smith) Date: Tue, 22 Jan 2008 10:34:24 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47962FAB.3060108@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com><4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com><001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> Message-ID: <000901c85d25$6d4460b0$0501a8c0@T60> Luis Bruno wrote: > Brian Smith wrote: > > An amendment that recommends, but does not require, > > REQUEST_URI is a much better option. > > Thereby forcing me to shop around for a WSGI server that > actually puts the recommendation into practice? Because I > want to keep my %-encoded characters? Which I encoded for, > you know, escaping them from the usual processing? Smells of > mistake. You already have to shop around for a WSGI server that can distinguish between encoded and unencoded slashes in PATH_INFO, because the WSGI specification doesn't require the WSGI gateway to distinguish between them. I agree that the WSGI 1.0 specification is not good in this regard. However, because an application cannot detect whether PATH_INFO has been decoded or not, the only reasonable thing that it can do is to assume that the gateway and middleware are following the WSGI specification. The corollary is that applications shouldn't rely on being able to distinguish between "%2F" and "/" based on PATH_INFO if it wants to be portable. If you really want PATH_INFO to have "%2F" instead of "/", then I suggest encoding the slashes as "%252F" or "$2F" or something else. Then your application will be portable. > This sub-thread starts with me putting an ORIGINAL_PATH_INFO > into the environ, which the dispatch code doesn't touch. This > forces me to strip the app mount points, reinventing > Paste#urlmap.
Should REQUEST_URI be touched by dispatch code? > If so, PATH_INFO has no use. If not, the duplication Ian > Bicking mentioned comes into play. By definition, the Request URI doesn't change during a request. So, REQUEST_URI shouldn't be fiddled with by dispatching code, unlike SCRIPT_NAME and PATH_INFO. Usually, the dispatching code is just shifting segments of PATH_INFO into SCRIPT_NAME, but SCRIPT_NAME joined with PATH_INFO and the QUERY_STRING is always constant. So, the problems with ORIGINAL_PATH_INFO don't apply to REQUEST_URI. > > That version of the CGI specification clearly expects > > PATH_INFO to be decoded. > > I agree; I think you should refer to the top of page 14 in > RFC 3875, instead of to the 1999 draft. The draft didn't > outright forbid multiple path-segments like the RFC does, but > was ambiguous enough (your quote): PEP 333 defers the definition of PATH_INFO to the 1999 draft, not to RFC 3875. So, it doesn't matter what RFC 3875 says. > Fortunately, the URI spec doesn't repeat the mistake of > forbidding %-encoding characters. It does mention that each > path-segment should be separately %-decoded, going against > the CGI spec which actually forbids multiple segments *in > PATH_INFO*. That smells of mistake. Faced with the choice > between those specs, I'd prefer not to lose information for > mindless compliance with CGI. I don't care about CGI compatibility. I do depend on WSGI gateways being compliant with the WSGI specification.
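A short sketch of the invariant described above, using wsgiref.util.shift_path_info from the standard library as the dispatching step. The environ values are made up for illustration, and REQUEST_URI is a non-standard key that some servers happen to provide; PEP 333 does not define it.

```python
from wsgiref.util import shift_path_info

# Hypothetical environ as a gateway might build it: PATH_INFO is decoded,
# REQUEST_URI (non-standard) keeps the escaped form sent by the client.
environ = {
    "SCRIPT_NAME": "",
    "PATH_INFO": "/catalog/NEC/LN500/9DW",
    "REQUEST_URI": "/catalog/NEC/LN500%2F9DW",
}

def full_path(env):
    """SCRIPT_NAME joined with PATH_INFO."""
    return env["SCRIPT_NAME"] + env["PATH_INFO"]

before = full_path(environ)
first = shift_path_info(environ)    # moves "catalog" into SCRIPT_NAME
second = shift_path_info(environ)   # moves "NEC" into SCRIPT_NAME
after = full_path(environ)

print(first, second)
print(environ["SCRIPT_NAME"], environ["PATH_INFO"])
print(before == after, environ["REQUEST_URI"])
```

Dispatching only moves segments between the two standard keys, so their concatenation stays fixed, and the escaped form survives untouched in the extra key for any application that needs it.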
- Brian From lbruno at 100blossoms.com Tue Jan 22 20:04:38 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Tue, 22 Jan 2008 19:04:38 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <000901c85d25$6d4460b0$0501a8c0@T60> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com><4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com><001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <000901c85d25$6d4460b0$0501a8c0@T60> Message-ID: <47963E46.1080409@100blossoms.com> Brian Smith wrote: > If you really want PATH_INFO to have "%2F" instead of "/", then I > suggest encoding the slashes as "%252F" or "$2F" or something else. > Then your application will be portable. I need those '/'. They are the canonical hierarchical delimiters. They are also present in some model names. So yeah, "$2F" might work. I was originally using "!" which isn't used in any model name on my catalog. Please don't read acquiescence into the previous phrase; thinking of escaping escape-chars reeks of stupidity: I can't show this off to my programmer boss, and expect him to quietly accept my judgment without serious amount of explanation. > PEP 333 defers the definition of PATH_INFO to the 1999 draft > True. Please keep in mind that the CGI draft also references the URI syntax spec, which I'll read as supporting my position. > I do depend on WSGI gateways being compliant with the WSGI specification. > We all do, which is why I'm here wasting electrons and everyone's time. 
Thank you, -- Luís Bruno From foom at fuhm.net Tue Jan 22 20:22:07 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 22 Jan 2008 14:22:07 -0500 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47962FAB.3060108@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com><4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> Message-ID: <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> On Jan 22, 2008, at 1:02 PM, Luis Bruno wrote: > > Fortunately, the URI spec doesn't repeat the mistake of forbidding > %-encoding characters. It does mention that each path-segment should > be > separately %-decoded, going against the CGI spec which actually > forbids > multiple segments *in PATH_INFO*. That smells of mistake. Faced with > the > choice between those specs, I'd prefer not to lose information for > mindless compliance with CGI. > Where does the CGI spec forbid multiple segments in PATH_INFO? It doesn't. It actually says that PATH_INFO is made by joining each decoded path-segment with a /. And as far as I know /every/ extant implementation does this. And the high quality ones forbid a / from appearing in the decoded segment (aka, from a %2F in the original url), in order to avoid security issues. So I'm not sure what this thread is about. You can argue that the CGI spec has a bug in it, but it's not like this is a new issue or something, and it's shared by every system based on CGI. (PHP for example has the same issue). Besides, the workaround is quite simple: don't use %2F characters in your urls.
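A minimal sketch of the gateway behaviour described above: PATH_INFO is built by decoding each path segment and joining the results with "/", and a request is refused when a decoded segment would contain a "/". The function and exception names are hypothetical, not taken from any particular server, and the code uses Python 3's urllib.parse.

```python
from urllib.parse import unquote

class EncodedSlashError(ValueError):
    """Hypothetical error for a request path that contains an encoded '/'."""

def build_path_info(raw_path):
    """Decode each segment of an escaped path and rejoin them with '/'."""
    decoded_segments = []
    for raw_segment in raw_path.split("/"):
        decoded = unquote(raw_segment)
        if "/" in decoded:
            # Joining would turn this one segment into two: refuse instead.
            raise EncodedSlashError(raw_path)
        decoded_segments.append(decoded)
    return "/".join(decoded_segments)

print(build_path_info("/catalog/NEC/Big%20Laptops/"))

try:
    build_path_info("/catalog/NEC/Laptops/LN500%2F9DW/")
    rejected = False
except EncodedSlashError:
    rejected = True
print(rejected)
```

A real gateway would presumably turn the exception into an HTTP error response; which status code to use is not something the thread specifies.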
James From lbruno at 100blossoms.com Tue Jan 22 23:33:57 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Tue, 22 Jan 2008 22:33:57 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> Message-ID: <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> Ian Bicking pointed at CGI 1.1 saying: "See? The WSGI spec tells me to do this!" And he's right. This sub-thread is about *me* thinking the *WSGI spec* should be *fixed*. James Y Knight wrote: > Where does the CGI spec forbid multiple segments in PATH_INFO? > It doesn't. It actually says that PATH_INFO is made by joining each > decoded path-segment with a /. My fault. I misread this: The server MAY reject the request with an error if it encounters any values considered objectionable. That MAY include any requests that would result in an encoded "/" being decoded into PATH_INFO, as this might represent a loss of information to the script. Still, my problem is that "loss of information"; I no longer know which '/' were %-encoded. > And as far as I know /every/ extant implementation does this. As does Paste#http. My fault for not reading correctly. > Besides, the workaround is quite simple: don't use %2F characters in your urls. Should I use $2F? I already *have* an escaping mechanism... which I'm using for spaces, BTW. Why can't I use it for slashes? I came to web-sig@ to fix the spec, not to find a workaround. I already *have* a workaround: it starts with me monkeying around Paste#http and rolling my own dispatcher. Not too bright though, as I could have slapped a $2F in there for a quick workaround (thank you Brian). 
A quick sanity check here: I think http://host/catalog/some%2Fthing/shallow/ is *meant* to have two nested levels: "some/thing" and "shallow". Is it obvious to you to interpret the URL as having three nested levels "some", "thing" and "shallow"? I ask because the first choice is very obvious to me; I'm treating the second one (current behaviour) as a bug to be fixed. Anyone else thinks it's a bug in WSGI too? -- Luis Bruno From foom at fuhm.net Wed Jan 23 00:21:59 2008 From: foom at fuhm.net (James Y Knight) Date: Tue, 22 Jan 2008 18:21:59 -0500 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> Message-ID: On Jan 22, 2008, at 5:33 PM, Luis Bruno wrote: > A quick sanity check here: I think > http://host/catalog/some%2Fthing/shallow/ is *meant* to have two > nested levels: "some/thing" and "shallow". Is it obvious to you to > interpret the URL as having three nested levels "some", "thing" and > "shallow"? I ask because the first choice is very obvious to me; I'm > treating the second one (current behaviour) as a bug to be fixed. You're right, it certainly shouldn't be interpreted as the same URL as some/thing/shallow. That is most likely an avenue for a security exploit if your server does so, and the server should likely be fixed. However, as there is simply no way to represent "some%2Fthing/shallow/" with PATH_INFO, as specified in the CGI spec, the only alternative is to reject the request. This is what the major servers do today. > Anyone else thinks it's a bug in WSGI too? WSGI is based upon CGI and inherits this behavior.
I suppose a WSGI-specific fix could be done. However, there are good reasons for inheriting behavior from CGI, most importantly, ease of integration. Servers already implement this behavior for CGI, SCGI, FastCGI, PHP, and now, WSGI. None of the previous have seen it as important enough an issue to change this behavior, and neither do I think it important enough for WSGI. So, no, I don't consider it a bug in WSGI. You could call it a bug in CGI if you like. Good luck getting it changed. James From fumanchu at aminus.org Wed Jan 23 18:15:58 2008 From: fumanchu at aminus.org (Robert Brewer) Date: Wed, 23 Jan 2008 09:15:58 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org><47920B51.3050201@100blossoms.com><4793F5AC.3030905@colorstudy.com><4795D2BE.3000805@100blossoms.com><001801c85d16$1a3abd60$0501a8c0@T60><47962FAB.3060108@100blossoms.com><83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net><7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> Message-ID: James Y Knight wrote: > ...as there is simply no way to represent "some%2Fthing/ > shallow/" with PATH_INFO, as specified in the CGI spec, the only > alternative is to reject the request. This is what the major servers > do today. > > > Anyone else thinks it's a bug in WSGI too? > > WSGI is based upon CGI and inherits this behavior. I suppose a WSGI- > specific fix could be done. However, there are good reasons for > inheriting behavior from CGI, most importantly, ease of integration. > Servers already implement this behavior for CGI SCGI FastCGI PHP, and > now, WSGI. None of the previous have seen it as important enough an > issue to change this behavior, and neither do I think it important > enough for WSGI. > > So, no, I don't consider it a bug in WSGI. You could call it a bug in > CGI if you like. Good luck getting it changed.
I consider it a bug in both, and the difficulty level of changing the CGI behavior really has no bearing on our decision to do better with WSGI. I think it's important that we allow the full range of URI's to be accepted. If you go and stick Apache in front of your WSGI app, it will still 404, sure; but that's your choice to use Apache or not. There's no sense making WSGI a least common denominator, inheriting all the limitations of all the existing web servers. Robert Brewer fumanchu at aminus.org From manlio_perillo at libero.it Wed Jan 23 19:24:34 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Wed, 23 Jan 2008 19:24:34 +0100 Subject: [Web-SIG] code copyright for the spec on wsgi.org Message-ID: <47978662.4040601@libero.it> Hi. I know that this is a "legal detail", but I don't see any copyright notices in the specs at http://wsgi.org/wsgi/Specifications/ Maybe they should be added. As an example, I would like to use the code in the example for the routing_args specification. Thanks Manlio Perillo From pje at telecommunity.com Wed Jan 23 19:18:38 2008 From: pje at telecommunity.com (Phillip J. Eby) Date: Wed, 23 Jan 2008 13:18:38 -0500 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> Message-ID: <20080123182555.68D723A40A2@sparrow.telecommunity.com> At 09:15 AM 1/23/2008 -0800, Robert Brewer wrote: >I consider it a bug in both, and the difficulty level of changing the >CGI behavior really has no bearing on our decision to do better with >WSGI. I think it's important that we allow the full range of URI's to be >accepted. 
If you go and stick Apache in front of your WSGI app, it will >still 404, sure; but that's your choice to use Apache or not. There's no >sense making WSGI a least common denominator, inheriting all the >limitations of all the existing web servers. Uh, actually, that's sort of the whole point of WSGI - to allow portable applications. If the spec allows you to do something in theory that's almost never allowed in practice, that's not very helpful. I don't consider WSGI's CGI compatibility on this point to be an error, in other words. An application that expects to receive encoded URLs is going to be *very* limited in its deployment choices, and needs to find its own way of dealing with this. MoinMoin, for example, has its own encoding scheme for handling pseudo-slashes in paths, and IMO it's a better way to handle it than trying to rely on finding a server that supports *not* decoding URLs. From brian at briansmith.org Wed Jan 23 19:40:09 2008 From: brian at briansmith.org (Brian Smith) Date: Wed, 23 Jan 2008 10:40:09 -0800 Subject: [Web-SIG] code copyright for the spec on wsgi.org In-Reply-To: <47978662.4040601@libero.it> References: <47978662.4040601@libero.it> Message-ID: <000601c85def$64e81f40$0501a8c0@T60> Manlio Perillo wrote: > As an example, I would like to use the code in the example for the > routing_args specification. Manlio, Are you planning to implement the routing_args specification directly in NGinx's mod_wsgi? I think doing so is a really bad idea--routing_args should be set and manipulated by dispatching middleware only. If NGinx's mod_wsgi sets the wsgiorg.routing_args, but dispatching middleware layered on top of it does not update it, then the application will end up being misinformed.
- Brian From manlio_perillo at libero.it Wed Jan 23 20:49:39 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Wed, 23 Jan 2008 20:49:39 +0100 Subject: [Web-SIG] code copyright for the spec on wsgi.org In-Reply-To: <000601c85def$64e81f40$0501a8c0@T60> References: <47978662.4040601@libero.it> <000601c85def$64e81f40$0501a8c0@T60> Message-ID: <47979A53.2020101@libero.it> Brian Smith ha scritto: > Manlio Perillo wrote: >> As an example, I would like to use the code in the example for the >> routing_args specification. > > Manlio, > > Are you planning to implement the routing_args specification directly in > NGinx's mod_wsgi? No, of course! I'm writing a (yet another) mini framework based on WSGI (but without adding high level interfaces like request/response objects), that will take advantage of special features of the WSGI implementation for Nginx, if available. My plan is to keep mod_wsgi for nginx as simple as possible. > I think doing so is a really bad idea--routing_args > should be set and manipulated by dispatching middleware only. By the way, you are saying that the dispatching can be put in a middleware, but is this really true? It seems, to me, that the url dispatching should be done in the WSGI application "entry point". > [....] Manlio Perillo From ianb at colorstudy.com Thu Jan 24 07:12:27 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 00:12:27 -0600 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <4795D2BE.3000805@100blossoms.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> Message-ID: <47982C4B.3080707@colorstudy.com> Luis Bruno wrote: >> I made note of this issue on the WSGI 2.0 ideas page > Didn't find it here: . Should I look > elsewhere? I thought I had added it there, but wrote that when I was offline and couldn't check. 
I added a section about it (a very brief section, though; probably a link to this thread would be helpful). Ian From ianb at colorstudy.com Thu Jan 24 07:22:06 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 00:22:06 -0600 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <20080123182555.68D723A40A2@sparrow.telecommunity.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> <20080123182555.68D723A40A2@sparrow.telecommunity.com> Message-ID: <47982E8E.6000209@colorstudy.com> Phillip J. Eby wrote: > At 09:15 AM 1/23/2008 -0800, Robert Brewer wrote: >> I consider it a bug in both, and the difficulty level of changing the >> CGI behavior really has no bearing on our decision to do better with >> WSGI. I think it's important that we allow the full range of URI's to be >> accepted. If you go and stick Apache in front of your WSGI app, it will >> still 404, sure; but that's your choice to use Apache or not. There's no >> sense making WSGI a least common denominator, inheriting all the >> limitations of all the existing web servers. > > Uh, actually, that's sort of the whole point of WSGI - to allow > portable applications. If the spec allows you to do something in > theory that's almost never allowed in practice, that's not very helpful. It could probably work in a good number of implementations, but because some gateways could lose or reject the encoding, the deployment becomes kind of fragile. Of course you could argue the same thing about SCRIPT_NAME -- it's constantly getting lost and makes deployments seem fragile at times. 
But in contrast to this issue, it's actually quite useful; distinguishing %2f and / is more of a corner case. > MoinMoin, for example, has its own encoding scheme for handling > pseudo-slashes in paths, and IMO it's a better way to handle it than > trying to rely on finding a server that supports *not* decoding URLs. We encountered it with GData too, as it uses URLs like /{http:%2f%2fexample.com}term/. But if you balance the {}'s you can parse it out. Ian From ianb at colorstudy.com Thu Jan 24 07:24:56 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 00:24:56 -0600 Subject: [Web-SIG] code copyright for the spec on wsgi.org In-Reply-To: <47978662.4040601@libero.it> References: <47978662.4040601@libero.it> Message-ID: <47982F38.6080207@colorstudy.com> Manlio Perillo wrote: > I know that this is a "legal detail", but I don't see any copyright > notices in the specs at http://wsgi.org/wsgi/Specifications/ > > > Maybe they should be added. I added a general note that everything there is public domain, as I think that fits the goals of that section. I wrote the routing_args code specifically, and consider it public domain. Ian From manlio_perillo at libero.it Thu Jan 24 15:22:21 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 24 Jan 2008 15:22:21 +0100 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME Message-ID: <47989F1D.7080802@libero.it> Hi. I have implemented the wsgiorg.routing_args specification, using the code in the example. However I have a problem, and I can't see a good solution. Suppose that an application is mounted (embedded in a web server) at location "/example". The application script executed by the server simply sets up the routing, database connections and so on. Let's suppose that the request uri is "/example/login/". For the main application, SCRIPT_NAME is "/example". For the application at "/login", SCRIPT_NAME is "/example/login".
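How SCRIPT_NAME grows during dispatch can be sketched with the standard library's `wsgiref.util.shift_path_info`. This is an assumption made for illustration; the routing_args example code may move path segments in a different way, and the environ values below are made up to match the scenario.

```python
from wsgiref.util import shift_path_info

# Environ as the application mounted at "/example" might see it
# for a request to "/example/login/" (illustrative values only).
environ = {"SCRIPT_NAME": "/example", "PATH_INFO": "/login/"}

# Dispatching moves one segment from PATH_INFO onto SCRIPT_NAME.
consumed = shift_path_info(environ)

print(consumed)                 # the segment that was consumed
print(environ["SCRIPT_NAME"])   # what the "login" application now sees
print(environ["PATH_INFO"])     # what is left to route
```

After the shift, the value that SCRIPT_NAME had for the main application is no longer present anywhere in the environ unless something stored it beforehand.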
My problem is that I want, in the page generated by the "login" application, to return an anchor in the form "/example/logout/". The usual solution is to do environ['SCRIPT_NAME'] + '/logout', but this will return "/example/login/logout/", and not "/example/logout/". This does not seem to be possible with the current specifications, since the "original" SCRIPT_NAME is lost. What is the best solution? 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a list. This means that the request uri reconstruction must be changed: SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) 2) Store a wsgiorg.original_script_name, with the value seen by the routing application. 3) Simply don't change SCRIPT_NAME and PATH_INFO. However I usually need the updated PATH_INFO. Thanks Manlio Perillo From pje at telecommunity.com Thu Jan 24 15:44:28 2008 From: pje at telecommunity.com (Phillip J. Eby) Date: Thu, 24 Jan 2008 09:44:28 -0500 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <47989F1D.7080802@libero.it> References: <47989F1D.7080802@libero.it> Message-ID: <20080124144429.E8C353A40AF@sparrow.telecommunity.com> At 03:22 PM 1/24/2008 +0100, Manlio Perillo wrote: >Let's suppose that the request uri is "/example/login/". > >For the main application, SCRIPT_NAME is "/example". >For the application at "/login", SCRIPT_NAME is "/example/login". > >My problem is that I want, in the page generated by "login" application, >return an anchor in the form "/example/logout/". > >The usual solution is to do environ['SCRIPT_NAME'] + '/logout', but this >will return "/example/login/logout/", and not "/example/logout/". > >This seems to be not possible with the current specifications, since the >"original" SCRIPT_NAME is lost. > >What is the best solution? > >1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a > list.
> > This means that the request uri recostruction must be changed: > SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) > >2) Store a wsgiorg.original_script_name, with the value seen by the > routing application. > >3) Simply don't change SCRIPT_NAME and PATH_INFO. > However I usually need the updated PATH_INFO. 4) Use a relative link, with href="logout". From manlio_perillo at libero.it Thu Jan 24 15:50:32 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 24 Jan 2008 15:50:32 +0100 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <20080124144429.E8C353A40AF@sparrow.telecommunity.com> References: <47989F1D.7080802@libero.it> <20080124144429.E8C353A40AF@sparrow.telecommunity.com> Message-ID: <4798A5B8.10507@libero.it> Phillip J. Eby ha scritto: > At 03:22 PM 1/24/2008 +0100, Manlio Perillo wrote: >> Let's suppose that the request uri is "/example/login/". >> >> For the main application, SCRIPT_NAME is "/example". >> For the application at "/login", SCRIPT_NAME is "/example/login". >> >> My problem is that I want, in the page generated by "login" application, >> return an anchor in the form "/example/logout/". >> >> The usual solution is to do environ['SCRIPT_NAME'] + '/logout', but this >> will return "/example/login/logout/", and not "/example/logout/". >> >> This seems to be not possible with the current specifications, since the >> "original" SCRIPT_NAME is lost. >> >> What is the best solution? >> >> 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a >> list. >> >> This means that the request uri recostruction must be changed: >> SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) >> >> 2) Store a wsgiorg.original_script_name, with the value seen by the >> routing application. >> >> 3) Simply don't change SCRIPT_NAME and PATH_INFO. >> However I usually need the updated PATH_INFO. > > 4) Use a relative link, with href="logout". 
> But since the base url is "/example/login/", this relative link is resolved to "/example/login/logout/". Manlio Perillo From sven at berkvens.net Thu Jan 24 15:53:06 2008 From: sven at berkvens.net (Sven Berkvens-Matthijsse) Date: Thu, 24 Jan 2008 15:53:06 +0100 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <4798A5B8.10507@libero.it> References: <47989F1D.7080802@libero.it> <20080124144429.E8C353A40AF@sparrow.telecommunity.com> <4798A5B8.10507@libero.it> Message-ID: <20080124145306.GA84549@berkvens.net> > Phillip J. Eby ha scritto: > > At 03:22 PM 1/24/2008 +0100, Manlio Perillo wrote: > >> Let's suppose that the request uri is "/example/login/". > >> > >> For the main application, SCRIPT_NAME is "/example". > >> For the application at "/login", SCRIPT_NAME is "/example/login". > >> > >> My problem is that I want, in the page generated by "login" application, > >> return an anchor in the form "/example/logout/". > >> > >> The usual solution is to do environ['SCRIPT_NAME'] + '/logout', but this > >> will return "/example/login/logout/", and not "/example/logout/". > >> > >> This seems to be not possible with the current specifications, since the > >> "original" SCRIPT_NAME is lost. > >> > >> What is the best solution? > >> > >> 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a > >> list. > >> > >> This means that the request uri recostruction must be changed: > >> SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) > >> > >> 2) Store a wsgiorg.original_script_name, with the value seen by the > >> routing application. > >> > >> 3) Simply don't change SCRIPT_NAME and PATH_INFO. > >> However I usually need the updated PATH_INFO. > > > > 4) Use a relative link, with href="logout". > > > > But since the base url is "/example/login/", this relative link is > resolved to "/example/login/logout/". In that case, use href="../logout/". 
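The two relative links discussed above can be checked with `urllib.parse.urljoin`, which follows the standard relative-reference resolution rules. A small sketch; the host name is a placeholder and not something from the thread.

```python
from urllib.parse import urljoin

# Base URL of the page generated by the "login" application
# ("host" is a hypothetical placeholder).
base = "http://host/example/login/"

# href="logout" resolves below the login page.
plain = urljoin(base, "logout")

# href="../logout/" climbs one level first.
parent = urljoin(base, "../logout/")

print(plain)
print(parent)
```

The second form only works as long as the page knows how many levels deep it is mounted, which is why it is treated as a workaround in the replies.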
> Manlio Perillo -- Sven From manlio_perillo at libero.it Thu Jan 24 15:58:19 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 24 Jan 2008 15:58:19 +0100 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <20080124145306.GA84549@berkvens.net> References: <47989F1D.7080802@libero.it> <20080124144429.E8C353A40AF@sparrow.telecommunity.com> <4798A5B8.10507@libero.it> <20080124145306.GA84549@berkvens.net> Message-ID: <4798A78B.8040700@libero.it> Sven Berkvens-Matthijsse ha scritto: >> Phillip J. Eby ha scritto: >>> At 03:22 PM 1/24/2008 +0100, Manlio Perillo wrote: >>>> Let's suppose that the request uri is "/example/login/". >>>> >>>> For the main application, SCRIPT_NAME is "/example". >>>> For the application at "/login", SCRIPT_NAME is "/example/login". >>>> >>>> My problem is that I want, in the page generated by "login" application, >>>> return an anchor in the form "/example/logout/". >>>> >>>> The usual solution is to do environ['SCRIPT_NAME'] + '/logout', but this >>>> will return "/example/login/logout/", and not "/example/logout/". >>>> >>>> This seems to be not possible with the current specifications, since the >>>> "original" SCRIPT_NAME is lost. >>>> >>>> What is the best solution? >>>> >>>> 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a >>>> list. >>>> >>>> This means that the request uri recostruction must be changed: >>>> SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) >>>> >>>> 2) Store a wsgiorg.original_script_name, with the value seen by the >>>> routing application. >>>> >>>> 3) Simply don't change SCRIPT_NAME and PATH_INFO. >>>> However I usually need the updated PATH_INFO. >>> 4) Use a relative link, with href="logout". >>> >> But since the base url is "/example/login/", this relative link is >> resolved to "/example/login/logout/". > > In that case, use href="../logout/". > I would not call this a solution! It's only a workaround. 
Manlio Perillo From brian at briansmith.org Thu Jan 24 19:04:16 2008 From: brian at briansmith.org (Brian Smith) Date: Thu, 24 Jan 2008 10:04:16 -0800 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47982E8E.6000209@colorstudy.com> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> <20080123182555.68D723A40A2@sparrow.telecommunity.com> <47982E8E.6000209@colorstudy.com> Message-ID: <003701c85eb3$8c62c7e0$0401a8c0@T60> Ian Bicking wrote: > We encountered it with GData too, as it uses URLs like > /{http:%2f%2fexample.com}term/. But if you balance the {}'s > you can parse it out. Unquoted curly braces are illegal in any kind of URI or IRI. Does GData really require them to be unquoted? - Brian From brian at briansmith.org Thu Jan 24 19:59:07 2008 From: brian at briansmith.org (Brian Smith) Date: Thu, 24 Jan 2008 10:59:07 -0800 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware Message-ID: <000401c85ebb$35385a40$0401a8c0@T60> My application correctly responds to HEAD requests as-is. However, it doesn't work with middleware that sets headers based on the content of the response body. For example, a gateway or middleware that sets ETag based on a checksum, Content-Encoding, Content-Length and/or Content-MD5 will all result in wrong results by default. Right now, my applications assume that any such gateway or the first such middleware will change environ["REQUEST_METHOD"] from "HEAD" to "GET" before the application is invoked, and discard the response body that the application generates. However, many gateways and middleware do not do this, and PEP 333 doesn't have anything to say about it.
As a result, a 100% WSGI 1.0-compliant application is not portable between gateways. I suggest that a revision of PEP 333 should require the following behavior: 1. WSGI gateways must always set environ["REQUEST_METHOD"] to "GET" for HEAD requests. Middleware and applications will not be able to detect the difference between GET and HEAD requests. 2. For a HEAD request, a WSGI gateway must not iterate through the response iterable, but it must call the response iterable's close() method, if any. It must not send any output that was written via start_response(...).write() either. Consequently, WSGI applications must work correctly, and must not leak resources, when their output is not iterated; an application should not signal or log an error if the iterable's close() method is invoked without any iteration taking place. Please add this issue to http://wsgi.org/wsgi/WSGI_2.0. Regards, Brian From chrism at plope.com Thu Jan 24 20:16:55 2008 From: chrism at plope.com (Chris McDonough) Date: Thu, 24 Jan 2008 14:16:55 -0500 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <000401c85ebb$35385a40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> Message-ID: <4798E427.3090403@plope.com> I have applications that do detect the difference between a GET and a HEAD (they do slightly less work if the request is a HEAD request), so I suspect this is not a totally reasonable thing to add to the spec. Maybe the middleware that does what you're describing should be changed instead to deal with HEAD requests. In general, I don't think there is (or should be) any guarantee that an arbitrary middleware stack will work with an arbitrary application. Although that would be nice in theory, I suspect it would require a very complex protocol (more complex than what WSGI requires now). - C Brian Smith wrote: > My application correctly responds to HEAD requests as-is.
However, it doesn't work with middleware that sets headers based on the content of the response body. > > For example, a gateway or middleware that sets ETag based on an checksum, Content-Encoding, Content-Length and/or Content-MD5 will all result in wrong results by default. Right now, my applications assume that any such gateway or the first such middleware will change environ["REQUEST_METHOD"] from "HEAD" to "GET" before the application is invoked, and discard the response body that the application generates. > > However, many gateways and middleware do not do this, and PEP 333 doesn't have anything to say about it. As a result, a 100% WSGI 1.0-compliant application is not portable between gateways. > > I suggest that a revision of PEP 333 should require the following behavior: > > 1. WSGI gateways must always set environ["REQUEST_METHOD"] to "GET" for HEAD requests. Middleware and applications will not be able to detect the difference between GET and HEAD requests. > > 2. For a HEAD request, A WSGI gateway must not iterate through the response iterable, but it must call the response iterable's close() method, if any. It must not send any output that was written via start_response(...).write() either. Consequently, WSGI applications must work correctly, and must not leak resources, when their output is not iterated; an application should not signal or log an error if the iterable's close() method is invoked without any iteration taking place. > > Please add this issue to http://wsgi.org/wsgi/WSGI_2.0. 
> > Regards, > Brian From manlio_perillo at libero.it Thu Jan 24 20:35:13 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 24 Jan 2008 20:35:13 +0100 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <000401c85ebb$35385a40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> Message-ID: <4798E871.5050306@libero.it> Brian Smith ha scritto: > My application correctly responds to HEAD requests as-is. However, it doesn't work with middleware that sets headers based on the content of the response body. > > For example, a gateway or middleware that sets ETag based on an checksum, Content-Encoding, Content-Length and/or Content-MD5 will all result in wrong results by default. Right now, my applications assume that any such gateway or the first such middleware will change environ["REQUEST_METHOD"] from "HEAD" to "GET" before the application is invoked, and discard the response body that the application generates. > > However, many gateways and middleware do not do this, and PEP 333 doesn't have anything to say about it. As a result, a 100% WSGI 1.0-compliant application is not portable between gateways. > > I suggest that a revision of PEP 333 should require the following behavior: > > 1. WSGI gateways must always set environ["REQUEST_METHOD"] to "GET" for HEAD requests. Middleware and applications will not be able to detect the difference between GET and HEAD requests. > -1. > 2. For a HEAD request, A WSGI gateway must not iterate through the response iterable, but it must call the response iterable's close() method, if any. It must not send any output that was written via start_response(...).write() either.
Consequently, WSGI applications must work correctly, and must not leak resources, when their output is not iterated; an application should not signal or log an error if the iterable's close() method is invoked without any iteration taking place. > This is done in the WSGI implementation for Nginx, as an example; and some time ago there was a discussion about this. Moreover, if the response iterable is a generator, no iteration (and content generation) is done. > Please add this issue to http://wsgi.org/wsgi/WSGI_2.0. > > Regards, > Brian > Regards Manlio Perillo From ianb at colorstudy.com Thu Jan 24 20:54:34 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 13:54:34 -0600 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <003701c85eb3$8c62c7e0$0401a8c0@T60> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> <20080123182555.68D723A40A2@sparrow.telecommunity.com> <47982E8E.6000209@colorstudy.com> <003701c85eb3$8c62c7e0$0401a8c0@T60> Message-ID: <4798ECFA.5060904@colorstudy.com> Brian Smith wrote: > Ian Bicking wrote: > >> We encountered it with GData too, as it uses URLs like >> /{http:%2f%2fexample.com}term/. But if you balance the {}'s >> you can parse it out. > > Unquoted curly braces are illegal in any kind of URI or IRI. Does GData > really require them to be unquoted? No, quoted is fine. 
Of course parsing PATH_INFO I couldn't tell anyway ;) Ian From ianb at colorstudy.com Thu Jan 24 21:37:47 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 14:37:47 -0600 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <000401c85ebb$35385a40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> Message-ID: <4798F71B.5080105@colorstudy.com> Brian Smith wrote: > My application correctly responds to HEAD requests as-is. However, it > doesn't work with middleware that sets headers based on the content > of the response body. > > For example, a gateway or middleware that sets ETag based on an > checksum, Content-Encoding, Content-Length and/or Content-MD5 will > all result in wrong results by default. Right now, my applications > assume that any such gateway or the first such middleware will change > environ["REQUEST_METHOD"] from "HEAD" to "GET" before the application > is invoked, and discard the response body that the application > generates. Then the middleware is just wrong. It shouldn't overwrite ETag values generated by the application, and if it is set to generate ETags from hashes of the content then it should change HEAD to GET. > However, many gateways and middleware do not do this, and PEP 333 > doesn't have anything to say about it. As a result, a 100% WSGI > 1.0-compliant application is not portable between gateways. Nothing in WSGI says that all middleware is sensible or correct. In this case it just seems like there's a bad middleware involved that isn't respecting basic HTTP semantics. WSGI doesn't specify HTTP semantics but of course they are a basic foundation for any kind of interaction, and it's assumed they'll be respected. 
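The behaviour described above (leave an application-supplied ETag alone, and switch HEAD to GET when headers are derived from the body) might look roughly like the following. It is a sketch under stated assumptions, not an existing middleware: `etag_middleware` and `demo_app` are hypothetical names, SHA-256 is an arbitrary choice of hash, and the whole response is buffered in memory.

```python
import hashlib


def etag_middleware(app):
    """Hypothetical middleware: add an ETag hashed from the response body."""

    def wrapped(environ, start_response):
        is_head = environ.get("REQUEST_METHOD") == "HEAD"
        if is_head:
            # Ask the application for the full GET response so that the
            # headers match what a GET would have produced.
            environ = dict(environ, REQUEST_METHOD="GET")

        captured = {}
        body_parts = []

        def capture(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = list(headers)
            return body_parts.append  # stands in for the write() callable

        result = app(environ, capture)
        try:
            body_parts.extend(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        body = b"".join(body_parts)
        headers = captured["headers"]
        # Do not overwrite an ETag generated by the application itself.
        if not any(name.lower() == "etag" for name, _ in headers):
            digest = hashlib.sha256(body).hexdigest()
            headers.append(("ETag", '"%s"' % digest))

        start_response(captured["status"], headers)
        # For HEAD the body is discarded only after the headers are computed.
        return [] if is_head else [body]

    return wrapped


def demo_app(environ, start_response):
    # Trivial application used only to exercise the sketch.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def call(method):
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["headers"] = headers

    body = b"".join(etag_middleware(demo_app)({"REQUEST_METHOD": method}, start_response))
    return seen["headers"], body


head_headers, head_body = call("HEAD")
get_headers, get_body = call("GET")
```

With this arrangement the application never sees HEAD, so an application that wants to do less work for HEAD requests would not get the chance when placed behind such a middleware.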
Ian From ianb at colorstudy.com Thu Jan 24 21:45:11 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 14:45:11 -0600 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <47989F1D.7080802@libero.it> References: <47989F1D.7080802@libero.it> Message-ID: <4798F8D7.1010002@colorstudy.com> Manlio Perillo wrote: > I have implemented the wsgiorg.routing_args specification, using the > code in the example. > > However I have a problem, and I can't see a good solution. > > Suppose that an application is mounted (embedded in a web server) at > location "/example". > > The application script executed by the server simply setups the > routings, database connections and so. > > Let's suppose that the request uri is "/example/login/". > > For the main application, SCRIPT_NAME is "/example". > For the application at "/login", SCRIPT_NAME is "/example/login". > > > My problem is that I want, in the page generated by "login" application, > return an anchor in the form "/example/logout/". > > The usual solution is to do environ['SCRIPT_NAME'] + '/logout', but this > will return "/example/login/logout/", and not "/example/logout/". > > This seems to be not possible with the current specifications, since the > "original" SCRIPT_NAME is lost. > > What is the best solution? > > 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a > list. > > This means that the request uri recostruction must be changed: > SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) I suppose you could leave stuff on PATH_INFO. But that doesn't seem to fit with the idea of PATH_INFO. Also, will it be strictly SCRIPT_NAME/consumed_path/PATH_INFO, or could it be SCRIPT_NAME/consumed_path/some_other_parsing/consumed_path/PATH_INFO -- after all, there's cases where stuff gets pushed from PATH_INFO to SCRIPT_NAME, and if consumed_path is in between, which one do you push stuff to? 
> 2) Store a wsgiorg.original_script_name, with the value seen by the > routing application. I guess I usually do something like this, typically storing myapp.base_path for use when I am generating application-absolute URLs (like /logout). Then at the first chance (before running any kind of routing) I do "environ['myapp.base_path'] = environ['SCRIPT_NAME']". This ad hoc technique works fine, but is very ad hoc. I'm not sure what the best way to handle this is, really. I'm not sure there's a singular root for an entire request, if you are nesting applications, so a single key (wsgiorg.original_script_name) doesn't seem quite right. I can't remember what Routes does for URL generation. Maybe it leaves SCRIPT_NAME alone? I think so. Ian From graham.dumpleton at gmail.com Fri Jan 25 00:12:15 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 25 Jan 2008 10:12:15 +1100 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <000401c85ebb$35385a40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> Message-ID: <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> On 25/01/2008, Brian Smith wrote: > 1. WSGI gateways must always set environ["REQUEST_METHOD"] to "GET" for HEAD requests. Middleware and applications will not be able to detect the difference between GET and HEAD requests. > > 2. For a HEAD request, A WSGI gateway must not iterate through the response iterable, but it must call the response iterable's close() method, if any. It must not send any output that was written via start_response(...).write() either. Consequently, WSGI applications must work correctly, and must not leak resources, when their output is not iterated; an application should not signal or log an error if the iterable's close() method is invoked without any iteration taking place. > > Please add this issue to http://wsgi.org/wsgi/WSGI_2.0.
This would go against how things are done with Apache and could cause Apache to generate incorrect response headers for a HEAD request. The issue here is that Apache has its own output filtering system where filters can set headers based on the actual content. Because of this, any output filter must always receive the response content regardless of whether the request is a GET or HEAD. If an application handler tries to optimise things and not return the content, then these output filters may generate different headers for a HEAD request than a GET request, thereby violating the requirement that they should actually be the same. Note that response content is still thrown away for a HEAD request, it is just done at the very last moment after all Apache output filters have processed the data. Graham From brian at briansmith.org Fri Jan 25 03:51:21 2008 From: brian at briansmith.org (Brian Smith) Date: Thu, 24 Jan 2008 18:51:21 -0800 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> References: <000401c85ebb$35385a40$0401a8c0@T60> <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> Message-ID: <004201c85efd$2dfdbe40$0401a8c0@T60> Graham Dumpleton wrote: > The issue here is that Apache has its own output filtering > system where filters can set headers based on the actual > content. Because of this, any output filter must always > receive the response content regardless of whether the > request is a GET or HEAD. If an application handler tries to > optimise things and not return the content, then these output > filters may generate different headers for a HEAD request > than a GET request, thereby violating the requirement that > they should actually be the same. > > Note that response content is still thrown away for a HEAD > request, it is just done at the very last moment after all > Apache output filters have processed the data. 
Right, that is exactly what I am saying. In Apache's documentation, it says that every handler should include the response entity for HEAD requests, so that output filters can process the output. However, there is nothing in PEP 333 that talks about this behavior. So, the only reasonable thing to do is to assume that, when environ["REQUEST_METHOD"] == "HEAD", no response entity should be generated. Do we all agree that the following application is correct?:

    def application(env, start_response):
        start_response("200 OK",
                       [("Content-Length", "10000")])
        if env["REQUEST_METHOD"] == "HEAD":
            return []
        else:
            return ["a"*10000]

Because of web servers' output filters, if the WSGI gateway is a web server module or a [Fast]CGI script, then it needs to lie and tell the application that the request is a "GET", not a "HEAD." Otherwise, the application will see that the request method is "HEAD" and suppress its own response entity, as the HTTP specification requires, and the output filters will fail. The only time it is reasonable for the gateway to pass "HEAD" as the request method is when it knows that there are not any output filters/middleware that depend on the response entity. Usually that is only possible in standalone web servers like CherryPy's or Paste's.

I tested this in mod_wsgi and mod_wsgi gets it wrong. mod_wsgi sets env["REQUEST_METHOD"] to "HEAD" for HEAD requests. When mod_deflate is enabled, a HEAD request returns "Content-Length: 20", and a GET request returns "Content-Length: 46". However, it is supposed to be "Content-Length: 46" in both cases. The CGI WSGI gateway in PEP 333 gets it wrong too when mod_deflate is used.

Note also that in mod_wsgi, use of wsgi.file_wrapper is a huge optimization for this: if no Apache output filters need the response entity, and wsgi.file_wrapper is used, then the file will never be read off the disk.
But, if wsgi.file_wrapper is not used, then the entire file has to be read off the disk through the application's output iterable for no reason. It would be nice if the non-file_wrapper case worked as well as the file_wrapper case. If you put all this together, you end up with the rules that I outlined in my previous message: > 1. WSGI gateways must always set environ["REQUEST_METHOD"] to > "GET" for HEAD requests. Middleware and applications will > not be able to detect the difference between GET and HEAD > requests. > > 2. For a HEAD request, A WSGI gateway must not iterate > through the response iterable, but it must call the > response iterable's close() method, if any. It must not > send any output that was written via > start_response(...).write() either. Consequently, > WSGI applications must work correctly, and must not > leak resources, when their output is not iterated; > an application should not signal or log an error if > the iterable's close() method is invoked without any > iteration taking place. - Brian From brian at briansmith.org Fri Jan 25 03:53:15 2008 From: brian at briansmith.org (Brian Smith) Date: Thu, 24 Jan 2008 18:53:15 -0800 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <4798E427.3090403@plope.com> References: <000401c85ebb$35385a40$0401a8c0@T60> <4798E427.3090403@plope.com> Message-ID: <004301c85efd$6ebf0b50$0401a8c0@T60> Chris McDonough wrote: > I have applications that do detect the difference between a > GET and a HEAD (they do slightly less work if the request is > a HEAD request), so I suspect this is not a totally > reasonable thing to add to the spec. Yes, of course. In order to avoid doing unnecessary work for a HEAD request, the extra work needs to be transferred to the response iterable; for a HEAD request, the gateway would skip the iterable except for its close() method, and so all the extra work is skipped as well. 
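As a rough illustration of moving the extra work into the response iterable, here is a sketch under stated assumptions: DeferredBody, render_report and serve_head are hypothetical names, and serve_head stands in for the gateway side of the proposed rules (headers only, no iteration, close() still called). Under rule 1 the application sees "GET" even when the client sent HEAD.

```python
class DeferredBody:
    """Hypothetical iterable: the expensive work runs only if it is iterated."""

    def __init__(self, produce):
        self._produce = produce
        self.closed = False

    def __iter__(self):
        # The work is deferred until a gateway actually asks for the body.
        return iter(self._produce())

    def close(self):
        # Must be safe to call even if iteration never happened.
        self.closed = True


work_log = []


def render_report():
    # Stand-in for the extra work a GET needs and a HEAD does not.
    work_log.append("rendered")
    return [b"report body"]


def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return DeferredBody(render_report)


def serve_head(app, environ):
    """Gateway side for HEAD: collect headers, skip iteration, call close()."""
    sent = {}

    def start_response(status, headers, exc_info=None):
        sent["status"] = status
        sent["headers"] = headers

    result = app(environ, start_response)
    close = getattr(result, "close", None)
    if close is not None:
        close()
    return sent, result
```

With this shape the application never needs to know the request was a HEAD; whether render_report() runs is decided entirely by whether something iterates the body.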
> Maybe instead the middleware that does what you're describing should be changed > instead to deal with HEAD requests. I agree. But, this problem is often overlooked by middleware, which indicates that we at least need an explanation of the problem in the specification. But, when the middleware are corrected, then applications like yours will only work efficiently if they transfer the extra work they do for GET (vs. HEAD) requests to the response iterable. > In general, I don't think is (or should be) any guarantee > that an arbitrary middleware stack will work with an > arbitrary application. Although that would be nice in > theory, I suspect it would require a very complex protocol > (more complex than what WSGI requires now). That is exactly what WSGI is designed for. There is no point to having a standard if there is no interoperability amongst compliant components. - Brian From ianb at colorstudy.com Fri Jan 25 04:30:50 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 Jan 2008 21:30:50 -0600 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <004201c85efd$2dfdbe40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> <004201c85efd$2dfdbe40$0401a8c0@T60> Message-ID: <479957EA.5090103@colorstudy.com> Brian Smith wrote: > Graham Dumpleton wrote: >> The issue here is that Apache has its own output filtering >> system where filters can set headers based on the actual >> content. Because of this, any output filter must always >> receive the response content regardless of whether the >> request is a GET or HEAD. If an application handler tries to >> optimise things and not return the content, then these output >> filters may generate different headers for a HEAD request >> than a GET request, thereby violating the requirement that >> they should actually be the same. 
>> >> Note that response content is still thrown away for a HEAD >> request, it is just done at the very last moment after all >> Apache output filters have processed the data. > > Right, that is exactly what I am saying. In Apache's documentation, it > says that every handler should include the response entity for HEAD > requests, so that output filters can process the output. However, there > is nothing in PEP 333 that talks about this behavior. Unlike Apache there are no output filters in WSGI; all middleware gets to adjust the request as well as the response. So middleware that can't handle a real HEAD request has an opportunity to turn it into a GET request. I don't see why PEP 333 needs to talk about this, to me it seems straight forward enough in a WSGI context, and PEP 333 can't cover every possible bug someone might introduce into their middleware. Ian From graham.dumpleton at gmail.com Fri Jan 25 04:43:54 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 25 Jan 2008 14:43:54 +1100 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <004201c85efd$2dfdbe40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> <004201c85efd$2dfdbe40$0401a8c0@T60> Message-ID: <88e286470801241943m17e97303v7c7b33432bc0f39b@mail.gmail.com> On 25/01/2008, Brian Smith wrote: > Graham Dumpleton wrote: > > The issue here is that Apache has its own output filtering > > system where filters can set headers based on the actual > > content. Because of this, any output filter must always > > receive the response content regardless of whether the > > request is a GET or HEAD. If an application handler tries to > > optimise things and not return the content, then these output > > filters may generate different headers for a HEAD request > > than a GET request, thereby violating the requirement that > > they should actually be the same. 
> > > > Note that response content is still thrown away for a HEAD > > request, it is just done at the very last moment after all > > Apache output filters have processed the data. > > Right, that is exactly what I am saying. To quote, in 2 you said: """For a HEAD request, A WSGI gateway must not iterate through the response iterable""" I was presuming that this was saying that the WSGI gateway should do this as well as changing the REQUEST_METHOD actually sent to the WSGI application to GET. If Apache mod_wsgi (the WSGI gateway) does then do this, ie., didn't iterate through the iterable and therefore didn't return the content through to Apache, it would as explained cause traditional Apache output filters to potentially yield incorrect results. This is what I am highlighting. So Apache mod_wsgi couldn't avoid processing the iterable, unless as you allude to with how internals of how Apache is used to implement wsgi.file_wrapper support, that mod_wsgi similarly detected when no Apache output filters are registered that could add additional headers and skip the processing. Some clarification in 2 is perhaps required. > In Apache's documentation, it > says that every handler should include the response entity for HEAD > requests, so that output filters can process the output. However, there > is nothing in PEP 333 that talks about this behavior. So, the only > reasonable thing to do is to assume that, when environ["REQUEST_METHOD"] > == "HEAD", no response entity should be generated. Do we all agree that > the following application is correct?: > > def application(env, start_response): > start_response("200 OK", > [("Content-Length", "10000")]) > if env["REQUEST_METHOD"] == "HEAD": > return [] > else: > return ["a"*10000] > > Because of web servers' output filters, if the WSGI gateway is an web > server module or a [Fast]CGI script, then it needs to lie and tell the > application that the request is a "GET", not a "HEAD." 
Otherwise, the > application will see that the request method is "HEAD" and suppress its > own response entity, as the HTTP specification requires, and the output > filters will fail. The only time it is reasonable for the gateway to > pass "HEAD" as the request method is when it knows that there are not > any output filters/middleware that depend on the response entity. > Usually that is only possible in standalone web servers like CherryPy's > or Paste's. > > I tested this in mod_wsgi and mod_wsgi gets it wrong. mod_wsgi sets > env["REQUEST_METHOD"] to "HEAD" for HEAD requests. It just passes whatever Apache sets up as the CGI environment. > When mod_deflate is > enabled, a HEAD request returns "Content-Length: 20", and a GET request > returns "Content-Length: 46". However, it is supposed to be > "Content-Length: 46" in both cases. Is this with your sample application which detects HEAD and doesn't return anything if it is found. In other words, it is driven by what your application is actually returning? Am not saying your application is wrong or right, am just trying to determine if you are saying that there is a problem in Apache mod_wsgi separate to the what it is passing as REQUEST_METHOD to cause that. > The CGI WSGI gateway in PEP 333 gets > it wrong too when mod_deflate is used. > > Note also that in mod_wsgi, use of wsgi.file_wrapper is a huge > optimization for this: if no Apache output filters need the response > entity, and wsgi.file_wrapper is used, then the file will never be read > off the disk. Hmmm, I didn't actually look under the covers of what Apache did when I used its file bucket for that. Worked out better than I expected then. :-) > But, if wsgi.file_wrapper is not used, then the entire > file has to be read off the disk through the application's output > iterable for no reason. It would be nice if the non-file_wrapper case > worked as well as the file_wrapper case. 
> > If you put all this together, you end up with the rules that I outlined > in my previous message: Except as pointed out that 2 suggests I should never pass on content from iterable for HEAD, where in practice I still have to if there are output filters. Pardon me if I am not understanding very well, I did not get much sleep last night because of baby and my head hurts. :-( Graham > > 1. WSGI gateways must always set environ["REQUEST_METHOD"] to > > "GET" for HEAD requests. Middleware and applications will > > not be able to detect the difference between GET and HEAD > > requests. > > > > 2. For a HEAD request, A WSGI gateway must not iterate > > through the response iterable, but it must call the > > response iterable's close() method, if any. It must not > > send any output that was written via > > start_response(...).write() either. Consequently, > > WSGI applications must work correctly, and must not > > leak resources, when their output is not iterated; > > an application should not signal or log an error if > > the iterable's close() method is invoked without any > > iteration taking place. 
> > - Brian > > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com > From graham.dumpleton at gmail.com Fri Jan 25 04:52:25 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 25 Jan 2008 14:52:25 +1100 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <479957EA.5090103@colorstudy.com> References: <000401c85ebb$35385a40$0401a8c0@T60> <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> <004201c85efd$2dfdbe40$0401a8c0@T60> <479957EA.5090103@colorstudy.com> Message-ID: <88e286470801241952m50986066v90ac3913bb29bc62@mail.gmail.com> On 25/01/2008, Ian Bicking wrote: > Brian Smith wrote: > > Graham Dumpleton wrote: > >> The issue here is that Apache has its own output filtering > >> system where filters can set headers based on the actual > >> content. Because of this, any output filter must always > >> receive the response content regardless of whether the > >> request is a GET or HEAD. If an application handler tries to > >> optimise things and not return the content, then these output > >> filters may generate different headers for a HEAD request > >> than a GET request, thereby violating the requirement that > >> they should actually be the same. > >> > >> Note that response content is still thrown away for a HEAD > >> request, it is just done at the very last moment after all > >> Apache output filters have processed the data. > > > > Right, that is exactly what I am saying. In Apache's documentation, it > > says that every handler should include the response entity for HEAD > > requests, so that output filters can process the output. However, there > > is nothing in PEP 333 that talks about this behavior. 
> > Unlike Apache there are no output filters in WSGI; Well, the concept of output filters does exist in WSGI, they are just called something different. ;-) Anyway, the end result is the same, it is just that how they are modeled in the worlds of Apache and WSGI at the interface level is different. Graham > all middleware gets > to adjust the request as well as the response. So middleware that can't > handle a real HEAD request has an opportunity to turn it into a GET > request. I don't see why PEP 333 needs to talk about this, to me it > seems straight forward enough in a WSGI context, and PEP 333 can't cover > every possible bug someone might introduce into their middleware. > > Ian > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com > From brian at briansmith.org Fri Jan 25 05:05:59 2008 From: brian at briansmith.org (Brian Smith) Date: Thu, 24 Jan 2008 20:05:59 -0800 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <88e286470801241943m17e97303v7c7b33432bc0f39b@mail.gmail.com> References: <000401c85ebb$35385a40$0401a8c0@T60> <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> <004201c85efd$2dfdbe40$0401a8c0@T60> <88e286470801241943m17e97303v7c7b33432bc0f39b@mail.gmail.com> Message-ID: <005001c85f07$98b66840$0401a8c0@T60> Graham Dumpleton wrote: > To quote, in 2 you said: > > """For a HEAD request, A WSGI gateway must not iterate > through the response iterable""" > > I was presuming that this was saying that the WSGI gateway > should do this as well as changing the REQUEST_METHOD > actually sent to the WSGI application to GET. I misstated it. It should be "For a HEAD request, A WSGI gateway *may* skip iterating through the response iterable". 
That is, if the gateway can detect that the request entity isn't going to change the final set of headers in any way, it can skip the iteration. > If Apache mod_wsgi (the WSGI gateway) does then do this, ie., > didn't iterate through the iterable and therefore didn't > return the content through to Apache, it would as explained > cause traditional Apache output filters to potentially yield > incorrect results. This is what I am highlighting. > > So Apache mod_wsgi couldn't avoid processing the iterable, > unless as you allude to with how internals of how Apache is > used to implement wsgi.file_wrapper support, that mod_wsgi > similarly detected when no Apache output filters are > registered that could add additional headers and skip the processing. Right, my idea was that mod_wsgi could implement a new bucket type, where the iteration is done if and only if some output filter reads from the bucket. But, if no output filters read from the bucket, then the iteration would never happen. > > def application(env, start_response): > > start_response("200 OK", > > [("Content-Length", "10000")]) > > if env["REQUEST_METHOD"] == "HEAD": > > return [] > > else: > > return ["a"*10000] > > > > I tested this in mod_wsgi and mod_wsgi gets it wrong. mod_wsgi sets > > env["REQUEST_METHOD"] to "HEAD" for HEAD requests. > > It just passes whatever Apache sets up as the CGI environment. > > > When mod_deflate is > > enabled, a HEAD request returns "Content-Length: 20", and a GET > > request returns "Content-Length: 46". However, it is supposed to be > > "Content-Length: 46" in both cases. > > Is this with your sample application which detects HEAD and > doesn't return anything if it is found. In other words, it is > driven by what your application is actually returning? Yes, these results are from the program above. Those 10,000 A's compress down to 26 bytes, plus the 20 byte header. For the HEAD case, mod_deflate compresses 0 bytes to 0 bytes and adds a 20 byte header. 
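The effect can be seen in outline without Apache. The sketch below uses Python's gzip module as a stand-in for mod_deflate, which is an assumption: the exact byte counts depend on the compressor and its settings and need not match the 20 and 46 bytes reported above. The point is only that an empty body still compresses to a non-empty stream, so a length computed after compression differs between the two cases.

```python
import gzip


def compressed_length(body):
    """Length a compressing output filter would report for this body."""
    return len(gzip.compress(body))


# Body the application returns for HEAD (nothing) versus for GET.
head_length = compressed_length(b"")
get_length = compressed_length(b"a" * 10000)

print("HEAD Content-Length:", head_length)
print("GET  Content-Length:", get_length)
```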
> > Note also that in mod_wsgi, use of wsgi.file_wrapper is a huge > > optimization for this: if no Apache output filters need the > > response entity, and wsgi.file_wrapper is used, then the file > > will never be read off the disk. > > Hmmm, I didn't actually look under the covers of what Apache > did when I used its file bucket for that. Worked out better > than I expected then. :-) I will double-check, but I believe that in the embedded mode, the file never gets read at all, when there are no output filters processing the output. I will bring it up on the mod_wsgi list. > Except as pointed out that 2 suggests I should never pass on > content from iterable for HEAD, where in practice I still > have to if there are output filters. > > Pardon me if I am not understanding very well, I did not get > much sleep last night because of baby and my head hurts. :-( Not your (or your daughter's) fault; I wrote something different from what I meant. I hope tonight is easier on you. Good luck! Regards, Brian From graham.dumpleton at gmail.com Fri Jan 25 05:14:37 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 25 Jan 2008 15:14:37 +1100 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <005001c85f07$98b66840$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> <88e286470801241512j44d27d14lc0c433078187628b@mail.gmail.com> <004201c85efd$2dfdbe40$0401a8c0@T60> <88e286470801241943m17e97303v7c7b33432bc0f39b@mail.gmail.com> <005001c85f07$98b66840$0401a8c0@T60> Message-ID: <88e286470801242014qdf043fcj413200809ef4cdaa@mail.gmail.com> On 25/01/2008, Brian Smith wrote: > Graham Dumpleton wrote: > > If Apache mod_wsgi (the WSGI gateway) does then do this, ie., > > didn't iterate through the iterable and therefore didn't > > return the content through to Apache, it would as explained > > cause traditional Apache output filters to potentially yield > > incorrect results. This is what I am highlighting. 
> > > > So Apache mod_wsgi couldn't avoid processing the iterable, > > unless as you allude to with how internals of how Apache is > > used to implement wsgi.file_wrapper support, that mod_wsgi > > similarly detected when no Apache output filters are > > registered that could add additional headers and skip the processing. > > Right, my idea was that mod_wsgi could implement a new bucket type, > where the iteration is done if and only if some output filter reads from > the bucket. But, if no output filters read from the bucket, then the > iteration would never happen. Unfortunately as I think I mentioned on mod_wsgi list previously, that may not be trivial. :-) > > Pardon me if I am not understanding very well, I did not get > > much sleep last night because of baby and my head hurts. :-( > > Not your (or your daughter's) fault; I wrote something different from > what I meant. Okay, clearer now. > I hope tonight is easier on you. Good luck! I hope so too. Am going home early now, but the boss will probably not allow me to read email for a couple of days until I am fully recovered, so you'll probably not hear from me more on this issue. I certainly understand what you are saying and the potential need for it, so will be interesting to see what final consensus is. Graham From brian at briansmith.org Fri Jan 25 16:04:37 2008 From: brian at briansmith.org (Brian Smith) Date: Fri, 25 Jan 2008 07:04:37 -0800 Subject: [Web-SIG] environ["wsgi.input"].read() Message-ID: <00a601c85f63$9ae4dc30$0401a8c0@T60> 1. PEP 333 doesn't indicate that the size parameter for the read() method is optional. Is it optional or required? If it is optional, is the default value -1? 2. What are the semantics of environ["wsgi.input"].read(-1) when Content-Length is provided? Is it guaranteed to return the entire request entity, up to at most bytes? 3. What are the semantics of environ["wsgi.input"].read(-1) when the response has no Content-Length? 
Can environ["wsgi.input"].read(-1) be used (as the only available mechanism) to read a chunked response entity? Putting all this together, are these two programs correct?:

    def application(environ, start_response):
        start_response("200 OK", [])
        yield environ["wsgi.input"].read()

    def application(environ, start_response):
        start_response("200 OK", [])
        yield environ["wsgi.input"].read(-1)

This is another issue where there is a lot of variance between gateways, where I think a clarification in the specification is needed. - Brian From foom at fuhm.net Fri Jan 25 18:20:57 2008 From: foom at fuhm.net (James Y Knight) Date: Fri, 25 Jan 2008 12:20:57 -0500 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <00a601c85f63$9ae4dc30$0401a8c0@T60> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> Message-ID: On Jan 25, 2008, at 10:04 AM, Brian Smith wrote: > 1. PEP 333 doesn't indicate that the size parameter for the read() > method is optional. Is it optional or required? If it is optional, > is the default value -1? The spec says it's required (by virtue of not saying it's optional) > 2. What are the semantics of environ["wsgi.input"].read(-1) when > Content-Length is provided? Is it guaranteed to return the entire > request entity, up to at most bytes? There is no such guarantee written in the spec, so you should assume it's not guaranteed. > 3. What are the semantics of environ["wsgi.input"].read(-1) when the > response has no Content-Length? Can environ["wsgi.input"].read(-1) > be used (as the only available mechanism) to read a chunked response > entity? The CGI specification, and thus WSGI by implication, doesn't allow for chunked input. The CONTENT_LENGTH environment key is a required value if there is content. The only correct thing for a gateway to do is to reject a request with chunked input.
> Putting all this together, are these two programs correct?: > > def application(environ, start_response): > start_response("200 OK", []) > yield environ["wsgi.input"].read() > > def application(environ, start_response): > start_response("200 OK", []) > yield environ["wsgi.input"].read(-1) No, they rely on non-standard behavior. > This is another issue where there is a lot of variance between > gateways, where I think a clarification in the specification is > needed. The spec is fairly clear as to what you can rely on here. Additional behavior may of course be implemented in some gateway, but it's going to be non-portable. James From brian at briansmith.org Fri Jan 25 19:23:24 2008 From: brian at briansmith.org (Brian Smith) Date: Fri, 25 Jan 2008 10:23:24 -0800 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: References: <00a601c85f63$9ae4dc30$0401a8c0@T60> Message-ID: <00ba01c85f7f$5fd41900$0401a8c0@T60> James Y Knight wrote: > On Jan 25, 2008, at 10:04 AM, Brian Smith wrote: > > 1. PEP 333 doesn't indicate that the size parameter for the read() > > method is optional. Is it optional or required? If it is > > optional, is the default value -1? > > The spec says it's required (by virtue of not saying it's optional) I would agree, except PEP 333 also says it is a file-like object. The definition of file-like object at http://docs.python.org/lib/bltin-file-objects.html implies that the size parameter is optional. Note that the behaviors that are optional for file-like objects are in a different section than the one that defines the read() method with the optional parameter. > The CGI specification, and thus WSGI by implication, > doesn't allow for chunked input. The CONTENT_LENGTH > environment key is a required value if there is > content. The only correct thing for a gateway to > do is to reject a request with chunked input. The gateway can also decode the chunked entity and calculate the Content-Length before passing it on to the application. 
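A minimal sketch of that alternative, assuming hypothetical helpers decode_chunked and dechunk_environ: the gateway reads the whole chunked body, then hands the application an in-memory wsgi.input together with a computed CONTENT_LENGTH. It assumes wsgi.input is a buffered binary stream whose read(n) returns n bytes unless the stream ends, and it buffers the entire body, which a real gateway might not want to do for large uploads.

```python
import io


def decode_chunked(stream):
    """Read an HTTP/1.1 chunked body from a binary stream; return the bytes."""
    body = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line:
            raise ValueError("unexpected end of chunked body")
        # Chunk extensions after ';' are ignored in this sketch.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        body += chunk
        stream.read(2)  # CRLF that terminates the chunk data
    # Skip any trailer fields up to the terminating blank line.
    while stream.readline() not in (b"\r\n", b"\n", b""):
        pass
    return bytes(body)


def dechunk_environ(environ):
    """Return a copy of environ with a decoded body and a CONTENT_LENGTH."""
    body = decode_chunked(environ["wsgi.input"])
    new_environ = dict(environ)
    new_environ["wsgi.input"] = io.BytesIO(body)
    new_environ["CONTENT_LENGTH"] = str(len(body))
    new_environ.pop("HTTP_TRANSFER_ENCODING", None)
    return new_environ
```

After this step the application sees an ordinary request with a known length, so it never needs to read past CONTENT_LENGTH.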
> The spec is fairly clear as to what you can rely on here. > Additional behavior may of course be implemented in some > gateway, but it's going to be non-portable. I disagree about the clarity of the spec. But, I agree that applications should not rely on the handling of a negative or missing size parameter. - Brian From graham.dumpleton at gmail.com Sun Jan 27 07:44:42 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Sun, 27 Jan 2008 17:44:42 +1100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <00a601c85f63$9ae4dc30$0401a8c0@T60> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> Message-ID: <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> On 26/01/2008, Brian Smith wrote: > 1. PEP 333 doesn't indicate that the size parameter for the read() method is optional. Is it optional or required? If it is optional, is the default value -1? > > 2. What are the semantics of environ["wsgi.input"].read(-1) when Content-Length is provided? Is it guaranteed to return the entire request entity, up to at most bytes? > > 3. What are the semantics of environ["wsgi.input"].read(-1) when the response has no Content-Length? Can environ["wsgi.input"].read(-1) be used (as the only available mechanism) to read a chunked response entity? > > Putting all this together, are these two programs correct?: > > def application(environ, start_response): > start_response("200 OK", []) > yield environ["wsgi.input"].read() > > def application(environ, start_response): > start_response("200 OK", []) > yield environ["wsgi.input"].read(-1) > > This is another issue where there is a lot of variance between gateways, where I think a clarification in the specification is needed. I have brought up the issue of chunked encoding and mutating input filters previously, whether they be implemented in Apache or as WSGI middleware. 
For the outcome of that discussion see: http://groups.google.com/group/python-web-sig/browse_frm/thread/25bf70b49a90e0c0 As to your questions about read() with no argument, or with traditional Python file like object default of -1, the only WSGI server/adapter I know of where this will NOT work as one would expect, ie., read remainder of request content, is the CherryPy WSGI adapter. As far as I know it works fine with Apache CGI WSGI adapters, Apache mod_wsgi, plus SCGI, FASTCGI and AJP adapters via flup, as well as with paste WSGI server. Not sure what wsgiref will do though. The reason it doesn't work with CherryPy WSGI server comes down to the problem I highlighted recently. That was the questions I posed in: http://groups.google.com/group/python-web-sig/browse_frm/thread/e46e72cc812870c6 about WSGI adapters not discarding request content which was not consumed. What it all comes down to is that CherryPy WSGI server, unless it has changed, chooses not to simulate EOF as per: """The server is not required to read past the client's specified Content-Length, and is allowed to simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.""" from specification. It is because it just supplies the socket as wsgi.input that it can't do this and that it doesn't do this also leads to the problems with it not being able to discard request content which wasn't consumed, thereby causing problems when request pipelining is occurring as the unconsumed input gets interpreted as the headers of the subsequent request. In contrast, the paste server wraps any actual socket in LimitedLengthFile which simulates EOF but also allows how much content is remaining to be tracked and thus allowing it to be discarded at the end of the request if not consumed. 
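The general idea of such a wrapper can be sketched as follows. This is not Paste's implementation: LengthLimitedInput and its discard() method are hypothetical names, and the sketch assumes the raw stream is a buffered binary file (for example from socket.makefile("rb")) whose read(n) returns n bytes unless the stream ends.

```python
import io


class LengthLimitedInput:
    """Hypothetical wsgi.input wrapper that simulates EOF at Content-Length."""

    def __init__(self, raw, content_length):
        self.raw = raw
        self.remaining = content_length  # bytes of this request not yet read

    def read(self, size=-1):
        # No argument, -1, or an oversized request all mean "what is left".
        if size is None or size < 0 or size > self.remaining:
            size = self.remaining
        if size == 0:
            return b""  # simulated EOF; never touch the next request's bytes
        data = self.raw.read(size)
        self.remaining -= len(data)
        return data

    def discard(self):
        """Called by the server after the request to drop unread content."""
        while self.remaining > 0:
            data = self.raw.read(min(self.remaining, 8192))
            if not data:
                break
            self.remaining -= len(data)


# A connection carrying an 11-byte body followed by the next request.
connection = io.BytesIO(b"hello worldGET /next HTTP/1.1\r\n")
body = LengthLimitedInput(connection, 11)
body.read(5)      # the application reads only part of the body
body.discard()    # the server drops the rest before parsing the next request
```

Because the wrapper knows how much is left, the server can position the connection at the start of the next request regardless of what the application consumed.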
If the WSGI specification simply required that EOF be simulated then read() with no arguments, or -1 argument, could mean return all remaining content with absolutely no problems. Implementations would also naturally lend themselves to dealing with unconsumed input correctly. This would subsequently also allow mutating input filters which change the content length, which could then be flagged by setting Content-Length header to -1. What this still doesn't solve is chunked request content. But then, I don't believe the existing read() method is suitable for that, as what you want with chunked request content, is not return me all input, but return me the next available chunk. As such, some sort of separate abstraction may be required for dealing with chunked request content, using a special argument to read() just isn't going to work. Anyway, in the past, as with many issues it seems people just want to shove this all to be worried about in WSGI 2.0 rather than actually trying to fix all the inconsistencies and sub optimal stuff in WSGI 1.0. All in all I can appreciate the problems some feel in respect of trying to write a true portable WSGI application. If you keep to the core stuff all is okay, start to do complex stuff where the PEP isn't perhaps well defined and you start to run into problems as to what it means and whether it is actually portable. Waiting for WSGI 2.0 isn't really an option since it isn't even going to be interface compatible and frankly may never get done anyway because people will think 1.0 is good enough even if it is not as good as it could be. 
Because I still feel that these details should be fixed prior to WSGI 2.0, I am going to add this and some of the other issues raised recently to: http://www.wsgi.org/wsgi/Amendments_1.0 Graham From manlio_perillo at libero.it Sun Jan 27 12:31:49 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Sun, 27 Jan 2008 12:31:49 +0100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> Message-ID: <479C6BA5.4050401@libero.it> Graham Dumpleton wrote: > [...] > > > I have brought up the issue of chunked encoding and mutating input > filters previously, whether they be implemented in Apache or as WSGI > middleware. For the outcome of that discussion see: > > http://groups.google.com/group/python-web-sig/browse_frm/thread/25bf70b49a90e0c0 > > [...] This is just a suggestion, but what about "requiring" that a WSGI implementation calls the WSGI application only when all the request body has been read? Moreover, it should be possible to register a filter function that will be called when the server reads each chunk of input. With Nginx, as an example, I have chosen this solution, and in fact the wsgi.input is a cString or File object. Unfortunately Nginx still does not implement input filters.
Manlio Perillo From lbruno at 100blossoms.com Sun Jan 27 13:56:29 2008 From: lbruno at 100blossoms.com (Luis Bruno) Date: Sun, 27 Jan 2008 12:56:29 +0000 Subject: [Web-SIG] URL quoting in WSGI (or the lack therof) In-Reply-To: <47967C38.3090507@latte.ca> References: <95513744-FD79-49C4-AD79-165A7BEDFC90@groovie.org> <47920B51.3050201@100blossoms.com> <4793F5AC.3030905@colorstudy.com> <4795D2BE.3000805@100blossoms.com> <001801c85d16$1a3abd60$0501a8c0@T60> <47962FAB.3060108@100blossoms.com> <83C62C33-AFE9-48A8-94A0-5B14A0E3E95E@fuhm.net> <7555ca2e0801221433i206fa746reb6fc6a702e25ef2@mail.gmail.com> <47967C38.3090507@latte.ca> Message-ID: <479C7F7D.1050104@100blossoms.com> Hello, it's me again, Phillip J. Eby wrote: > MoinMoin, for example, has its own encoding scheme for handling > pseudo-slashes in paths, and IMO it's a better way to handle it than > trying to rely on finding a server that supports *not* decoding URLs. I had the abstract knowledge that CGI is still used for deployment, but growing up with application servers must have spoiled me. Still, I think nothing stops mod_wsgi passing an encoded URL down to my apps but for adherence to the CGI spec. I've never checked it, nor the ajp + flup combination. Something more for the todo pile. On the short run I'll $2F my slashes. I can't actually use %252F, because everyone seems to think they'll either get an encoded URL to unquote() or that unquote(unquote()) is a no-op: Routes was not alone in this. Blake Winton wrote: > I respectfully disagree. I've been using %-escapes in urls for years, > intending that they get unescaped before being passed to > applications... %7E instead of ~ mainly. > > in XML you can't tell the difference between and < > and < You've given an example of separate ways to escape the same '<' character, and I agree that you shouldn't have to distinguish between them. 
But XML does treat '<' differently from '&lt;': if you just want to write a '<' instead of starting a tag, you need to escape it. I don't want my SAX code[*] to deal with all the different ways to write a literal '<'. But I expect a " in urls I would expect the url parser to unescape things, and pass you > the unescaped data. Yeah, me too. I just don't want to lose information: "this was a literal slash, not a hierarchy delimiter". But if the framework splits on the real slashes and *then* unquotes each segment, I'd be happy to get that list of unquoted segments. This way, my URLs use the obvious way to escape slashes and by the time it gets to my code I have unescaped data. This could be "dealt with" by using a REQUEST_URI instead. But then I have to manually trim the components that URL dispatching moved into SCRIPT_NAME. And I don't actually *have* a REQUEST_URI in the environ. Ian Bicking wrote: > distinguishing %2f and / is more of a corner case I'll call it a canary in the URL mine. Should you have to balance '{' and '}' to find the quoted namespaces for GData terms? I haven't touched GData, but .split('/') and *then* unquoting looks like what's exactly needed in that case. Thank you, -- Luis Bruno From brian at briansmith.org Sun Jan 27 20:47:03 2008 From: brian at briansmith.org (Brian Smith) Date: Sun, 27 Jan 2008 11:47:03 -0800 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> Message-ID: <000001c8611d$678e1610$0501a8c0@T60> Graham Dumpleton wrote: > On 26/01/2008, Brian Smith wrote: > As to your questions about read() with no argument, or with > traditional Python file like object default of -1, the only > WSGI server/adapter I know of where this will NOT work as one > would expect, ie., read remainder of request content, is the > CherryPy WSGI adapter.
> > As far as I know it works fine with Apache CGI WSGI adapters, > Apache mod_wsgi, plus SCGI, FASTCGI and AJP adapters via > flup, as well as with paste WSGI server. Not sure what > wsgiref will do though. It doesn't work on mod_wsgi either. When I tried it, it only returned 8000 bytes of the input. That is why I started this thread in the first place, actually. If this isn't the behavior you expected, I will file a bug with a test case. (Google Code doesn't allow for attachments to bug reports too, maybe I will create my own "WSGI testcases" project on Google Code to store them all in SVN.) > If the WSGI specification simply required that EOF be simulated then > read() with no arguments, or -1 argument, could mean return > all remaining content with absolutely no problems. > Implementations would also naturally lend themselves to > dealing with unconsumed input correctly. It is too late for WSGI 1.0. The best we can do is say that WSGI gateways and middleware should implement read() like this, but WSGI applications and middleware should not depend on it. > This would subsequently also allow mutating input filters > which change the content length, which could then be flagged > by setting Content-Length header to -1. This has to wait until a new version of WSGI. Too many applications are written with an expectation of a non-negative Content-Length. > What this still doesn't solve is chunked request content. But > then, I don't believe the existing read() method is suitable > for that, as what you want with chunked request content, is > not return me all input, but return me the next available > chunk. As such, some sort of separate abstraction may be > required for dealing with chunked request content, using a > special argument to read() just isn't going to work. I agree that a non-blocking variant of read() would be very useful. 
> Anyway, in the past, as with many issues it seems people just > want to shove this all to be worried about in WSGI 2.0 rather > than actually trying to fix all the inconsistencies and sub > optimal stuff in WSGI 1.0. This issue isn't critical like the GET vs. HEAD issue. WSGI applications can easily work around this issue by simply always supplying a non-negative size argument to read(). The GET/HEAD issue is so tedious to work around that it really needs to be addressed in PEP 333. > All in all I can appreciate the problems some feel in respect > of trying to write a true portable WSGI application. If you > keep to the core stuff all is okay, start to do complex stuff > where the PEP isn't perhaps well defined and you start to run > into problems as to what it means and whether it is actually > portable. Waiting for WSGI 2.0 isn't really an option since > it isn't even going to be interface compatible and frankly > may never get done anyway because people will think 1.0 is > good enough even if it is not as good as it could be. The main problems I have run into are the GET/HEAD issue, and problems with gateways that cannot handle applications that do not read (enough of the) the request body. These are both issues where the example CGI WSGI gateway in the PEP is inadequate, and the inadequacy of the example gateway has spread to other implementations that have overlooked the same issues. It does seem like there is a lot of resistance to modifying PEP 333, even though it is just a draft. There are a lot of benefits to having a feature freeze for WSGI 1.0. But, it is also advantageous to remove any ambiguities in the PEP. In particular, I don't see any disadvantages to adding a statement that the behavior of read() is only well defined when a nonnegative size argument is supplied. 
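The workaround of always passing a non-negative size to read() might look something like the sketch below. It is an illustration under stated assumptions rather than anything taken from the PEP: the helper name read_request_body is hypothetical, and the code uses bytes as in Python 3 (PEP 3333), whereas the discussion here concerns PEP 333 on Python 2.

```python
import io


def read_request_body(environ):
    """Hypothetical helper: read at most CONTENT_LENGTH bytes, never read(-1)."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    if length <= 0:
        return b""  # no declared body: do not touch wsgi.input at all
    stream = environ["wsgi.input"]
    chunks = []
    remaining = length
    while remaining > 0:
        # Always supply an explicit, non-negative size.
        chunk = stream.read(min(remaining, 65536))
        if not chunk:
            break  # gateway simulated EOF or the client disconnected
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def application(environ, start_response):
    body = read_request_body(environ)
    start_response("200 OK", [("Content-Type", "application/octet-stream"),
                              ("Content-Length", str(len(body)))])
    return [body]


# Small demonstration with a fake environ; the stream holds more data than
# the declared length, as a raw socket on a keep-alive connection might.
demo_environ = {"CONTENT_LENGTH": "3", "wsgi.input": io.BytesIO(b"abcdef")}
demo_body = read_request_body(demo_environ)
```

Bounding every read by CONTENT_LENGTH keeps the application independent of whether the gateway simulates EOF, which is the portability concern raised in this thread.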
- Brian From graham.dumpleton at gmail.com Sun Jan 27 23:10:55 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 28 Jan 2008 09:10:55 +1100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <000001c8611d$678e1610$0501a8c0@T60> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <000001c8611d$678e1610$0501a8c0@T60> Message-ID: <88e286470801271410y16baec57h41e7dd0116f7a3f3@mail.gmail.com> On 28/01/2008, Brian Smith wrote: > Graham Dumpleton wrote: > > On 26/01/2008, Brian Smith wrote: > > As to your questions about read() with no argument, or with > > traditional Python file like object default of -1, the only > > WSGI server/adapter I know of where this will NOT work as one > > would expect, ie., read remainder of request content, is the > > CherryPy WSGI adapter. > > > > As far as I know it works fine with Apache CGI WSGI adapters, > > Apache mod_wsgi, plus SCGI, FASTCGI and AJP adapters via > > flup, as well as with paste WSGI server. Not sure what > > wsgiref will do though. > > It doesn't work on mod_wsgi either. When I tried it, it only returned > 8000 bytes of the input. That is why I started this thread in the first > place, actually. If this isn't the behavior you expected, I will file a > bug with a test case. (Google Code doesn't allow for attachments to bug > reports too, maybe I will create my own "WSGI testcases" project on > Google Code to store them all in SVN.) Whoops, you are right. Very very early development versions of mod_wsgi would read everything, but for some reason I changed tack as was experimenting with read() with no argument returning only what was actually available at that time, possibly to see how input chunking could work. Ie., simulating a non blocking read. I thought I had put it back to read everything. Obviously I completely forgot what I was doing. 
:-( Graham From graham.dumpleton at gmail.com Sun Jan 27 23:12:25 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 28 Jan 2008 09:12:25 +1100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <479C6BA5.4050401@libero.it> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <479C6BA5.4050401@libero.it> Message-ID: <88e286470801271412x3bc3a872t40301cc145c342e@mail.gmail.com> On 27/01/2008, Manlio Perillo wrote: > Graham Dumpleton ha scritto: > > [...] > > > > > > I have brought up the issue of chunked encoding and mutating input > > filters previously, whether they be implemented in Apache or as WSGI > > middleware. For the outcome of that discussion see: > > > > http://groups.google.com/group/python-web-sig/browse_frm/thread/25bf70b49a90e0c0 > > > > > [...] > > > This is just a suggestion, but what about "requiring" that a WSGI > implementation calls the WSGI application only when all the request body > has been read? Can't do that. The input content could be dependent on partial response content which has already been returned by the WSGI application. Ie., something which streams in both ways. Graham From manlio_perillo at libero.it Sun Jan 27 23:19:51 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Sun, 27 Jan 2008 23:19:51 +0100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <88e286470801271412x3bc3a872t40301cc145c342e@mail.gmail.com> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <479C6BA5.4050401@libero.it> <88e286470801271412x3bc3a872t40301cc145c342e@mail.gmail.com> Message-ID: <479D0387.10903@libero.it> Graham Dumpleton ha scritto: > On 27/01/2008, Manlio Perillo wrote: >> Graham Dumpleton ha scritto: >>> [...] >> > >>> I have brought up the issue of chunked encoding and mutating input >>> filters previously, whether they be implemented in Apache or as WSGI >>> middleware. 
For the outcome of that discussion see: >>> >>> http://groups.google.com/group/python-web-sig/browse_frm/thread/25bf70b49a90e0c0 >>> >> > [...] >> >> >> This is just a suggestion, but what about "requiring" that a WSGI >> implementation calls the WSGI application only when all the request body >> has been read? > > Can't do that. The input content could be dependent on partial > response content which has already been returned by the WSGI > application. Ie., something which streams in both ways. > Can you make an example of this use case? > Graham > Thanks Manlio Perillo From brian at briansmith.org Sun Jan 27 23:26:30 2008 From: brian at briansmith.org (Brian Smith) Date: Sun, 27 Jan 2008 14:26:30 -0800 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <479D0387.10903@libero.it> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <479C6BA5.4050401@libero.it> <88e286470801271412x3bc3a872t40301cc145c342e@mail.gmail.com> <479D0387.10903@libero.it> Message-ID: <000c01c86133$abca08f0$0501a8c0@T60> Manlio Perillo wrote: > Graham Dumpleton ha scritto: > > On 27/01/2008, Manlio Perillo wrote: > >> This is just a suggestion, but what about "requiring" that a WSGI > >> implementation calls the WSGI application only when all > >> the request body has been read? > > > > Can't do that. The input content could be dependent on partial > > response content which has already been returned by the WSGI > > application. Ie., something which streams in both ways. > > Can you make an example of this use case? PEP 333 allows the WSGI gateway to buffer the input if it chooses to do so: "The server or gateway may perform reads on-demand as requested by the application, or it may pre- read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference." 
- Brian From graham.dumpleton at gmail.com Sun Jan 27 23:36:03 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 28 Jan 2008 09:36:03 +1100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <000c01c86133$abca08f0$0501a8c0@T60> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <479C6BA5.4050401@libero.it> <88e286470801271412x3bc3a872t40301cc145c342e@mail.gmail.com> <479D0387.10903@libero.it> <000c01c86133$abca08f0$0501a8c0@T60> Message-ID: <88e286470801271436j7832ebdanff6f1ceadd7ef961@mail.gmail.com> On 28/01/2008, Brian Smith wrote: > Manlio Perillo wrote: > > Graham Dumpleton ha scritto: > > > On 27/01/2008, Manlio Perillo wrote: > > >> This is just a suggestion, but what about "requiring" that a WSGI > > >> implementation calls the WSGI application only when all > > >> the request body has been read? > > > > > > Can't do that. The input content could be dependent on partial > > > response content which has already been returned by the WSGI > > > application. Ie., something which streams in both ways. > > > > Can you make an example of this use case? > > PEP 333 allows the WSGI gateway to buffer the input if it chooses to do > so: "The server or gateway may perform reads on-demand as requested by > the application, or it may pre- read the client's request body and > buffer it in-memory or on disk, or use any other technique for providing > such an input stream, according to its preference." But doing that sort of defeats the purpose of using chunked input and would disallow use cases where delivery of the request content is delivered in distinct blocks over an extended time but must be consumed immediately and not when all is available. This is on top of use case always mentioned whereby future request content may be dependent in some way of response content already delivered back to the client. 
I know these are theoretical cases and people using these techniques may be few and far between, but didn't want to be putting limitations on things. Graham From graham.dumpleton at gmail.com Sun Jan 27 23:41:45 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 28 Jan 2008 09:41:45 +1100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <88e286470801271410y16baec57h41e7dd0116f7a3f3@mail.gmail.com> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <000001c8611d$678e1610$0501a8c0@T60> <88e286470801271410y16baec57h41e7dd0116f7a3f3@mail.gmail.com> Message-ID: <88e286470801271441x78b4508ay25c71e09fb8064f5@mail.gmail.com> On 28/01/2008, Graham Dumpleton wrote: > On 28/01/2008, Brian Smith wrote: > > Graham Dumpleton wrote: > > > On 26/01/2008, Brian Smith wrote: > > > As to your questions about read() with no argument, or with > > > traditional Python file like object default of -1, the only > > > WSGI server/adapter I know of where this will NOT work as one > > > would expect, ie., read remainder of request content, is the > > > CherryPy WSGI adapter. > > > > > > As far as I know it works fine with Apache CGI WSGI adapters, > > > Apache mod_wsgi, plus SCGI, FASTCGI and AJP adapters via > > > flup, as well as with paste WSGI server. Not sure what > > > wsgiref will do though. > > > > It doesn't work on mod_wsgi either. When I tried it, it only returned > > 8000 bytes of the input. That is why I started this thread in the first > > place, actually. If this isn't the behavior you expected, I will file a > > bug with a test case. (Google Code doesn't allow for attachments to bug > > reports too, maybe I will create my own "WSGI testcases" project on > > Google Code to store them all in SVN.) > > Whoops, you are right. 
> > Very very early development versions of mod_wsgi would read > everything, but for some reason I changed tack as was experimenting > with read() with no argument returning only what was actually > available at that time, possibly to see how input chunking could work. > Ie., simulating a non blocking read. I thought I had put it back to > read everything. Obviously I completely forgot what I was doing. :-( Okay, what I am going to do with Apache mod_wsgi is simply enforce the requirement that an argument be supplied to read(). After all, the WSGI PEP says there has to be an argument. It will be interesting to see whether any WSGI applications or commonly used modules then start failing. If some major modules start failing, it will at least show that not allowing the argument to be optional may have to be reviewed. :-) Graham From graham.dumpleton at gmail.com Sun Jan 27 23:59:03 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 28 Jan 2008 09:59:03 +1100 Subject: [Web-SIG] HEAD requests, WSGI gateways, and middleware In-Reply-To: <000401c85ebb$35385a40$0401a8c0@T60> References: <000401c85ebb$35385a40$0401a8c0@T60> Message-ID: <88e286470801271459w84b8b21yd9afd45310d31ac@mail.gmail.com> On 25/01/2008, Brian Smith wrote: > My application correctly responds to HEAD requests as-is. However, it doesn't work with middleware that sets headers based on the content of the response body. > > For example, a gateway or middleware that sets ETag based on an checksum, Content-Encoding, Content-Length and/or Content-MD5 will all result in wrong results by default. Right now, my applications assume that any such gateway or the first such middleware will change environ["REQUEST_METHOD"] from "HEAD" to "GET" before the application is invoked, and discard the response body that the application generates. > > However, many gateways and middleware do not do this, and PEP 333 doesn't have anything to say about it. 
As a result, a 100% WSGI 1.0-compliant application is not portable between gateways. > > I suggest that a revision of PEP 333 should require the following behavior: > > 1. WSGI gateways must always set environ["REQUEST_METHOD"] to "GET" for HEAD requests. Middleware and applications will not be able to detect the difference between GET and HEAD requests. > > 2. For a HEAD request, A WSGI gateway must not iterate through the response iterable, but it must call the response iterable's close() method, if any. It must not send any output that was written via start_response(...).write() either. Consequently, WSGI applications must work correctly, and must not leak resources, when their output is not iterated; an application should not signal or log an error if the iterable's close() method is invoked without any iteration taking place. For this discussion, which I see that there was no further followups, I see no choice but in Apache mod_wsgi to do number 1 above. It is the only way that one can guarantee that things will work properly due to the fact that Apache has its own output filtering system whereby output headers can be set based on the actual request content. If not done then the result of GET and HEAD may not be the same. As to number 2 (with later clarification), I will defer trying to do any optimisation by virtue of skipping processing of the iterable. This is in part because of the issue of whether a WSGI adapter is allowed to skip processing the iterable, but also because it gets a bit tricky in Apache mod_wsgi daemon mode as you need to pass across information from Apache child process to daemon process indicating whether there are any output filters registered in the Apache child process. Only knowing that could you skip processing the iterable in the daemon process and not generate any content. Overall I think the basic problem here is that in WSGI it likes to think it is the sole arbiter on what the response headers will be. 
In practice this may not be the case where one is bridging from a true web server which is capable of doing a lot of other stuff. For a WSGI adapter where this can occur, seems there isn't a choice for it to change all HEAD requests to GET requests. So, although I can fix Apache mod_wsgi so that HEAD works, this will not help with other Apache solutions such as CGI, SCGI, FASTCGI, AJP etc. For those the WSGI adapters used will have to be separately fixed to do a similar thing. Graham From manlio_perillo at libero.it Mon Jan 28 12:55:38 2008 From: manlio_perillo at libero.it (Manlio Perillo) Date: Mon, 28 Jan 2008 12:55:38 +0100 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <4798F8D7.1010002@colorstudy.com> References: <47989F1D.7080802@libero.it> <4798F8D7.1010002@colorstudy.com> Message-ID: <479DC2BA.5020905@libero.it> Ian Bicking ha scritto: > > [...] > >> 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a >> list. >> >> This means that the request uri recostruction must be changed: >> SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) > > I suppose you could leave stuff on PATH_INFO. But that doesn't seem to > fit with the idea of PATH_INFO. Also, will it be strictly > SCRIPT_NAME/consumed_path/PATH_INFO, or could it be > SCRIPT_NAME/consumed_path/some_other_parsing/consumed_path/PATH_INFO -- > after all, there's cases where stuff gets pushed from PATH_INFO to > SCRIPT_NAME, and if consumed_path is in between, which one do you push > stuff to? > What do you intend by some_other_parsing? >> 2) Store a wsgiorg.original_script_name, with the value seen by the >> routing application. > > I guess I usually do something like this, typically storing > myapp.base_path for use when I am generation application-absolute URLs > (like /logout). Then at the first chance (before running any kind of > routing) I do "environ['myapp.base_path'] = environ['SCRIPT_NAME']". > Thanks, this seems an easy solution. 
> This ad hoc technique works fine, but is very ad hoc. I'm not sure what > the best way to handle this is, really. I'm not sure there's a singular > root for an entire request, if you are nesting applications, so a single > key (wsgiorg.original_script_name) doesn't seem quite right. > Right, this is a problem: what "root" actually means. > I can't remember what Routes does for URL generation. Maybe it leaves > SCRIPT_NAME alone? I think so. > > Ian > Manlio Perillo From and-py at doxdesk.com Mon Jan 28 12:51:37 2008 From: and-py at doxdesk.com (Andrew Clover) Date: Mon, 28 Jan 2008 12:51:37 +0100 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <479C6BA5.4050401@libero.it> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com> <479C6BA5.4050401@libero.it> Message-ID: <479DC1C9.9070205@doxdesk.com> Manlio Perillo wrote: > what about "requiring" that a WSGI implementation calls the WSGI application > only when all the request body has been read? Regardless of the discussed technical issues, 'no thanks' - this would make it impossible to write - to choose an example from production code - a large-file-upload handler that allows upload progress to be checked during the process. -- And Clover mailto:and at doxdesk.com http://www.doxdesk.com/ From brian at briansmith.org Mon Jan 28 13:39:51 2008 From: brian at briansmith.org (Brian Smith) Date: Mon, 28 Jan 2008 04:39:51 -0800 Subject: [Web-SIG] environ["wsgi.input"].read() In-Reply-To: <479DC1C9.9070205@doxdesk.com> References: <00a601c85f63$9ae4dc30$0401a8c0@T60> <88e286470801262244wad51e75m6c4c874126c42d3a@mail.gmail.com><479C6BA5.4050401@libero.it> <479DC1C9.9070205@doxdesk.com> Message-ID: <003f01c861aa$e29661e0$0501a8c0@T60> Andrew Clover wrote: > Manlio Perillo wrote: > > what about "requiring" that a WSGI implementation calls the WSGI > > application only when all the request body has been read? 
> > Regardless of the discussed technical issues, 'no thanks' - > this would make it impossible to write - to choose an example > from production code > - a large-file-upload handler that allows upload progress to > be checked during the process. This is already impossible to do *portably* with WSGI, because PEP 333 already allows the gateway to cache the request body at its own discretion. The last time I checked, Lighttpd and nginx both cached the entire request body before handing it off to back end [Fast|S]CGI processes. Most web accelerator proxies work the same way. I do agree that it should not be *required* behavior, but it should be *allowed* behavior. - Brian From ianb at colorstudy.com Tue Jan 29 06:01:36 2008 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 28 Jan 2008 23:01:36 -0600 Subject: [Web-SIG] wsgiorg.routing_args and original SCRIPT_NAME In-Reply-To: <479DC2BA.5020905@libero.it> References: <47989F1D.7080802@libero.it> <4798F8D7.1010002@colorstudy.com> <479DC2BA.5020905@libero.it> Message-ID: <479EB330.7050400@colorstudy.com> Manlio Perillo wrote: > Ian Bicking ha scritto: > > >> [...] >> >>> 1) Do not change SCRIPT_NAME, and instead add a wsgiorg.consumed_path, a >>> list. >>> >>> This means that the request uri recostruction must be changed: >>> SCRIPT_NAME = SCRIPT_NAME + '/'.join(wsgiorg.consumed_path) >> >> I suppose you could leave stuff on PATH_INFO. But that doesn't seem >> to fit with the idea of PATH_INFO. Also, will it be strictly >> SCRIPT_NAME/consumed_path/PATH_INFO, or could it be >> SCRIPT_NAME/consumed_path/some_other_parsing/consumed_path/PATH_INFO >> -- after all, there's cases where stuff gets pushed from PATH_INFO to >> SCRIPT_NAME, and if consumed_path is in between, which one do you push >> stuff to? >> > > What do you intend by some_other_parsing? I have code that takes stuff from PATH_INFO and puts it on SCRIPT_NAME without updating routing_args. It could update routing_args... 
but I guess the question still remains: if there's multiple places where this kind of transformation is done, which one does SCRIPT_NAME point to?

Ian

From brian at briansmith.org Tue Jan 29 07:36:31 2008
From: brian at briansmith.org (Brian Smith)
Date: Mon, 28 Jan 2008 22:36:31 -0800
Subject: [Web-SIG] HTTP 1.1 Expect/Continue handling
Message-ID: <001b01c86241$4ae3d2a0$0501a8c0@T60>

1. The WSGI gateway must send the response headers immediately when the application yields its first non-empty string.

2. When there is an "100-continue" token in the request "Expect:" header, the WSGI gateway is allowed to delay sending the "100 Continue" response until the application reads from environ["wsgi.input"].

Consequently, if there is a 100-continue expectation, then a WSGI application must not read from wsgi.input after yielding its first non-empty string.

For example, the following application results in undefined (probably erroneous) behavior:

    def application(environ, start_response):
        start_response("400 Bad Request", [])
        yield "400 Bad Request"
        environ["wsgi.input"].read(1)

However, the following application must cause the WSGI gateway to send a 100-continue response:

    def application(environ, start_response):
        start_response("400 Bad Request", [])
        yield ""
        environ["wsgi.input"].read(1)

PEP says that "[s]ervers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism."

Should the application be able to detect whether there is a "100-continue" token in the Expect header of the request? Or, is the WSGI gateway allowed/required to hide the token?

If the application cannot reliably detect the 100-continue token, then this implies an ordering constraint between yielding output and reading input that is not mentioned anywhere in the PEP.
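The behaviour in point 2, where the gateway delays the interim response until the application first reads its input, could be sketched on the gateway side roughly like this. It is a minimal sketch under assumptions: the class name ContinueOnReadInput and the send_continue callback are hypothetical, not part of any real gateway's API, and bytes are used as in Python 3.

```python
import io


class ContinueOnReadInput:
    """Hypothetical wsgi.input wrapper: emit "100 Continue" on the first read."""

    def __init__(self, raw, send_continue):
        self.raw = raw
        # Callback that writes the interim response to the client; a gateway
        # would pass None when the request carried no 100-continue expectation.
        self.send_continue = send_continue

    def read(self, size):
        if self.send_continue is not None:
            self.send_continue()
            self.send_continue = None  # send the interim response only once
        return self.raw.read(size)


# Demonstration with an in-memory body and a list standing in for the socket.
sent = []
wrapped = ContinueOnReadInput(
    io.BytesIO(b"payload"),
    lambda: sent.append(b"HTTP/1.1 100 Continue\r\n\r\n"),
)
```

With a wrapper like this, nothing is sent until the application calls read(), which is why a read that happens only after the final response headers have gone out leaves the gateway with no valid moment to send the interim response.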
Another consequence is that an application cannot explicitly respond with a "100 Continue" itself, like this:

    def application(environ, start_response):
        start_response("100 Continue", [])
        yield ""
        start_response("200 OK", [])
        yield "OK"

The reason is that start_response cannot be called twice except when an exception is detected, and also the "100 Continue" would not be sent until right before the "200 OK" was sent anyway.

Regards,
Brian

From foom at fuhm.net Tue Jan 29 08:53:02 2008
From: foom at fuhm.net (James Y Knight)
Date: Tue, 29 Jan 2008 02:53:02 -0500
Subject: [Web-SIG] HTTP 1.1 Expect/Continue handling
In-Reply-To: <001b01c86241$4ae3d2a0$0501a8c0@T60>
References: <001b01c86241$4ae3d2a0$0501a8c0@T60>
Message-ID: <86902310-08CF-4DAA-B5DD-7DCBF5ED5CEA@fuhm.net>

On Jan 29, 2008, at 1:36 AM, Brian Smith wrote:
> 1. The WSGI gateway must send the response headers immediately when
> the application yields its first non-empty string.
>
> 2. When there is an "100-continue" token in the request "Expect:"
> header, the WSGI gateway is allowed to delay sending the "100
> Continue" response until the application reads from
> environ["wsgi.input"].
>
> Consequently, if there is a 100-continue expectation, then a WSGI
> application must not read from wsgi.input after yielding its first
> non-empty string.
>
> For example, the following application results in undefined
> (probably erroneous) behavior:
>
>     def application(environ, start_response):
>         start_response("400 Bad Request", [])
>         yield "400 Bad Request"
>         environ["wsgi.input"].read(1)

Agreed, this is ambiguous in the WSGI specs. However, there is a mitigating factor:

The above example should not cause misbehavior when talking to well-designed clients. Clients are basically required to always send the request body, whether or not a 100-continue arrives, unless the connection gets closed, in order to work with older and misdesigned servers.
They may delay a bit, to see if the server will close the connection, but otherwise ought to start sending the request body in any case. However, this omission in the WSGI spec does allow for violation of the HTTP RFC: > Upon receiving a request which includes an Expect request-header > field with the ?100-continue? expectation, an origin server MUST > either respond with 100 (Continue) status and continue to read from > the input stream, or respond with a final status code. The origin > server MUST NOT wait for the request body before sending the 100 > (Continue) response. If it responds with a final status code, it MAY > close the transport connection or it MAY continue to read and > discard the rest of the request. It MUST NOT perform the requested > method if it returns a final status code. If you changed your example to start_response("200 OK", []), that would violate the "MUST NOT perform the requested method" clause. I see three ways to resolve this: a) One is to clarify this as a requirement upon the WSGI gateway. Something like the following: "If the client requests Expect: 100-continue, and the application yields data before reading from the input, and the response code is a success (2xx) code, then the gateway MUST send a 100 continue response, before writing any other response headers in order to comply with RFC 2616 ?8.2.3 and to allow the WSGI application to read from the input stream later on in request processing". This should handle most real-world cases. Now, only sending 100 when the response code is 2xx may be potentially a bit fragile, and won't help e.g. your dummy app above. (maybe some real app really did want the input data even for an error response too?). But, on the other hand, you really *don't* want to force the transmission of a 100 continue when the server is sending e.g. a "400 Bad Request" response and likely won't ever read input data. 
b) Alternatively, the WSGI gateway could raise an exception when you attempt to respond with a success code without having read the input. This also satisfies RFC2616's prohibition against a successful execution of the request without a 100 continue response, but seems to me more likely to break things than help them, so I'd say (a) is strictly better. c) Another option is to clarify this as a requirement for a WSGI application: "An application must not read from wsgi.input after yielding its first non-empty string unless it has already read from wsgi.input before having yielded its first non-empty string. (environ["wsgi.input"].read(0) may be used to indicate the desire to read the input in the future and satisfy this requirement, without actually reading any data.)" The way I see it, (a) is not a change in the spec, but just a clarification. The combination of the current spec and HTTP RFC imply that you should do that already, in order to not violate 2616 (although it's quite likely nobody actually is, not having realized the requirement). (b) on the other hand, is truly a change in the spec, but is a bit theoretically cleaner. > Should the application be able to detect whether there is a "100- > continue" token in the Expect header of the request? No. > Or, is the WSGI gateway allowed/required to hide the token? Allowed. > Another consequence is that an application cannot explicitly respond > with a "100 Continue" itself, like this: > > def application(environ, start_response): > start_response("100 Continue", []) > yield "" > start_response("200 OK", []) > yield "OK" > > The reasons is that start_response cannot be called twice except > when an exception is detected, and also the "100 Continue" would not > be sent until right before the "200 OK" was sent anyway. That's not really a consequence of the above discussion, but, yes, that's true. 
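As an illustration of the gateway behaviour discussed above (the gateway delays "100 Continue" until the application first reads from wsgi.input, and a zero-length read is enough to trigger it), here is a minimal sketch. The class name ContinueOnReadInput, the send_interim callback and the mark_final_headers_sent method are hypothetical names invented for this example; they are not part of WSGI or of any server mentioned in this thread, and the sketch uses Python 3 bytes rather than the Python 2 strings of the original examples.

```python
import io


class ContinueOnReadInput:
    """Hypothetical wsgi.input wrapper that sends "100 Continue" lazily.

    `send_interim` is an assumed callback that writes raw bytes to the
    client connection; a real gateway would supply its own mechanism.
    """

    def __init__(self, stream, send_interim, expects_continue):
        self._stream = stream
        self._send_interim = send_interim
        # True while the client is still waiting for "100 Continue".
        self._pending = expects_continue

    def _acknowledge(self):
        # Send the interim response at most once, on the first read.
        if self._pending:
            self._pending = False
            self._send_interim(b"HTTP/1.1 100 Continue\r\n\r\n")

    def read(self, size):
        # Even read(0) passes through here, so a zero-length read
        # is not optimised away and still triggers the interim response.
        self._acknowledge()
        return self._stream.read(size)

    def mark_final_headers_sent(self):
        # Once the final status line has gone out, an interim
        # response can no longer be sent on this request.
        self._pending = False
```

With this shape, an application that calls read(0) before yielding its first non-empty string causes the interim response to go out without consuming any of the request body.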
James From graham.dumpleton at gmail.com Wed Jan 30 00:01:38 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Wed, 30 Jan 2008 10:01:38 +1100 Subject: [Web-SIG] HTTP 1.1 Expect/Continue handling In-Reply-To: <86902310-08CF-4DAA-B5DD-7DCBF5ED5CEA@fuhm.net> References: <001b01c86241$4ae3d2a0$0501a8c0@T60> <86902310-08CF-4DAA-B5DD-7DCBF5ED5CEA@fuhm.net> Message-ID: <88e286470801291501w664dc3b4j133b6376485ffde7@mail.gmail.com> On 29/01/2008, James Y Knight wrote: > > On Jan 29, 2008, at 1:36 AM, Brian Smith wrote: > > > 1. The WSGI gateway must send the response headers immediately when > > the application yields its first non-empty string. > > > > 2. When there is an "100-continue" token in the request "Expect:" > > header, the WSGI gateway is allowed to delay sending the "100 > > Continue" response until the application reads from > > environ["wsgi.input"]. > > > > Consequently, if there is a 100-continue expectation, then a WSGI > > application must not read from wsgi.input after yielding its first > > non-empty string. > > > > For example, the following application results in undefined > > (probably erroneous) behavior: > > > > def application(environ, start_response): > > start_response("400 Bad Request", []) > > yield "400 Bad Request" > > environ["wsgi.input"].read(1) > > Agreed, this is ambiguous in the WSGI specs. However, there is a > mitigating factor: > > The above example should not cause misbehavior when talking to well- > designed clients. Clients are basically required to always send the > request body, whether or not a 100-continue arrives, unless the > connection gets closed, in order to work with older and misdesigned > servers. They may delay a bit, to see if the server will close the > connection, but otherwise ought to start sending the request body in > any case. 
> > However, this omission in the WSGI spec does allow for violation of > the HTTP RFC: > > Upon receiving a request which includes an Expect request-header > > field with the "100-continue" expectation, an origin server MUST > > either respond with 100 (Continue) status and continue to read from > > the input stream, or respond with a final status code. The origin > > server MUST NOT wait for the request body before sending the 100 > > (Continue) response. If it responds with a final status code, it MAY > > close the transport connection or it MAY continue to read and > > discard the rest of the request. It MUST NOT perform the requested > > method if it returns a final status code. > > If you changed your example to start_response("200 OK", []), that > would violate the "MUST NOT perform the requested method" clause. > > I see three ways to resolve this: > > a) One is to clarify this as a requirement upon the WSGI gateway. > Something like the following: > "If the client requests Expect: 100-continue, and the application > yields data before reading from the input, and the response code is a > success (2xx) code, then the gateway MUST send a 100 continue > response, before writing any other response headers in order to comply > with RFC 2616 §8.2.3 and to allow the WSGI application to read from > the input stream later on in request processing". > > This should handle most real-world cases. Now, only sending 100 when > the response code is 2xx may be potentially a bit fragile, and won't > help e.g. your dummy app above. (maybe some real app really did want > the input data even for an error response too?). But, on the other > hand, you really *don't* want to force the transmission of a 100 > continue when the server is sending e.g. a "400 Bad Request" response > and likely won't ever read input data. > > b) Alternatively, the WSGI gateway could raise an exception when you > attempt to respond with a success code without having read the input.
> This also satisfies RFC2616's prohibition against a successful > execution of the request without a 100 continue response, but seems to > me more likely to break things than help them, so I'd say (a) is > strictly better. > > c) Another option is to clarify this as a requirement for a WSGI > application: "An application must not read from wsgi.input after > yielding its first non-empty string unless it has already read from > wsgi.input before having yielded its first non-empty string. > (environ["wsgi.input"].read(0) may be used to indicate the desire to > read the input in the future and satisfy this requirement, without > actually reading any data.)" A clarification in the specification may be required to the extent of saying that where a zero length read is done, that no WSGI middleware which wraps wsgi.input, nor even the WSGI adapter itself may optimise it away. In other words a zero length read must always be passed through unless specifically not appropriate for what the WSGI middleware is doing. This would be required to ensure that zero length read always propagates down to the web server layer itself such that it may trigger the 100-continue. This requirement would probably exist independent of (c) being used as a solution. Graham > The way I see it, (a) is not a change in the spec, but just a > clarification. The combination of the current spec and HTTP RFC imply > that you should do that already, in order to not violate 2616 > (although it's quite likely nobody actually is, not having realized > the requirement). (b) on the other hand, is truly a change in the > spec, but is a bit theoretically cleaner. > > > Should the application be able to detect whether there is a "100- > > continue" token in the Expect header of the request? > > No. > > > Or, is the WSGI gateway allowed/required to hide the token? > > Allowed. 
> > > Another consequence is that an application cannot explicitly respond > > with a "100 Continue" itself, like this: > > > > def application(environ, start_response): > > start_response("100 Continue", []) > > yield "" > > start_response("200 OK", []) > > yield "OK" > > > > The reasons is that start_response cannot be called twice except > > when an exception is detected, and also the "100 Continue" would not > > be sent until right before the "200 OK" was sent anyway. > > That's not really a consequence of the above discussion, but, yes, > that's true. > > James > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com > From graham.dumpleton at gmail.com Wed Jan 30 11:08:21 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Wed, 30 Jan 2008 21:08:21 +1100 Subject: [Web-SIG] Prototype of wsgi.input.readline(). Message-ID: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> As I think we all know, no one implements readline() for wsgi.input as defined in the WSGI specification. The reason for this is that stuff like cgi.FieldStorage would refuse to work and would just generate an exception. This is because cgi.FieldStorage expects to pass an argument to readline(). So, although this is linked in the issues list for possible amendments to WSGI specification, there hasn't that I recall been a discussion on how readline() would be defined in any amendment or future version. In particular, would the specification be changed to either: 1. readline(size) where size argument is mandatory, or: 2. readline(size=-1) where size argument is optional. If the size argument is made mandatory, then it would parallel how read() function is defined, but this in itself would mean cgi.FieldStorage would break. 
This is because cgi.FieldStorage actually calls readline() with no argument as well as an argument in different places in the code. If we allow the argument to be optional however, we run into the same portability problems that would exist with some WSGI adapters which do not simulate EOF on input when all request content is read. Specifically, if user code calls readline() with no argument but the last line of the file wasn't terminated with an EOL, then it would hang. As it is, cgi.FieldStorage only works on systems which do not simulate EOF because the content format it is decoding has its own concept of end of stream marker and cgi.FieldStorage implementation specifically looks for that. The cgi.FieldStorage implementation certainly doesn't track how much input it has read in and progressively change the size argument to readline() on that basis. Any other code which uses readline() with no argument would similarly have to depend on some concept of an end of stream marker in the content, because one can't rely on getting an empty string when input is exhausted. In some respects this highlights the inconsistency of the read() argument not being optional. This is because one of the reasons for not allowing read() argument to be optional is that it would be problematical for implementations that do not simulate EOF, yet the same issue exists with readline() and an optional argument has to be allowed for that because of how cgi.FieldStorage is implemented. Graham From chrism at plope.com Wed Jan 30 17:24:50 2008 From: chrism at plope.com (Chris McDonough) Date: Wed, 30 Jan 2008 11:24:50 -0500 Subject: [Web-SIG] Prototype of wsgi.input.readline(). In-Reply-To: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> References: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> Message-ID: <47A0A4D2.3050207@plope.com> Graham Dumpleton wrote: > As I think we all know, no one implements readline() for wsgi.input as > defined in the WSGI specification.
The reason for this is that stuff > like cgi.FieldStorage would refuse to work and would just generate an > exception. This is because cgi.FieldStorage expects to pass an > argument to readline(). I haven't been keeping up on the issues this has caused wrt WSGI, but note that the reason that cgi.FieldStorage passes a size argument to readline is in order to prevent memory exhaustion when reading files that don't have any linebreaks (denial of service). See http://bugs.python.org/issue1112549 . > > So, although this is linked in the issues list for possible amendments > to WSGI specification, there hasn't that I recall been a discussion on > how readline() would be defined in any amendment or future version. > > In particular, would the specification be changed to either: > > 1. readline(size) where size argument is mandatory, or: > > 2. readline(size=-1) where size argument is optional. > > If the size argument is made mandatory, then it would parallel how > read() function is defined, but this in itself would mean > cgi.FieldStorage would break. > > This is because cgi.FieldStorage actually calls readline() with no > argument as well as an argument in different places in the code. cgi.FieldStorage doesn't call readline() without an argument. cgi.parse_multipart does, but this function is not used by cgi.FieldStorage. I don't know if this changes anything. - C From brian at briansmith.org Thu Jan 31 03:12:40 2008 From: brian at briansmith.org (Brian Smith) Date: Wed, 30 Jan 2008 18:12:40 -0800 Subject: [Web-SIG] HTTP 1.1 Expect/Continue handling In-Reply-To: <88e286470801291501w664dc3b4j133b6376485ffde7@mail.gmail.com> References: <001b01c86241$4ae3d2a0$0501a8c0@T60> <86902310-08CF-4DAA-B5DD-7DCBF5ED5CEA@fuhm.net> <88e286470801291501w664dc3b4j133b6376485ffde7@mail.gmail.com> Message-ID: <003701c863ae$c3198cb0$0501a8c0@T60> Graham Dumpleton wrote: > On 29/01/2008, James Y Knight wrote: > a) One is to clarify this as a requirement upon the WSGI gateway. 
> > Something like the following: > > "If the client requests Expect: 100-continue, and the application > > yields data before reading from the input, and the response > > code is a success (2xx) code, then the gateway MUST send a > > 100 continue response, before writing any other response headers > > in order to comply with RFC 2616 §8.2.3 and to allow the WSGI > > application to read from the input stream later on in request > > processing". This requirement goes too far. I think the part of the specification that says the server must not perform the requested operation is over-reaching. It fails to consider the case where the server can successfully perform the operation without reading the request body. For example, consider a TOUCH method that updates the ETag and Last-Modified date of a resource. Or, a DELETE (a DELETE request shouldn't have a request body, but should the server really be required to check for one and refuse to delete the resource if it finds one?). The WSGI gateway MAY send a 100 continue response in this situation, but it shouldn't be required to. If the application wants the stricter semantics then it should be coded to handle it. > > This should handle most real-world cases. Now, only sending > > 100 when the response code is 2xx may be potentially a bit > > fragile, and won't help e.g. your dummy app above. > > (maybe some real app really did want the input data even > > for an error response too?). But, on the other hand, you > > really *don't* want to force the transmission of a 100 > > continue when the server is sending e.g. a "400 Bad > > Request" response and likely won't ever read input data. Exactly, if you always send 100 continue then you defeat the purpose of it entirely. I would like to see the specification revised so that it is obvious that my example program is invalid when an "Expect: 100-continue" request header is present.
> > b) Alternatively, the WSGI gateway could raise an exception > > when you attempt to respond with a success code without having > > read the input. For the same reasons I mentioned above, this is too strict. > > c) Another option is to clarify this as a requirement for a WSGI > > application: "An application must not read from wsgi.input after > > yielding its first non-empty string unless it has already read from > > wsgi.input before having yielded its first non-empty string. This is the requirement that I want to see. But, I prefer to have it qualified with "when environ['HTTP_EXPECT'] contains the '100-continue' token". > > (environ["wsgi.input"].read(0) may be used to indicate the > > desire to read the input in the future and satisfy this > > requirement, without actually reading any data.)" Nice in theory, but if the specification is going to change to support this, I would rather see the specification change to allow the application to generate its own "100 continue" response. > A clarification in the specification may be required to the > extent of saying that where a zero length read is done, that > no WSGI middleware which wraps wsgi.input, nor even the WSGI > adapter itself may optimise it away. In other words a zero > length read must always be passed through unless specifically > not appropriate for what the WSGI middleware is doing. > > This would be required to ensure that zero length read always > propagates down to the web server layer itself such that it > may trigger the 100-continue. The statement "An application must not read from wsgi.input after..." would already apply to middleware, because middleware are applications. If the middleware causes no response data to be read, it should not be required to cause a "100 continue" to be sent. 
- Brian From brian at briansmith.org Thu Jan 31 03:26:28 2008 From: brian at briansmith.org (Brian Smith) Date: Wed, 30 Jan 2008 18:26:28 -0800 Subject: [Web-SIG] Prohibiting reading from wsgi.input in an application iterable's close method Message-ID: <003801c863b0$b5d9c180$0501a8c0@T60> I would like to see the following requirement added to the WSGI specification: An application may only call methods on environ["wsgi.input"] before it returns its response iterable, or from within an execution of its iterable's next() method. In particular, the application iterable's close() method MUST NOT read from wsgi.input. Thoughts? - Brian From graham.dumpleton at gmail.com Thu Jan 31 03:42:11 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 31 Jan 2008 13:42:11 +1100 Subject: [Web-SIG] Prototype of wsgi.input.readline(). In-Reply-To: <47A0A4D2.3050207@plope.com> References: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> <47A0A4D2.3050207@plope.com> Message-ID: <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com> On 31/01/2008, Chris McDonough wrote: > Graham Dumpleton wrote: > > As I think we all know, no one implements readline() for wsgi.input as > > defined in the WSGI specification. The reason for this is that stuff > > like cgi.FieldStorage would refuse to work and would just generate an > > exception. This is because cgi.FieldStorage expects to pass an > > argument to readline(). > > I haven't been keeping up on the issues this has caused wrt WSGI, but note that > the reason that cgi.FieldStorage passes a size argument to readline is in order > to prevent memory exhaustion when reading files that don't have any linebreaks > (denial of service). See http://bugs.python.org/issue1112549 .
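To make the readline() concern concrete, here is a minimal sketch of an input wrapper that simulates EOF at the request Content-Length and accepts an optional size argument for both read() and readline(). The class name LimitedInput is hypothetical and is not taken from any WSGI adapter discussed in this thread; the sketch only shows that, once the remaining byte count is tracked, a readline() with no argument cannot run past the end of the request body.

```python
import io


class LimitedInput:
    """Hypothetical wsgi.input wrapper that simulates EOF at Content-Length."""

    def __init__(self, stream, content_length):
        self._stream = stream
        self._remaining = content_length

    def _clamp(self, size):
        # Treat a missing or negative size as "everything that is left",
        # and never allow a read beyond the declared Content-Length.
        if size is None or size < 0 or size > self._remaining:
            return self._remaining
        return size

    def read(self, size=-1):
        data = self._stream.read(self._clamp(size))
        self._remaining -= len(data)
        return data

    def readline(self, size=-1):
        # Stops at a newline, at `size` bytes, or at the simulated EOF,
        # whichever comes first.
        data = self._stream.readline(self._clamp(size))
        self._remaining -= len(data)
        return data
```

With such a wrapper, a body that contains no newline at all is returned by a single readline() call and the next call returns an empty string, instead of blocking on the connection or consuming the start of the next request.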
The interesting comment in that bug is: """The input data is not required by the RFC 822/1521/1522/1867 specifications to contain any newline characters.""" If that can occur, then a WSGI adapter which didn't simulate EOF would fail in that the read would block and never return. All the more reason that simulating EOF needs to be mandatory. > > So, although this is linked in the issues list for possible amendments > > to WSGI specification, there hasn't that I recall been a discussion on > > how readline() would be defined in any amendment or future version. > > > > In particular, would the specification be changed to either: > > > > 1. readline(size) where size argument is mandatory, or: > > > > 2. readline(size=-1) where size argument is optional. > > > > If the size argument is made mandatory, then it would parallel how > > read() function is defined, but this in itself would mean > > cgi.FieldStorage would break. > > > > This is because cgi.FieldStorage actually calls readline() with no > > argument as well as an argument in different places in the code. > > cgi.FieldStorage doesn't call readline() without an argument. > cgi.parse_multipart does, but this function is not used by cgi.FieldStorage. I > don't know if this changes anything. Not really, I should have said 'cgi' module as a whole rather than specifically cgi.FieldStorage. Given that people might be using cgi.parse_multipart in standard CGI, there would probably still be an expectation that it worked for WSGI. We can't really say that you can use cgi.FieldStorage but not cgi.parse_multipart. People will just expect all the normal tools people would use for this to work. Graham From chrism at plope.com Thu Jan 31 03:56:59 2008 From: chrism at plope.com (Chris McDonough) Date: Wed, 30 Jan 2008 21:56:59 -0500 Subject: [Web-SIG] Prototype of wsgi.input.readline(). 
In-Reply-To: <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com> References: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> <47A0A4D2.3050207@plope.com> <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com> Message-ID: <47A138FB.7010103@plope.com> Graham Dumpleton wrote: > >>> >>> If the size argument is made mandatory, then it would parallel how >>> read() function is defined, but this in itself would mean >>> cgi.FieldStorage would break. >>> >>> This is because cgi.FieldStorage actually calls readline() with no >>> argument as well as an argument in different places in the code. >> cgi.FieldStorage doesn't call readline() without an argument. >> cgi.parse_multipart does, but this function is not used by cgi.FieldStorage. I >> don't know if this changes anything. > > Not really, I should have said 'cgi' module as a whole rather than > specifically cgi.FieldStorage. Given that people might be using > cgi.parse_multipart in standard CGI, there would probably still be an > expectation that it worked for WSGI. We can't really say that you can > use cgi.FieldStorage but not cgi.parse_multipart. People will just > expect all the normal tools people would use for this to work. Personally, I think parse_multipart should go away. It's not suitable for anything but toy usage. If people use it, and they expose their site to the world, arbitrary anonymous visitors can cause their Python's process size to grow arbitrarily. I don't think any existing well-known framework uses it, for this very reason. If it can't go away, and there's a problem due to the non-parity between parse_multipart's use and FieldStorage's use, I suspect the right answer is to change cgi.parse_multipart to pass in a size value for readline too. I probably should have done that when I made the patch.
:-( - C From graham.dumpleton at gmail.com Thu Jan 31 04:30:22 2008 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Thu, 31 Jan 2008 14:30:22 +1100 Subject: [Web-SIG] Prototype of wsgi.input.readline(). In-Reply-To: <47A138FB.7010103@plope.com> References: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> <47A0A4D2.3050207@plope.com> <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com> <47A138FB.7010103@plope.com> Message-ID: <88e286470801301930s2b0fbeb2ld2be580284801090@mail.gmail.com> On 31/01/2008, Chris McDonough wrote: > Graham Dumpleton wrote: > > > >>> > >>> If the size argument is made mandatory, then it would parallel how > >>> read() function is defined, but this in itself would mean > >>> cgi.FieldStorage would break. > >>> > >>> This is because cgi.FieldStorage actually calls readline() with no > >>> argument as well as an argument in different places in the code. > >> cgi.FieldStorage doesn't call readline() without an argument. > >> cgi.parse_multipart does, but this function is not used by cgi.FieldStorage. I > >> don't know if this changes anything. > > > > Not really, I should have said 'cgi' module as a whole rather than > > specifically cgi.FieldStorage. Given that people might be using > > cgi.parse_multipart in standard CGI, there would probably still be an > > expectation that it worked for WSGI. We can't really say that you can > > use cgi.FieldStorage but not cgi.parse_multipart. People will just > > expect all the normal tools people would use for this to work. > > Personally, I think parse_multipart should go away. It's not suitable for > anything but toy usage. Not necessarily. Someone may see it as a trade off. 
The code itself says: """This is easy to use but not much good if you are expecting megabytes to be uploaded -- in that case, use the FieldStorage class instead which is much more flexible.""" So the comment implies it is easier to use and so some may think it is simpler for what they are doing if they are only dealing with small requests. Of course, it would probably be prudent if you know your requests are always going to be small to use LimitRequestBody in Apache, or a specific check on content length if handled in Python code, to block someone sending oversized requests intentionally to try and break things. Provided you did this, it may be quite reasonable to use it in specific circumstances. > If people use it, and they expose their site to the world, arbitrary anonymous > visitors can cause their Python's process size to grow to arbitrarily. I don't > think any existing well-known framework uses it, for this very reason. > > If it can't go away, and there's a problem due to the non-parity between > parse_multipart's use and FieldStorage's use, I suspect the right answer is to > change cgi.parse_multipart to pass in a size value for readline too. I probably > should have done that when I made the patch. :-( Graham From chrism at plope.com Thu Jan 31 04:38:22 2008 From: chrism at plope.com (Chris McDonough) Date: Wed, 30 Jan 2008 22:38:22 -0500 Subject: [Web-SIG] Prototype of wsgi.input.readline().
In-Reply-To: <88e286470801301930s2b0fbeb2ld2be580284801090@mail.gmail.com> References: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com> <47A0A4D2.3050207@plope.com> <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com> <47A138FB.7010103@plope.com> <88e286470801301930s2b0fbeb2ld2be580284801090@mail.gmail.com> Message-ID: <47A142AE.8070605@plope.com> Graham Dumpleton wrote: > On 31/01/2008, Chris McDonough wrote: >> Graham Dumpleton wrote: >>>>> If the size argument is made mandatory, then it would parallel how >>>>> read() function is defined, but this in itself would mean >>>>> cgi.FieldStorage would break. >>>>> >>>>> This is because cgi.FieldStorage actually calls readline() with no >>>>> argument as well as an argument in different places in the code. >>>> cgi.FieldStorage doesn't call readline() without an argument. >>>> cgi.parse_multipart does, but this function is not used by cgi.FieldStorage. I >>>> don't know if this changes anything. >>> Not really, I should have said 'cgi' module as a whole rather than >>> specifically cgi.FieldStorage. Given that people might be using >>> cgi.parse_multipart in standard CGI, there would probably still be an >>> expectation that it worked for WSGI. We can't really say that you can >>> use cgi.FieldStorage but not cgi.parse_multipart. People will just >>> expect all the normal tools people would use for this to work. >> Personally, I think parse_multipart should go away. It's not suitable for >> anything but toy usage. > > Not necessarily. Someone may see it as a trade off. The code itself says: > > """This is easy to use but not > much good if you are expecting megabytes to be uploaded -- in that case, > use the FieldStorage class instead which is much more flexible.""" > > So comment implies it is easier to use and so some may think it is > simpler for what they are doing if they are only dealing with small > requests. 
> > Of course, it would probably be prudent if you know your requests are > always going to be small to use LimitRequestBody in Apache, or a > specific check on content length if handled in Python code, to block > someone sending over sized requests intentionally to try and break > things. Provided you did this, may be quite reasonable to use it in > specific circumstances. Indeed. But then again, I doubt the casual user would be able to make this judgment and take the necessary precautions. This kind of user is likely the same class of user for whom cgi.FieldStorage is "too hard" (which it really isn't). - C From brian at briansmith.org Thu Jan 31 05:10:56 2008 From: brian at briansmith.org (Brian Smith) Date: Wed, 30 Jan 2008 20:10:56 -0800 Subject: [Web-SIG] Reading of input after headers sent and 100-continue. In-Reply-To: <88e286470801301922n162b955at431b0354c1a30597@mail.gmail.com> References: <88e286470801291628g6460053ejeed734de499784e9@mail.gmail.com> <003901c863b4$1111ee30$0501a8c0@T60> <88e286470801301922n162b955at431b0354c1a30597@mail.gmail.com> Message-ID: <003b01c863bf$47eff270$0501a8c0@T60> Graham Dumpleton wrote: > Effectively, if a 200 response came back, it seems to suggest > that the client still should send the request body, just that > it 'SHOULD NOT wait for an indefinite period'. It doesn't say > explicitly for the client that it shouldn't still send the > request body if another response code comes back. This behavior is to support servers that don't understand the Expect: header. Basically, if the server responds with a 100, the client must send the request body. If the server responds with a 4xx or 5xx, the client must not send the request body. If the server responds with a 2xx or a 3xx, then the client should send (the rest of) the request body, on the assumption that the server doesn't understand "Expect:". To be completely compliant, a server should always respond with a 100 in front of a 2xx or 3xx, I guess.
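The server-side rule just stated (send a 100 in front of any 2xx or 3xx when the client asked for it, but not in front of a 4xx or 5xx) can be sketched as a small predicate that a gateway might consult before writing the final status line. The function name and its parameters are hypothetical, chosen for this example only; the sketch encodes the reading of RFC 2616 given in this message and is not a statement of what any particular server does.

```python
def should_send_continue_first(status_line, expect_header, continue_already_sent):
    """Return True if "100 Continue" should precede the final response.

    status_line is the WSGI status string, e.g. "200 OK".
    expect_header is the raw value of the request Expect header ("" if absent).
    """
    if continue_already_sent:
        return False

    # The Expect header is a comma-separated list of case-insensitive tokens.
    tokens = [token.strip().lower() for token in expect_header.split(",")]
    if "100-continue" not in tokens:
        return False

    # Only success and redirect responses get the interim response;
    # for 4xx and 5xx the client is expected not to send the body.
    code = int(status_line.split(None, 1)[0])
    return 200 <= code < 400
```

For example, a "200 OK" response to a request carrying "Expect: 100-continue" yields True, while a "400 Bad Request" response to the same request yields False.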
Thanks for clarifying that for me. I guess the rules make sense after all. > So technically, if the client has to still send the request > content, something could still read it. It would not be ideal > that there is a delay depending on what the client does, but > would still be possible from what I read of this section. You are right. To avoid confusion, you should probably force mod_wsgi to send a 100-continue in front of any 2xx or 3xx response. > It MUST NOT perform the requested method if it returns a final status code. The implication is that the only time it will avoid sending a 100 is when it is sending a 4xx, and it should never perform the requested method if it already said the method failed. The only excuse for not sending a 100 is that you don't know about "Expect: 100-continue". But, that can't be true if you are reading this part of the spec! > """If it responds with a final status > code, it MAY close the transport connection or it MAY continue > to read and discard the rest of the request.""" If the client receives a 2xx or 3xx without a 100 first, it has to send the request body (well, depending on which 3xx it is, that is not true). But, the server doesn't have to read it! But, again, the assumption is that the server will only send a response without a 100 if it is a 4xx or 5xx. > It seems by what you are saying that if 100-continue is > present this wouldn't be allowed, and that to ensure correct > behaviour the handler would have to read at least some of the > request body before sending back the response headers. You are right, I was wrong. > > Since ap_http_filter is an input filter only, it should be > enough to > > just avoid reading from the input brigade. (AFAICT, anyway.) > > In other words block the handler from reading, potentially > raise an error in the process. Except to be fair and > consistent, you would have to apply the same rule even if > 100-continue isn't present. 
> Whether that would break some
> existing code in doing that is the concern I have, even if it
> is some simple test program that just echoes back the request
> body as the response body.

Technically, even if the server returns a 4xx, it can still read the
request body, but it might not get anything or it might only get part of
it. I guess the change to the WSGI spec that is needed is to say that
the gateway must not send the "100 continue" if it has already sent some
headers, and that it should send a "100 continue" before any 2xx or 3xx
code, which is basically what James Knight suggested (sorry James). The
gateway must indicate EOF if only a partial request body was received. I
don't think the gateway should be required to provide any of the partial
request content on a 4xx, though.

- Brian

From graham.dumpleton at gmail.com Thu Jan 31 05:33:19 2008
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Thu, 31 Jan 2008 15:33:19 +1100
Subject: [Web-SIG] Reading of input after headers sent and 100-continue.
In-Reply-To: <003b01c863bf$47eff270$0501a8c0@T60>
References: <88e286470801291628g6460053ejeed734de499784e9@mail.gmail.com>
 <003901c863b4$1111ee30$0501a8c0@T60>
 <88e286470801301922n162b955at431b0354c1a30597@mail.gmail.com>
 <003b01c863bf$47eff270$0501a8c0@T60>
Message-ID: <88e286470801302033u12542cafq2551ad164dc89713@mail.gmail.com>

For those on the Python web sig who might be thinking they missed part
of the conversation, you have. This is the second half of a conversation
started on the Apache modules-dev list about Apache 100-continue
processing.

If interested, you can see the first half of the conversation at:

  http://mail-archives.apache.org/mod_mbox/httpd-modules-dev/200801.mbox/browser

Graham

On 31/01/2008, Brian Smith wrote:
> Graham Dumpleton wrote:
> > Effectively, if a 200 response came back, it seems to suggest
> > that the client still should send the request body, just that
> > it 'SHOULD NOT wait for an indefinite period'.
> > It doesn't say
> > explicitly for the client that it shouldn't still send the
> > request body if another response code comes back.
>
> This behavior is to support servers that don't understand the Expect:
> header.
>
> Basically, if the server responds with a 100, the client must send the
> request body. If the server responds with a 4xx or 5xx, the client must
> not send the request body. If the server responds with a 2xx or a 3xx,
> then the client should send (the rest of) the request body, on the
> assumption that the server doesn't understand "Expect:". To be
> completely compliant, a server should always respond with a 100 in front
> of a 2xx or 3xx, I guess. Thanks for clarifying that for me. I guess the
> rules make sense after all.
>
> > So technically, if the client has to still send the request
> > content, something could still read it. It would not be ideal
> > that there is a delay depending on what the client does, but
> > would still be possible from what I read of this section.
>
> You are right. To avoid confusion, you should probably force mod_wsgi to
> send a 100-continue in front of any 2xx or 3xx response.
>
> > It MUST NOT perform the requested method if it returns a final status
> > code.
>
> The implication is that the only time it will avoid sending a 100 is
> when it is sending a 4xx, and it should never perform the requested
> method if it already said the method failed. The only excuse for not
> sending a 100 is that you don't know about "Expect: 100-continue". But,
> that can't be true if you are reading this part of the spec!
>
> > """If it responds with a final status
> > code, it MAY close the transport connection or it MAY continue
> > to read and discard the rest of the request."""
>
> If the client receives a 2xx or 3xx without a 100 first, it has to send
> the request body (well, depending on which 3xx it is, that is not true).
> But, the server doesn't have to read it!
> But, again, the assumption is
> that the server will only send a response without a 100 if it is a 4xx
> or 5xx.
>
> > It seems by what you are saying that if 100-continue is
> > present this wouldn't be allowed, and that to ensure correct
> > behaviour the handler would have to read at least some of the
> > request body before sending back the response headers.
>
> You are right, I was wrong.
>
> > > Since ap_http_filter is an input filter only, it should be enough to
> > > just avoid reading from the input brigade. (AFAICT, anyway.)
> >
> > In other words block the handler from reading, potentially
> > raise an error in the process. Except to be fair and
> > consistent, you would have to apply the same rule even if
> > 100-continue isn't present. Whether that would break some
> > existing code in doing that is the concern I have, even if it
> > is some simple test program that just echoes back the request
> > body as the response body.
>
> Technically, even if the server returns a 4xx, it can still read the
> request body, but it might not get anything or it might only get part of
> it. I guess the change to the WSGI spec that is needed is to say that
> the gateway must not send the "100 continue" if it has already sent some
> headers, and that it should send a "100 continue" before any 2xx or 3xx
> code, which is basically what James Knight suggested (sorry James). The
> gateway must indicate EOF if only a partial request body was received. I
> don't think the gateway should be required to provide any of the partial
> request content on a 4xx, though.
>
> - Brian
>

From tseaver at palladion.com Thu Jan 31 10:06:42 2008
From: tseaver at palladion.com (Tres Seaver)
Date: Thu, 31 Jan 2008 04:06:42 -0500
Subject: [Web-SIG] Prototype of wsgi.input.readline().
In-Reply-To: <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com>
References: <88e286470801300208r42bbe979gde2297a741cf3194@mail.gmail.com>
 <47A0A4D2.3050207@plope.com>
 <88e286470801301842s3b3a5ac0o5350cf27d6e7771f@mail.gmail.com>
Message-ID: <47A18FA2.1010303@palladion.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Graham Dumpleton wrote:
> On 31/01/2008, Chris McDonough wrote:
>> cgi.FieldStorage doesn't call readline() without an argument.
>> cgi.parse_multipart does, but this function is not used by
>> cgi.FieldStorage. I don't know if this changes anything.
>
> Not really, I should have said 'cgi' module as a whole rather than
> specifically cgi.FieldStorage. Given that people might be using
> cgi.parse_multipart in standard CGI, there would probably still be an
> expectation that it worked for WSGI. We can't really say that you can
> use cgi.FieldStorage but not cgi.parse_multipart. People will just
> expect all the normal tools people would use for this to work.

cgi.parse_multipart is now deprecated (at least in Python 2.4.4 and
2.5.1) (per the bug report), with the following in its docstring:

    XXX This should really be subsumed by FieldStorage altogether -- no
    point in having two implementations of the same parsing algorithm.
    Also, FieldStorage protects itself better against certain DoS attacks
    by limiting the size of the data read in one chunk. The API here
    does not support that kind of protection. This also affects parse()
    since it can call parse_multipart().

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHoY+i+gerLs4ltQ4RAh9yAJoDUcMc2aIzmBXWx7TnLV2flhAU/QCeNpDy
r6FyJT6s6L6QAfpxA0Ss5+E=
=jkNC
-----END PGP SIGNATURE-----
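
The chunk-limiting protection that the docstring above attributes to
FieldStorage can be illustrated with a minimal sketch. It is not the cgi
module's actual code: the helper name read_lines_bounded and both
default limits are assumptions made for illustration, and it presumes
the input stream's readline() accepts a size argument, which is the
very point under discussion in this thread.

```python
import io


def read_lines_bounded(stream, max_line=65536, max_total=1048576):
    """Yield pieces of `stream`, never requesting more than `max_line`
    bytes per readline() call and refusing to go past `max_total`.

    Hypothetical helper: the name and the default limits are
    assumptions, not part of WSGI or the cgi module.
    """
    total = 0
    while True:
        # The size argument caps how much a single call can buffer, so
        # a body with no newline cannot be slurped in one go.
        chunk = stream.readline(max_line)
        if not chunk:
            break  # end of input
        total += len(chunk)
        if total > max_total:
            raise ValueError("request body exceeds %d bytes" % max_total)
        yield chunk


# Usage with an in-memory stand-in for wsgi.input: a line longer than
# max_line is returned in several pieces rather than all at once.
body = io.BytesIO(b"a=1\n" + b"x" * 10 + b"\nlast")
pieces = list(read_lines_bounded(body, max_line=8))
```

A caller that used readline() with no argument would get no such cap,
which is why code relying on it is harder to protect.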