From davidgshi at yahoo.co.uk Wed Sep 9 12:52:44 2009
From: davidgshi at yahoo.co.uk (David Shi)
Date: Wed, 9 Sep 2009 10:52:44 +0000 (GMT)
Subject: [Web-SIG] Shortening execution time of Python script
Message-ID: <444720.89248.qm@web26308.mail.ukl.yahoo.com>

I have a Python script that automatically downloads zip files containing
large datasets from another server and then unzips the files to further
process the data.

It has been used as a geoprocessor of ArcGIS Server.

The script works fine when the two datasets are each several kilobytes in
size, but the script stops halfway when the datasets are about
11,000 KBytes.

I think that the execution time is too long and ArcGIS Server just simply
killed the process.

What actions can I try to reduce the execution time?

ArcGIS Server only works on the basis of 32 bits and I was told that the
maximum memory it can utilise is 4 MBytes.

I should be grateful if someone can make suggestions/recommendations.

Sincerely,

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dirkjan at ochtman.nl Wed Sep 9 13:14:55 2009
From: dirkjan at ochtman.nl (Dirkjan Ochtman)
Date: Wed, 9 Sep 2009 13:14:55 +0200
Subject: [Web-SIG] Shortening execution time of Python script
In-Reply-To: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
References: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
Message-ID:

On Wed, Sep 9, 2009 at 12:52, David Shi wrote:
> What actions can I try to reduce the execution time?

This has absolutely nothing to do with the interests of the Web-SIG.
Please send more general Python questions to
http://mail.python.org/mailman/listinfo/python-list.

Cheers,

Dirkjan

From ionel.mc at gmail.com Wed Sep 9 13:34:49 2009
From: ionel.mc at gmail.com (Ionel Maries Cristian)
Date: Wed, 9 Sep 2009 14:34:49 +0300
Subject: [Web-SIG] Shortening execution time of Python script
In-Reply-To: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
References: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
Message-ID:

You're contradicting yourself a bit - what's the actual problem, process
memory size or execution time?

If it's the process memory size you could trick ArcGIS by using a
subprocess that does the actual work (and eats into the memory).

-- ionel

On Wed, Sep 9, 2009 at 13:52, David Shi wrote:
> I have a Python script that automatically downloads zip files containing
> large datasets from another server and then unzips the files to further
> process the data.
>
> It has been used as a geoprocessor of ArcGIS Server.
>
> The script works fine when the two datasets are each several kilobytes in
> size, but the script stops halfway when the datasets are about
> 11,000 KBytes.
>
> I think that the execution time is too long and ArcGIS Server just simply
> killed the process.
>
> What actions can I try to reduce the execution time?
>
> ArcGIS Server only works on the basis of 32 bits and I was told that the
> maximum memory it can utilise is 4 MBytes.
>
> I should be grateful if someone can make suggestions/recommendations.
>
> Sincerely,
>
> David

From pstradomski at gmail.com Wed Sep 9 22:32:02 2009
From: pstradomski at gmail.com (Paweł Stradomski)
Date: Wed, 9 Sep 2009 22:32:02 +0200
Subject: [Web-SIG] Shortening execution time of Python script
In-Reply-To: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
References: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
Message-ID: <200909092232.02784.pstradomski@gmail.com>

In a message dated Wednesday, 09 September 2009, David Shi wrote:
> ArcGIS Server only works on the basis of 32 bits and I was told that the
> maximum memory it can utilise is 4 MBytes.

32 bits give a 4 GB address space, not 4 MB.

> I should be grateful if someone can make suggestions/recommendations.
>
> Sincerely,
>
> David

--
Paweł Stradomski

From chris at simplistix.co.uk Thu Sep 10 09:39:20 2009
From: chris at simplistix.co.uk (Chris Withers)
Date: Thu, 10 Sep 2009 08:39:20 +0100
Subject: [Web-SIG] Shortening execution time of Python script
In-Reply-To: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
References: <444720.89248.qm@web26308.mail.ukl.yahoo.com>
Message-ID: <4AA8AD28.4070800@simplistix.co.uk>

David Shi wrote:
> I have a Python script that automatically downloads zip files containing
> large datasets from another server and then unzips the files to further
> process the data.

This smells more like it belongs on comp.lang.python than web-sig, but
here goes...

> The script works fine when the two datasets are each several kilobytes in
> size, but the script stops halfway when the datasets are about
> 11,000 KBytes.
>
> I think that the execution time is too long and ArcGIS Server just
> simply killed the process.

How does ArcGIS execute this script?

> What actions can I try to reduce the execution time?
>
> ArcGIS Server only works on the basis of 32 bits and I was told that the
> maximum memory it can utilise is 4 MBytes.

For speed analysis, run the script through cProfile:

http://docs.python.org/library/profile.html

For analysis of how much memory your script is using, use heapy:

http://guppy-pe.sourceforge.net/heapy_tutorial.html

However, you mention downloading large files. Are you using httplib,
urllib or urllib2 to do this? If so, you could be suffering from this
bug:

http://bugs.python.org/issue6838

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
           - http://www.simplistix.co.uk

From mjmoran at gmail.com Sat Sep 12 23:10:56 2009
From: mjmoran at gmail.com (Michael Moran)
Date: Sat, 12 Sep 2009 16:10:56 -0500
Subject: [Web-SIG] Getting POST data in Python 3.1.1
Message-ID: <4AAC0E60.2040103@gmail.com>

Hello,

I've been following this list some and watching the bugs and it looks
like there are currently some issues, specifically with getting data out
of wsgi.input or getting POST data with cgi.FieldStorage. Has anyone
made any progress on getting this working? Or has anyone developed any
modules that replace the FieldStorage functionality?

I've been spinning my wheels trying to get at the POST data, and I'm
still not 100% sure what needs to be done, but if anyone has made any
progress please let me know. I would really like to be able to do more
than GET requests, and I would imagine as Python 3.X usage picks up
others will too.

Thanks,
Michael Moran

From armin.ronacher at active-4.com Thu Sep 17 17:52:53 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Thu, 17 Sep 2009 17:52:53 +0200
Subject: [Web-SIG] Strings in Jython [Graham's WSGI for py3]
Message-ID: <4AB25B55.9050608@active-4.com>

Hi,

This is my first reply in a list of replies for Graham's lengthy blog
post about WSGI 3 [1]. I break it up into multiple separate threads so
that this can be discussed more easily.

> What should be highlighted is that for Jython, as I understand it at
> least, when reading from a socket connection it returns a unicode
> string. That unicode string will only have characters in the range
> \u0000 through \u00FF, inclusive. Further, it is possible to transcode
> that unicode string without needing to go through a separate byte
> string type.

On Jython 2.5 (the only one I tested) there is a 'str' and 'unicode'
type and sockets return strings. I can't see much difference to CPython
here.

Is the Jython unicode issue really (still) relevant?

I can see that IronPython has only one string type, but they are doing
fine handling binary data in their unicode? ones.

Regards,
Armin

[1]: http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

From armin.ronacher at active-4.com Thu Sep 17 17:57:38 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Thu, 17 Sep 2009 17:57:38 +0200
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
Message-ID: <4AB25C72.50004@active-4.com>

Hi,

Graham mentioned that the WSGI development might further drift apart
based on the changes Ian Bicking and I made at DjangoCon in a separate hg
repository [1] for the WSGI PEP.

I just want to point out that these are in no way final and are further
intended to only clarify some of the wrong wordings for Python 2, give
us a real readline() function on the input stream and get rid of useless
old cruft such as Python 2.2 support and Jython compatibility which no
longer appears to be a problem.

My personal idea would be making that PEP WSGI 1.1 and having a separate
one for Python 3. The reason for pushing up the number would be that
frameworks can then figure out whether they have to process the input
stream defensively because there is no useful readline function.
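
A minimal sketch of what processing the input stream "safely" can look
like when readline() cannot be trusted: wrap wsgi.input so that reads
never go past CONTENT_LENGTH. The names LimitedInput and wrap_input are
hypothetical and only for illustration; they are not part of the WSGI PEP
or of any framework mentioned in this thread.

```python
import io


class LimitedInput:
    """Hypothetical wrapper that never reads past `limit` bytes."""

    def __init__(self, stream, limit):
        self._stream = stream
        self._remaining = limit

    def _clamp(self, size):
        # A negative size means "as much as allowed".
        if size < 0 or size > self._remaining:
            return self._remaining
        return size

    def read(self, size=-1):
        data = self._stream.read(self._clamp(size))
        self._remaining -= len(data)
        return data

    def readline(self, size=-1):
        # Pass an explicit size so the server's stream cannot block
        # waiting for bytes beyond the request body.
        line = self._stream.readline(self._clamp(size))
        self._remaining -= len(line)
        return line


def wrap_input(environ):
    """Return a length-limited view of environ['wsgi.input']."""
    length = int(environ.get('CONTENT_LENGTH') or 0)
    return LimitedInput(environ['wsgi.input'], length)


# Assumed example environ: 8 bytes of body followed by unrelated data.
environ = {
    'CONTENT_LENGTH': '8',
    'wsgi.input': io.BytesIO(b'a=1\nb=2\nEXTRA'),
}
body = wrap_input(environ)
lines = [body.readline(), body.readline(), body.readline()]
print(lines)
```

The third readline() returns an empty byte string instead of running into
the data after the declared body, which is the behaviour a framework would
have to build itself if the server's readline() is not usable.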

Regards,
Armin

[1]: http://bitbucket.org/ianb/wsgi-peps/

From armin.ronacher at active-4.com Thu Sep 17 18:26:52 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Thu, 17 Sep 2009 18:26:52 +0200
Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3]
Message-ID: <4AB2634C.2070009@active-4.com>

Hi,

Graham currently proposes [1] the following behaviors for strings in WSGI
(Python version independent). However this mail only covers the Python
3 part which I assume becomes a separate section in the PEP or even WSGI
version.

Terminology:

  byte string    == contains bytes
  unicode string == contains unicode code points*
  native string  == what the Python version uses as a string
                    (bytes in Python 2, unicode in Python 3)

  * ucs2 / ucs4 is ignored here. You might still have problems
    with surrogate pairs in ucs2 Python builds and Jython.

> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
> environment, the value of the variable should be a native string.

URLs in general are a tricky topic. For this particular field it does
not matter if we decide on bytes or unicode because it will always only
contain ASCII characters. This should be picked consistently with the
type of PATH_INFO and SCRIPT_NAME.

> 3. For the CGI variables contained in the WSGI environment, the values
> of the variables are byte strings.

\o/ Totally agree with that.

> 4. The WSGI input stream 'wsgi.input' contained in the WSGI
> environment and from which request content is read, should yield byte
> strings.

Same thing.

> 5. The status line specified by the WSGI application must be a byte
> string.

Ditto.

> 6. The list of response headers specified by the WSGI application must
> contain tuples consisting of two values, where each value is a byte
> string.

Makes sense because people stuff a lot of non-latin1 stuff in there.
However I'm fine with latin1 for headers here as well but that would
probably only affect cookie and custom headers.

> 7. The iterable returned by the application and from which response
> content is derived, must yield byte strings.

I totally agree.

However Graham moves further away from that in the rest of the blog post
because he wants to point out that people use WSGI directly and that
explicit bytestrings in Python 3 confuse people. The latest iteration
in the blog post is not to use bytestrings in a single location except
for headers and the input stream.

I thought a lot about this in the past and I welcome the step to make
WSGI harder to use! This might sound absurd, but once encodings are
really explicit, people will think about it. I think we should
discourage *applications* written in WSGI and link to implementations in
the PEP.

The big problems are always PATH_INFO and SCRIPT_NAME. Those are the
only values that are in the dict URL-decoded and might contain non-ASCII
characters. (except for headers, but that's a different story because
the only real-world problem there are cookie headers and those are
troubling for more reasons than just character sets)

My latest change to the WSGI sandbox hg repo [2] was that I added a
notice that later PEP revisions might document a RAW_SCRIPT_NAME or
something that contains the URL quoted values. It however turns out
that this value is not available from within a webserver context (We're
talking about Apache and IIS here) so that the problem of unquoted
values will not go away.

It also introduces the concept of URI encodings. I'm especially unhappy
with this part. It would mean that implementations would have to follow
the WSGI URI encoding if set. Most of the applications are using either
latin1 or UTF-8 URLs, I would leave that including the decoding of *all*
incoming data to the user.

So yes, I'm all for definition #1 in the blog post where Graham says:

> The first is that although WSGI 1.0 on Python 3.X should strictly be
> bytes everywhere as per Definition #1, it is probably too late to
> enforce this now.

I don't think so. Reasoning: Python 3.0 does not work and is considered
outdated, Python 3.1 might ship with a wsgiref that's against a
revised spec, but cgi.FieldStorage is still broken there, making it
impossible to use for anything but small applications.

Regards,
Armin

[1]: http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html
[2]: http://bitbucket.org/ianb/wsgi-peps/

From armin.ronacher at active-4.com Thu Sep 17 18:40:20 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Thu, 17 Sep 2009 18:40:20 +0200
Subject: [Web-SIG] WSGI and async Servers
Message-ID: <4AB26674.7050600@active-4.com>

Hi,

For this topic I would love to remind everybody that the web is
currently changing and will change even more in the future, which will
probably also mean that a lot of what we're doing currently might not be
common practice in the near future.

WSGI is currently not doing too well for asynchronous applications, so
people claim. I don't know where this is coming from, probably because
everybody still thinks our data storages are traditional databases. But
we really have to wake up from that idea and start at least
*considering* asynchronous designs when it comes to WSGI.

Tornado appeared recently and from a technical perspective, it's a step
backwards. It's not supporting all of HTTP and it's clearly not
supporting WSGI in any way beyond the very basics. But the interesting
point is that this does not matter for many applications. Even for an
application that was never designed to be non-blocking and that just
recently dropped MySQL for most of the data, Tornado is a huge
performance improvement (personal experience).

Why would it be good to encourage async applications on top of WSGI?
Because people would otherwise come up with their own implementations
that are incompatible with each other. Maybe that should not go into WSGI
but an AWSGI or whatever, but I'm pretty sure we should at least consider
it and ask people that use asynchronous applications/servers what the
issues with WSGI are.

Regards,
Armin

From renesd at gmail.com Thu Sep 17 23:14:26 2009
From: renesd at gmail.com (René Dudfield)
Date: Thu, 17 Sep 2009 22:14:26 +0100
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <4AB25C72.50004@active-4.com>
References: <4AB25C72.50004@active-4.com>
Message-ID: <64ddb72c0909171414s5bd134f4wac286b67ad067ed4@mail.gmail.com>

hi,

I don't like your and Ian's changes with regard to cgi. cgi exists.
Breaking wsgi apps on cgi is silly. Especially as it is still the only way
to run python web apps on some hosts. Even though many current python
frameworks are not optimized enough to run on cgi, it is still used by
people.

I think you mean pre-2.2 support, not python 2.2? iterators came
about in python 2.2.

Fixing the python3 wsgi situation needs to happen very soon (it's been
a year already!). I don't think delaying it any longer is a good idea
for python 3 and for python as a whole. So making a separate wsgi
version will not be good if a new wsgi comes out for python3.

From armin.ronacher at active-4.com Thu Sep 17 23:49:42 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Thu, 17 Sep 2009 23:49:42 +0200
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <64ddb72c0909171414s5bd134f4wac286b67ad067ed4@mail.gmail.com>
References: <4AB25C72.50004@active-4.com> <64ddb72c0909171414s5bd134f4wac286b67ad067ed4@mail.gmail.com>
Message-ID: <4AB2AEF6.6080001@active-4.com>

Hi,

René Dudfield wrote:
> I don't like your and Ian's changes with regard to cgi. cgi exists.
> Breaking wsgi apps on cgi is silly.
Can you give an example of where we break CGI compatibility?

> I think you mean pre-2.2 support, not python 2.2? iterators came
> about in python 2.2.
That might be. That was before my time. I'm pretty sure the first
Python version I used was 2.3, but don't quote me on that.

> Fixing the python3 wsgi situation needs to happen very soon (it's been
> a year already!). I don't think delaying it any longer is a good idea
> for python 3 and for python as a whole. So making a separate wsgi
> version will not be good if a new wsgi comes out for python3.
I agree that WSGI for Python 3 has to be fixed, I'm just not yet
convinced that Python 3 is what will be relevant anytime soon. From my
current perspective there is still too much left unanswered in Python 3.

Regards,
Armin

From ianb at colorstudy.com Fri Sep 18 00:57:13 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Thu, 17 Sep 2009 17:57:13 -0500
Subject: [Web-SIG] WSGI and async Servers
In-Reply-To: <4AB26674.7050600@active-4.com>
References: <4AB26674.7050600@active-4.com>
Message-ID:

On Thu, Sep 17, 2009 at 11:40 AM, Armin Ronacher wrote:
> Why would it be good to encourage async applications on top of WSGI?
> Because people would otherwise come up with their own implementations
> that are incompatible with each other. Maybe that should not go into WSGI
> but an AWSGI or whatever, but I'm pretty sure we should at least consider
> it and ask people that use asynchronous applications/servers what the
> issues with WSGI are.

I think AWSGI would be most appropriate. There's too much going on,
and trying to keep WSGI sane while allowing async is just too hard. If
we fork, then people can get something that really works well, they
can try it out with real applications, and then maybe we can look at
something we know works and see if AWSGI/WSGI differences can be
resolved to bring it back into one spec.

And indeed it's quite possible at the library level that AWSGI could
be supported by other libraries; I'm guessing for instance that WebOb
would just require a few checks around the request body, and probably
the response would work relatively fine (but for many patterns a
normal response object would not be sufficient in an async context --
but that's fine too).

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From ianb at colorstudy.com Fri Sep 18 01:01:53 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Thu, 17 Sep 2009 18:01:53 -0500
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <4AB25C72.50004@active-4.com>
References: <4AB25C72.50004@active-4.com>
Message-ID:

On Thu, Sep 17, 2009 at 10:57 AM, Armin Ronacher wrote:
> I just want to point out that these are in no way final and are further
> intended to only clarify some of the wrong wordings for Python 2, give
> us a real readline() function on the input stream and get rid of useless
> old cruft such as Python 2.2 support and Jython compatibility which no
> longer appears to be a problem.

To reiterate: people have complained that we've discussed
non-controversial changes to WSGI, but the spec hasn't been updated.
This was in large part, I think, because no one took the step going
from discussion to actual proposed PEP changes. So these are some
proposed changes, intended to be conservative. They are meant to be
conservative, more like errata than a real revision, and to reflect
current WSGI practice. If someone thinks one of the changes goes too
far, then we can discuss -- I think we'll just be more constructive if
we stick to concrete changes to the PEP so we can easily implement
what we all agree on.

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From renesd at gmail.com Fri Sep 18 09:01:45 2009
From: renesd at gmail.com (René Dudfield)
Date: Fri, 18 Sep 2009 08:01:45 +0100
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <4AB2AEF6.6080001@active-4.com>
References: <4AB25C72.50004@active-4.com> <64ddb72c0909171414s5bd134f4wac286b67ad067ed4@mail.gmail.com> <4AB2AEF6.6080001@active-4.com>
Message-ID: <64ddb72c0909180001l7587a02dic5ea47dc43aa57e8@mail.gmail.com>

Hello,

On Thu, Sep 17, 2009 at 10:48 PM, Armin Ronacher wrote:
> Hi,
>
> René Dudfield wrote:
>> I don't like your and Ian's changes with regard to cgi. cgi exists.
>> Breaking wsgi apps on cgi is silly.
> Can you give an example of where we break CGI compatibility?
>

From this link:
http://bitbucket.org/ianb/wsgi-peps/changeset/b51893478f9a/

It says "Because of this future revisions of WSGI will most likely
switch away from a raw CGI environment to require the server to
provide these values to be quoted and available on a different key."

That sounds like wsgi is breaking cgi... or plans to break cgi. cgi
is one of those things that just isn't going to die... it's still
useful and used. I'm not sure if those changes actually break cgi or
not... but that wording sounds like they do.

Also on that link, why not explicitly state that python 2.x should use
str or StringType there? (line 977).

>> I think you mean pre-2.2 support, not python 2.2? iterators came
>> about in python 2.2.
> That might be. That was before my time. I'm pretty sure the first
> Python version I used was 2.3, but don't quote me on that.

It was definitely 2.2. So I think that needs to be changed in your
changes - and related changes double checked. See
http://docs.python.org/whatsnew/2.2.html

>
>> Fixing the python3 wsgi situation needs to happen very soon (it's been
>> a year already!). I don't think delaying it any longer is a good idea
>> for python 3 and for python as a whole. So making a separate wsgi
>> version will not be good if a new wsgi comes out for python3.
> I agree that WSGI for Python 3 has to be fixed, I'm just not yet
> convinced that Python 3 is what will be relevant anytime soon. From my
> current perspective there is still too much left unanswered in Python 3.
>

It looks like python3 issues are being addressed in your changes
anyway. Work on python3 building blocks needs to happen soon
otherwise it will hold up lots of people from even porting their code
to python3. wsgi is one of the main things lagging behind in the
python3 porting effort.

I think both mod_wsgi and cherrypy have worked on issues with python3,
and it seems like there is agreement there on most issues. So we have
two implementations to play with, and people have more experience with
python3 now... so we should be in a good position to get python3
related changes through quickly.

Thanks for your work on this.

cheers!

From graham.dumpleton at gmail.com Fri Sep 18 09:15:45 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 18 Sep 2009 17:15:45 +1000
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <4AB25C72.50004@active-4.com>
References: <4AB25C72.50004@active-4.com>
Message-ID: <88e286470909180015p6ccd091ey7722a0b553cf22a0@mail.gmail.com>

2009/9/18 Armin Ronacher :
> Hi,
>
> Graham mentioned that the WSGI development might further drift apart
> based on the changes Ian Bicking and I made at DjangoCon in a separate hg
> repository [1] for the WSGI PEP.
>
> I just want to point out that these are in no way final and are further
> intended to only clarify some of the wrong wordings for Python 2, give
> us a real readline() function on the input stream and get rid of useless
> old cruft such as Python 2.2 support and Jython compatibility which no
> longer appears to be a problem.
>
> My personal idea would be making that PEP WSGI 1.1 and having a separate
> one for Python 3. The reason for pushing up the number would be that
> frameworks can then figure out whether they have to process the input
> stream defensively because there is no useful readline function.
>
> [1]: http://bitbucket.org/ianb/wsgi-peps/

My concern on seeing the changes is that because no overall plan had
been described by Ian or you as to where you were going in making the
changes, I couldn't see what the end goal was going to be. Thus, I didn't
know whether you had a particular end point in mind, ie., a specific
definition of how things should work, or whether you were just going
to incrementally make changes and see what fell out of the process at
the end.

I guess I just find it hard to know what you are trying to do by
reading individual changes.

Your comment above about 'having a separate one (specification) for
Python 3' also worries me a bit. That is sort of what I want to avoid.
I would rather we try and use language such that a single
specification would apply meaningfully to all Python versions. I
acknowledge there will still end up being some subtle differences with
how things will work between Python 2.X and Python 3.X, but I don't
think it is enough to warrant a separate specification for Python 3.X,
which in my mind would only confuse things.

So, how about describing your overall master plan? :-)

Graham

From graham.dumpleton at gmail.com Fri Sep 18 09:19:41 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 18 Sep 2009 17:19:41 +1000
Subject: [Web-SIG] Strings in Jython [Graham's WSGI for py3]
In-Reply-To: <4AB25B55.9050608@active-4.com>
References: <4AB25B55.9050608@active-4.com>
Message-ID: <88e286470909180019x400d769br5f5f8d0be8bacba4@mail.gmail.com>

2009/9/18 Armin Ronacher :
> Hi,
>
> This is my first reply in a list of replies for Graham's lengthy blog
> post about WSGI 3 [1]. I break it up into multiple separate threads so
> that this can be discussed more easily.
>
>> What should be highlighted is that for Jython, as I understand it at
>> least, when reading from a socket connection it returns a unicode
>> string. That unicode string will only have characters in the range
>> \u0000 through \u00FF, inclusive. Further, it is possible to transcode
>> that unicode string without needing to go through a separate byte
>> string type.
>
> On Jython 2.5 (the only one I tested) there is a 'str' and 'unicode'
> type and sockets return strings. I can't see much difference to CPython
> here.
>
> Is the Jython unicode issue really (still) relevant?
>
> I can see that IronPython has only one string type, but they are doing
> fine handling binary data in their unicode? ones.
>
> [1]:
> http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

For the record, I have never used Jython or IronPython so could only
base what I said on existing information in the PEP and past
discussions in the Google Groups archive about WSGI when the
specification was being drafted. Thus, we definitely need people familiar
with those Python implementations to comment on whether what I was
saying makes any sense at all.

Graham

From renesd at gmail.com Fri Sep 18 09:21:23 2009
From: renesd at gmail.com (René Dudfield)
Date: Fri, 18 Sep 2009 08:21:23 +0100
Subject: [Web-SIG] WSGI and async Servers
In-Reply-To:
References: <4AB26674.7050600@active-4.com>
Message-ID: <64ddb72c0909180021x101d5fe8hd1330a49bf4a1ee9@mail.gmail.com>

I'm pretty sure you can use async sockets + wsgi with Eventlet.
http://eventlet.net/

That shows it's possible to support wsgi with async servers.

Eventlet is quite nice towards wsgi in this way. One of Eventlet's
backends is twisted.

From graham.dumpleton at gmail.com Fri Sep 18 09:56:23 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 18 Sep 2009 17:56:23 +1000
Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3]
In-Reply-To: <4AB2634C.2070009@active-4.com>
References: <4AB2634C.2070009@active-4.com>
Message-ID: <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com>

2009/9/18 Armin Ronacher :
> Hi,
>
> Graham currently proposes [1] the following behaviors for strings in WSGI
> (Python version independent). However this mail only covers the Python
> 3 part which I assume becomes a separate section in the PEP or even WSGI
> version.
>
> Terminology:
>
>   byte string    == contains bytes
>   unicode string == contains unicode code points*
>   native string  == what the Python version uses as a string
>                     (bytes in Python 2, unicode in Python 3)
>
>   * ucs2 / ucs4 is ignored here. You might still have problems
>     with surrogate pairs in ucs2 Python builds and Jython.
>
>> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
>> environment, the value of the variable should be a native string.
>
> URLs in general are a tricky topic. For this particular field it does
> not matter if we decide on bytes or unicode because it will always only
> contain ASCII characters. This should be picked consistently with the
> type of PATH_INFO and SCRIPT_NAME.

I believe it does matter, and the fact that it contains only ASCII
doesn't necessarily mean it is somehow simpler. The reason is that the
URL reconstruction recipe as per the WSGI PEP has to work.
Ie.,

    from urllib import quote
    url = environ['wsgi.url_scheme']+'://'

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME',''))
    url += quote(environ.get('PATH_INFO',''))
    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']

In Python 2.X you can concatenate byte strings and unicode strings:

    >>> 'http' + u'://'
    u'http://'

In Python 3.X you cannot concatenate byte strings and unicode strings:

    >>> b'http'+'://'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't concat bytes to str

On the basis that SCRIPT_NAME, PATH_INFO and QUERY_STRING when used by
a user in Python 3.X were likely to be held as unicode strings, I then
saw wsgi.url_scheme needing to be of the same type, albeit specified as
native string so still byte string as we are accustomed to in Python
2.X now.

This is also why all the other CGI variables are similarly made to be
unicode strings. That is, so they are all the same type and stuff like
URL reconstruction will work.

If bytes is used, you could potentially end up with messy situations
where you have to perform URL reconstruction as bytes, but then
convert it to unicode strings to stuff it in as a parameter into some
templating system where the template text is unicode. If SCRIPT_NAME,
PATH_INFO and QUERY_STRING are in bytes form and they needed different
encodings, how do you easily convert your byte strings to the unicode
string needed to stuff in the template? Can't see how you could, they
really need to be in unicode if everything else in the system is going
to be unicode. Or are templating systems now going to be expected to
drop down and use bytes all the time as well?
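
As a minimal sketch of the point about types, here is the same recipe
written for Python 3 under the assumption that every environ value is a
native (unicode) string. The function name reconstruct_url and the sample
environ values are made up for illustration; urllib.parse.quote is the
Python 3 location of quote.

```python
from urllib.parse import quote


def reconstruct_url(environ):
    """Rebuild the request URL, assuming all environ values are str."""
    url = environ['wsgi.url_scheme'] + '://'

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']
        # Only append the port when it is not the scheme's default.
        default_port = '443' if environ['wsgi.url_scheme'] == 'https' else '80'
        if environ['SERVER_PORT'] != default_port:
            url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))
    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']
    return url


# Hypothetical environ for illustration; every value is a native string.
environ = {
    'wsgi.url_scheme': 'http',
    'SERVER_NAME': 'example.com',
    'SERVER_PORT': '8080',
    'SCRIPT_NAME': '/app',
    'PATH_INFO': '/some path',
    'QUERY_STRING': 'x=1',
}
print(reconstruct_url(environ))

# If a single value is bytes instead, the first concatenation fails.
try:
    reconstruct_url({**environ, 'wsgi.url_scheme': b'http'})
except TypeError as exc:
    print('TypeError:', exc)
```

With consistent types the recipe carries over almost unchanged; with one
bytes value it stops at the first `+`, which is the failure mode described
above.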
> However Graham moves further away from that in the rest of the blog post > because he wants to point out that people use WSGI directly and that > explicit bytestrings in Python 3 confuse people. ?The latest iteration > in the blog post is not to use bytestrings in a single location except > for headers and the input stream. Plus the response content would need to be bytes, albeit allowing an ISO-8859-1 fallback if unicode like other response items. The use of unicode exclusively is only really a big factor in WSGI environment variables. > I thought a lot about this in the past and I welcome the step to make > WSGI harder to use! ?This might sound absurd, but once encodings are > really explicit, people will think about it. ?I think we should > discourage *applications* written in WSGI and link to implementations in > the PEP. As a way of deterring a lot of users, making it harder to use, or at least making it more obvious that thought is required, would be quite effective. This would also be good in pushing people to use existing frameworks/toolkits which deal with all this stuff internally and hide it and instead present unicode strings at a higher level after doing everything correctly. So, it may well curtail the NIH issue that is becoming a problem, but am not sure that doing that and making it harder for users who want to work at that level, is a good idea. As others have pointed out, the likes of rack and jack, not sure about the new Perl variant, don't seem to have an issue with using unicode. > The big problems are always PATH_INFO and SCRIPT_NAME. ?Those are the > only values that are in the dict URL-decoded and might contain non-ASCII > characters. 
> (except for headers, but that's a different story because
> the only real-world problem there are cookie headers and those are
> troubling for more reasons than just character sets)
>
> My latest change to the WSGI sandbox hg repo [2] was that I added a
> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
> something that contains the URL quoted values. It however turns out
> that this value is not available from within a webserver context (We're
> talking about Apache and IIS here) so that the problem of unquoted
> values will not go away.

I am still waiting for a good explanation of why access to the raw URL quoted values is so important. Can you please explain what the requirement is?

The only example I recall was related to web servers eliminating repeating slashes, thereby effectively making it impossible to have URLs in query strings without a custom encoding string. Since there are alternatives, I don't find that alone a compelling argument.

> It also introduces the concept of URI encodings. I'm especially unhappy
> with this part. It would mean that implementations would have to follow
> the WSGI URI encoding if set.

No it doesn't. The whole point of providing wsgi.uri_encoding was so that a WSGI application would know the encoding, so as to be able to reverse it to bytes and convert it to something else. Given that you accept below that most of the time latin1 or UTF-8 would be used, the typical case would be handled automatically and so transcoding wouldn't be required.

> Most of the applications are using either
> latin1 or UTF-8 URLs, I would leave that including the decoding of *all*
> incoming data to the user.
>
> So yes, I'm all for definition #1 in the blog post where Graham says:
>
>> The first is that although WSGI 1.0 on Python 3.X should strictly be
>> bytes everywhere as per Definition #1, it is probably too late to
>> enforce this now.
> I don't think so.
> Reasoning: Python 3.0 does not work and is considered
> outdated, Python 3.1 might ship with a wsgiref that's against a
> revisioned spec, but cgi.FieldStorage is still broken there, making it
> impossible to use for anything but small applications.

I'll summarise where people are falling in respect of which definition they want in a later post, after more of the key figures have indicated their choices.

Graham

From renesd at gmail.com Fri Sep 18 10:12:38 2009
From: renesd at gmail.com (René Dudfield)
Date: Fri, 18 Sep 2009 09:12:38 +0100
Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3]
In-Reply-To: <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com>
References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com>
Message-ID: <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com>

On Fri, Sep 18, 2009 at 8:56 AM, Graham Dumpleton wrote:
>> The big problems are always PATH_INFO and SCRIPT_NAME. Those are the
>> only values that are in the dict URL-decoded and might contain non-ASCII
>> characters. (except for headers, but that's a different story because
>> the only real-world problem there are cookie headers and those are
>> troubling for more reasons than just character sets)
>>
>> My latest change to the WSGI sandbox hg repo [2] was that I added a
>> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
>> something that contains the URL quoted values. It however turns out
>> that this value is not available from within a webserver context (We're
>> talking about Apache and IIS here) so that the problem of unquoted
>> values will not go away.
>
> I am still waiting for the good explanation of why access to the raw
> URL quoted values is so important. Can you please explain what the
> requirement is?
> > The only example I recall was related to web servers eliminating > repeating slashes thereby effectively not making it possible to have > URLs in query strings with out a custom encoding string. Since there > are alternatives, I don't find that alone a compelling argument. > Why is the raw url needed(very rarely)? Sometimes there are bugs. Access to the raw string lets you work around those bugs... if you need to. Dropping to a lower level is needed sometimes. Some APIs require you to send back an exact copy of the input url. Or sometimes you want to know what input url was used... not the cleaned up version of it. Sometimes clients calling the wsgi code will be buggy... and looking at the unquoted url is needed in those cases to work around buggy clients. From benoitc at couch.it Fri Sep 18 11:03:27 2009 From: benoitc at couch.it (Benoit Chesneau) Date: Fri, 18 Sep 2009 11:03:27 +0200 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> Message-ID: <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> On Sep 18, 2009, at 10:12 AM, Ren? Dudfield wrote: > > Why is the raw url needed(very rarely)? > > Sometimes there are bugs. Access to the raw string lets you work > around those bugs... if you need to. Dropping to a lower level is > needed sometimes. > > Some APIs require you to send back an exact copy of the input url. Or > sometimes you want to know what input url was used... not the cleaned > up version of it. Sometimes clients calling the wsgi code will be > buggy... and looking at the unquoted url is needed in those cases to > work around buggy clients. And sometimes you need to support full uri spec. For example %2F is different from / . 
Actually, if the whole URL is decoded, you don't know whether the client requested %2F or /; you just get a /. Which is annoying. It causes problems with some APIs. I'm thinking of CouchDB, for example, which accepts a db name with a %2F inside to allow creation of a folder on the user's system.

- benoit

From graham.dumpleton at gmail.com Fri Sep 18 12:06:08 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 18 Sep 2009 20:06:08 +1000
Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3]
In-Reply-To: <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com>
References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com>
Message-ID: <88e286470909180306r231252ffs4ded0a82e7b516b4@mail.gmail.com>

2009/9/18 René Dudfield:
> On Fri, Sep 18, 2009 at 8:56 AM, Graham Dumpleton
> wrote:
>>> The big problems are always PATH_INFO and SCRIPT_NAME. Those are the
>>> only values that are in the dict URL-decoded and might contain non-ASCII
>>> characters. (except for headers, but that's a different story because
>>> the only real-world problem there are cookie headers and those are
>>> troubling for more reasons than just character sets)
>>>
>>> My latest change to the WSGI sandbox hg repo [2] was that I added a
>>> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
>>> something that contains the URL quoted values. It however turns out
>>> that this value is not available from within a webserver context (We're
>>> talking about Apache and IIS here) so that the problem of unquoted
>>> values will not go away.
>>
>> I am still waiting for the good explanation of why access to the raw
>> URL quoted values is so important. Can you please explain what the
>> requirement is?
>> >> The only example I recall was related to web servers eliminating >> repeating slashes thereby effectively not making it possible to have >> URLs in query strings with out a custom encoding string. Since there >> are alternatives, I don't find that alone a compelling argument. >> > > Why is the raw url needed(very rarely)? > > Sometimes there are bugs. ?Access to the raw string lets you work > around those bugs... if you need to. ?Dropping to a lower level is > needed sometimes. > > Some APIs require you to send back an exact copy of the input url. > Or sometimes you want to know what input url was used... not the cleaned > up version of it. What APIs? Can we have some concrete examples in common use rather than theoretical possibilities? > Sometimes clients calling the wsgi code will be > buggy... and looking at the unquoted url is needed in those cases to > work around buggy clients. Bugs in WSGI adapters aren't a good reason for why it is needed. If the WSGI adapters are broken, fix the WSGI adapters. Graham From graham.dumpleton at gmail.com Fri Sep 18 12:21:34 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 18 Sep 2009 20:21:34 +1000 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> Message-ID: <88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com> 2009/9/18 Benoit Chesneau : > And sometimes you need to support full uri spec. For example %2F is > different from / . Actually if all url is decoded you don't know if the > client request was %2F or /, you just get a /. Which is annoying. It causes > some problem with some api ,I'm ?thinking to couchdb for example who accept > db name with a %2F inside to allow creation of folder on user system. 
Which happens because of the way the HTTP URL processing rules says it has to be done. Are there any other real world examples besides repeating slashes and slash encoding issues? Is the desire to bypass traditional SCRIPT_NAME and PATH_INFO and go direct to REQUEST_URI all come down to these slash encoding and path normalising issues? Graham From graham.dumpleton at gmail.com Fri Sep 18 12:48:46 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 18 Sep 2009 20:48:46 +1000 Subject: [Web-SIG] WSGI and async Servers In-Reply-To: <4AB26674.7050600@active-4.com> References: <4AB26674.7050600@active-4.com> Message-ID: <88e286470909180348s7ba40defua0527081d154e82e@mail.gmail.com> 2009/9/18 Armin Ronacher : > Hi, > > For this topic I would love to remember everybody that the web is > currently changing and will even more change in the future which will > probably also mean that a lot of what we're doing currently might not be > common practise in the near future. > > WSGI is currently not doing to well for asyncronous applications, so > people claim. ?I don't know where this is coming from, probably because > everybody still thinks our data storages are traditional databases. ?But > we really have to wake up from that idea and start at least > *considering* asynchronous designs when it comes to WSGI. > > Tornado appeared recently and from a technical perspective, it's a step > backwards. ?It's not supporting all of HTTP and it's clearly not > supporting WSGI in any way beyond the very basics. ?But the interesting > point is, that this does not matter for many applications. ?Even for an > application that was never designed to be non-blocking that just > recently dropped MySQL for most of the data, Tornado is a huge > performance improvement (personal experience). > > Why would it be good to encourage async applications on top of WSGI? > Because people would otherwise come up with their own implementations > that are incompatible to each other. 
?Maybe that should not go into WSGI > but a AWSGI or whatever, but I'm pretty sure we should at least consider > it and ask people that use asynchronous applications/servers what the > issues with WSGI are. Let me clearly state that I am not against the concept of asynchronous or event driven systems. In my 20+ years of coding I have done more work in the area of event driven systems than in other areas. My work on Apache/mod_wsgi and mod_python before that are merely hobbies in comparison. My bread and butter has been distributed messaging and publish/subscribe systems based on event driven systems running across large networks of hosts and sites for building complex real time applications. What I simply don't want is for the asynchronous issue to stop us again from sorting out the synchronous WSGI specification. Let us just deal with it later rather than it once again becoming a distraction. FWIW, one thing I am against with event driven systems is those which are poorly implemented. I also get annoyed when people make claims for event driven systems that are somewhat tenuous. Although event driven systems can be good for some things, they have to be used properly. Trying to adapt them in ways they shouldn't can cause subtle problems. Often people pushing event driven systems either don't understand the potential problems, or want to gloss over them in some way. Trying to bolt a synchronous WSGI directly on top an event driven systems, particular in a multi process web server is a good example for potential problems as I have blogged about in the past in relation to nginx/mod_wsgi. You will get the same sort of potential issues with Tornado depending on how they try to use it in conjunction with WSGI. 
If we ever actually finalise synchronous WSGI and I can get some measure of closure on Apache/mod_wsgi, inasmuch as it is as feature complete as is worth pursuing, then an event driven based web serving mechanism for Python applications and associated static files is certainly an area I am interested in looking at. I already have my own ideas for how I would go about doing it and it isn't like what people are doing now. With the sort of mix of technologies I have in mind I see no reason why it wouldn't perform better than the systems being pushed in the Python world at present.

So, hurry up and work out this synchronous stuff and maybe I can get back on to the event driven system, which frankly I find more interesting anyway. :-)

Graham

From armin.ronacher at active-4.com Fri Sep 18 13:03:14 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Fri, 18 Sep 2009 13:03:14 +0200
Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <64ddb72c0909180001l7587a02dic5ea47dc43aa57e8@mail.gmail.com>
References: <4AB25C72.50004@active-4.com> <64ddb72c0909171414s5bd134f4wac286b67ad067ed4@mail.gmail.com> <4AB2AEF6.6080001@active-4.com> <64ddb72c0909180001l7587a02dic5ea47dc43aa57e8@mail.gmail.com>
Message-ID: <4AB368F2.4080705@active-4.com>

Hi,

René Dudfield wrote:
> It says "Because of this future revisions of WSGI will most likely
> switch away from a raw CGI environment to require the server to
> provide these values to be quoted and available on a different key."

This information would be additional information of course!

> Also on that link, why not explicitly state that python 2.x should use
> str or StringType there? (line 977).

Probably a good idea. Once I'm sure that this is no longer an issue, I will add that.

> It was definitely 2.2. So I think that needs to be changed in your
> changes - and related changes double checked.
> See http://docs.python.org/whatsnew/2.2.html

Will do.
> It looks like python3 issues are being addressed in your changes anyway. But it should be discussed separately and then be integrated. The changes in the PEP currently reflect #1 of Graham's proposal. Regards, Armin From armin.ronacher at active-4.com Fri Sep 18 13:30:44 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Fri, 18 Sep 2009 13:30:44 +0200 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> Message-ID: <4AB36F64.2040604@active-4.com> Hi, Graham Dumpleton schrieb: > I believe it does matter and that it contains ASCII possibly doesn't > mean it is somehow simpler. The reason is that URL reconstruction > recipe as per WSGI PEP has to work. Ie. > *snip* That of course will not work and is not something we should aim for. There is a lot of stuff that will break as well, and libraries are supposed to fix that on the 2.x -> 3.x transition. Actually in 2.6 you can use bytestring literals that will fix that problem for you. The only problem left is wsgi.url_scheme and for that one just have to use an explicit .encode() call. No big deal. > This is also why all the other CGI variables are similarly make to be > unicode strings. That is, so all the same type and stuff like URL > reconstruction will work. In an ideal world, maybe. But the only thing more evil than UnicodeErrors are silent encoding errors that are hard to track down. (What just destroyed my charset information? Oh, it was the WSGI gateway in combination with an ancient internet explorer version) > If bytes is used, you could potentially end up with messy situations > where you have to perform URL reconstruction as bytes, but then > convert it to unicode strings to stuff it in as a parameter into some > templating system where the template text is unicode. URLs are ASCII only, IRIs are not. 
If you are working with Python 3 you would probably start using IRIs internally after a while because "it makes sense".

> If SCRIPT_NAME, PATH_INFO and QUERY_STRING are in bytes form and they
> needed different encodings, how do you easily convert your bytes
> strings to the unicode string needed to stuff in the template. Can't
> see how you could, they really need to be in unicode if everything
> else in the system is going to be unicode. Or are templating systems
> now going to be expected to drop down and use bytes all the time as
> well.

I still defend my point that charsets are a complex topic and it's the framework / library that should deal with that. WebOb does, Werkzeug does, Django does, I'm sure web.py and other libraries do too. If one wants to shoot himself in the foot by implementing his own library based on WSGI we should not stop him.

> As a way of deterring a lot of users, making it harder to use, or at
> least making it more obvious that thought is required, would be quite
> effective.
>
> This would also be good in pushing people to use existing
> frameworks/toolkits which deal with all this stuff internally and hide
> it and instead present unicode strings at a higher level after doing
> everything correctly.

I like that idea a lot :)

> As others have pointed out, the likes of rack and jack, not sure about
> the new Perl variant, don't seem to have an issue with using unicode.

Ruby does not use unicode internally, it uses encoding-marked strings. That is, if a string comes in and is iso-8859-15, it's marked as such and Ruby knows how to deal with it. As far as I know Rack does not specify charsets at all, which probably means that it's up to the implementation to decide what to use. Rack will have the problem with charsets soon enough, they just don't care about unicode enough (yet?).

> I am still waiting for the good explanation of why access to the raw
> URL quoted values is so important. Can you please explain what the
> requirement is?
Knowing the difference between "foo/bar" and "foo%2fbar" I guess. To be humble, I never had the problem, but apparently some other people are. And of course that you suddenly have non ASCII stuff in a dict value ;) > The only example I recall was related to web servers eliminating > repeating slashes thereby effectively not making it possible to have > URLs in query strings with out a custom encoding string. Since there > are alternatives, I don't find that alone a compelling argument. I don't need unquoted strings, I just think it would make sense to have them *if possible*. Regards, Armin From renesd at gmail.com Fri Sep 18 13:45:26 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Fri, 18 Sep 2009 12:45:26 +0100 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> <88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com> Message-ID: <64ddb72c0909180445w788cce7eqbb5f12d893290b7d@mail.gmail.com> On Fri, Sep 18, 2009 at 11:21 AM, Graham Dumpleton wrote: > 2009/9/18 Benoit Chesneau : >> And sometimes you need to support full uri spec. For example %2F is >> different from / . Actually if all url is decoded you don't know if the >> client request was %2F or /, you just get a /. Which is annoying. It causes >> some problem with some api ,I'm ?thinking to couchdb for example who accept >> db name with a %2F inside to allow creation of folder on user system. > > Which happens because of the way the HTTP URL processing rules says it > has to be done. > > Are there any other real world examples besides repeating slashes and > slash encoding issues? 
> > Is the desire to bypass traditional SCRIPT_NAME and PATH_INFO and go > direct to REQUEST_URI all come down to these slash encoding and path > normalising issues? > hello again, No, slash encoding and normalising are not the only issues. As mentioned before sometimes you need the exact bytes. 1. buggy clients. If a client sends something that doesn't work correctly, you can still sometimes make sense of it in the raw version of the url. 2. client APIs that require the server to know the exact url. 3. buggy servers that don't do their job properly. 4. extensibility. A url scheme changes a tiny bit, and you want to support the change. Having the raw url allows you do to support it on old servers. In all APIs it's handy to go to lower levels when the higher levels don't work right. Especially when wsgi only handles one side of things, and urls are can be generated by anything. cheers, From alan at xhaus.com Fri Sep 18 13:51:33 2009 From: alan at xhaus.com (Alan Kennedy) Date: Fri, 18 Sep 2009 12:51:33 +0100 Subject: [Web-SIG] WSGI 1 Changes [ianb's and my changes] In-Reply-To: <4AB2AEF6.6080001@active-4.com> References: <4AB25C72.50004@active-4.com> <64ddb72c0909171414s5bd134f4wac286b67ad067ed4@mail.gmail.com> <4AB2AEF6.6080001@active-4.com> Message-ID: <4a951aa00909180451h5d16830era462a6550e8d13fe@mail.gmail.com> [Rene] >> I think you mean pre-2.2 support, not python 2.2? ?iterators came >> about in python 2.2. [Armin] > That might be. ?That was before my time. ?I'm pretty sure the first > Python version I used was 2.3, but don't quote me on that. As WSGI was being developed, cpython was at version 2.3. The only reason that support for "older versions" was in the spec was because jython was at version 2.1 at the time. The WSGI spec was made much simpler by the use of the iterator protocol (PEP 234), which was in introduced into the language in 2.2. 
So where the spec says "Supporting Older (<2.2) Versions of Python" It should probably have read "Supporting Older (pre-pep-234-iterator-protocol) Versions of Python" I don't know of any modern python implementation that doesn't support the iterator protocol. It's probably time to drop that section from the PEP. Alan. From graham.dumpleton at gmail.com Fri Sep 18 13:55:40 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Fri, 18 Sep 2009 21:55:40 +1000 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <64ddb72c0909180445w788cce7eqbb5f12d893290b7d@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> <88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com> <64ddb72c0909180445w788cce7eqbb5f12d893290b7d@mail.gmail.com> Message-ID: <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> 2009/9/18 Ren? Dudfield : > On Fri, Sep 18, 2009 at 11:21 AM, Graham Dumpleton > wrote: >> 2009/9/18 Benoit Chesneau : >>> And sometimes you need to support full uri spec. For example %2F is >>> different from / . Actually if all url is decoded you don't know if the >>> client request was %2F or /, you just get a /. Which is annoying. It causes >>> some problem with some api ,I'm ?thinking to couchdb for example who accept >>> db name with a %2F inside to allow creation of folder on user system. >> >> Which happens because of the way the HTTP URL processing rules says it >> has to be done. >> >> Are there any other real world examples besides repeating slashes and >> slash encoding issues? >> >> Is the desire to bypass traditional SCRIPT_NAME and PATH_INFO and go >> direct to REQUEST_URI all come down to these slash encoding and path >> normalising issues? >> > > hello again, > > No, slash encoding and normalising are not the only issues. 
>
> As mentioned before, sometimes you need the exact bytes.
>
> 1. buggy clients. If a client sends something that doesn't work
> correctly, you can still sometimes make sense of it in the raw version
> of the url.
> 2. client APIs that require the server to know the exact url.
> 3. buggy servers that don't do their job properly.
> 4. extensibility. A url scheme changes a tiny bit, and you want to
> support the change. Having the raw url allows you to support it on
> old servers.
>
> In all APIs it's handy to go to lower levels when the higher levels
> don't work right. Especially when wsgi only handles one side of
> things, and urls can be generated by anything.

This is where it all comes down to me not having the real-world experience in writing web applications to know best.

What I would like to hear from is PJE (who tends towards #3) and Robert Brewer (who tends towards #4). Can you guys give counter-explanations as to why their arguments for bytes aren't valid?

Ian, I don't think you have yet expressed your leaning, but I would like to hear your point as well.

On top of the issues above, Armin believes 2to3 gives better results where the bytes-everywhere interpretation is used. Has anyone else actually tried 2to3 and have experience with it?
Graham From armin.ronacher at active-4.com Fri Sep 18 13:58:58 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Fri, 18 Sep 2009 13:58:58 +0200 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> <88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com> <64ddb72c0909180445w788cce7eqbb5f12d893290b7d@mail.gmail.com> <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> Message-ID: <4AB37602.2030207@active-4.com> Hi, Graham Dumpleton schrieb: > On top of the issues above, Armin believes 2to3 gives better results > where bytes everywhere interpretation is used. Has anyone else > actually tried 2to3 and have the experience with it? You slightly misquoted me. I said that 2to3 gives good results on high level transformations (eg, a django app between 2 and 3) because both "foo" and u"foo" becomes "foo". Werkzeug, WebOb, Django all use unicode by default, so the application will not notice any changes. That would not change if we would have unicode in the WSGI dict and the framework would be changed to treat it properly and do a encode/decode dance if necessary. The reason I brought it up is that 2to3 does not work at all on the raw WSGI layer currently because it converts bytes to unicode which in my opinion is just wrong. Regards, Armin From armin.ronacher at active-4.com Fri Sep 18 14:06:48 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Fri, 18 Sep 2009 14:06:48 +0200 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <4AB2634C.2070009@active-4.com> References: <4AB2634C.2070009@active-4.com> Message-ID: <4AB377D8.7070905@active-4.com> Hi, Let me backup a bit here. 
We have to focus on two different use cases for WSGI on Python 3. One is the application that should continue to work on Python 3, the other is the application that was designed for Python 3. In both cases let's just assume that this application is using WebOb/Werkzeug/Django or whatever library is in use.

2to3 converts "foo" and u"foo" to "foo". However in Python 3 "foo" is unicode, so that's fine if the library exposes unicode data only. This is the case for all the frameworks and libraries. Template engines, database adapters, frameworks, they all use unicode internally, which is great.

Whether the WSGI server or the library figures out charsets, the data forwarded to the application is always unicode. So what would we gain from doing the decoding in the server?

On the bright side, 2to3 would probably start working for some raw WSGI applications, but would still break many. On the other hand, the frameworks would still have to perform encoding detection for stuff like multipart or form encoded form data. Even worse: they would have to apply different decode rules for form data and stuff like path info.

It already caused confusion in the past that path info was unquoted, with many people quoting that value. It would be even worse in the future if path info was proper unicode, query string looked like unicode but was actually URL encoded data with a different encoding, etc. I can see some major confusion coming up there, and it would not remove any complexity for real-world implementations of WSGI.
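To make that asymmetry concrete, here is a small Python 3 sketch. It assumes, purely for illustration, a server that URL-decodes PATH_INFO and hands it over as latin-1 decoded text while leaving QUERY_STRING percent-encoded. The request values and variable names are made up.

```python
from urllib.parse import parse_qs, unquote_to_bytes

# Raw request target as sent by a hypothetical client (UTF-8, percent-encoded).
raw_path = '/caf%C3%A9/a%2Fb'
raw_query = 'name=caf%C3%A9'

# Assumed server behaviour: unquote the path, then decode the bytes as latin-1.
path_info = unquote_to_bytes(raw_path).decode('latin-1')
query_string = raw_query  # passed through still percent-encoded

# The framework has to undo the latin-1 step to recover the real text.
decoded_path = path_info.encode('latin-1').decode('utf-8')

# The query string instead needs unquoting first, then decoding.
decoded_query = parse_qs(query_string, encoding='utf-8')
```

Two values that both look like ordinary strings end up needing different treatment, and the `%2F` in the path has already collapsed into a plain `/` by the time the application sees it.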
Regards, Armin From ianb at colorstudy.com Fri Sep 18 19:02:14 2009 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 18 Sep 2009 12:02:14 -0500 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> Message-ID: On Fri, Sep 18, 2009 at 2:56 AM, Graham Dumpleton wrote: > As others have pointed out, the likes of rack and jack, not sure about > the new Perl variant, don't seem to have an issue with using unicode. I looked up Jack and Rack: http://jackjs.org/jsgi-spec.html and http://rack.rubyforge.org/doc/files/SPEC.html They don't have an issue with unicode because they don't mention it and don't specify anything at all. Basically they punt on the issue. In the specific case, most things in Javascript have to be unicode. The response body iterator must have items that respond to toByteString, which includes String and Binary. I'm assuming Strings always use UTF8 in Javascript, as JSON acts that way. jsgi.input is only specified as an "input stream", which is very unspecified. Especially since jsgi.errors is an "output stream", though presumably one should be binary and the other text. Ruby's unicode is kind of funny (as I understand it), in a way that might help them. Strings are stored as binary with an attached encoding. So there's no "unicode", only binary strings with encodings; so you can change the encoding, or transcoding happens implicitly when you combine strings from different encodings. So basically there's no mention of unicode because they've dodged that whole bullet. But it also seems to be unspecified what encoding might be attached to strings, if any at all. Another example, neither spec even indicates if SCRIPT_NAME/PATH_INFO are url-decoded (or that they aren't decoded). 
So, in summary: I don't see anything we can learn from these specs, and there's no reason we should feel like we've somehow been leapfrogged, instead these other specifications are underspecified. I also think on Web-SIG we are approaching this with more robust and general applications in mind than for Jack and Rack -- for instance, I would like WSGI to be a reasonable basis for an HTTP proxy, where you can't enforce UTF8-everywhere. If all we wanted for WSGI was to be a layer for serving monolithic applications then these issues wouldn't be so important. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker From renesd at gmail.com Fri Sep 18 20:44:45 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Fri, 18 Sep 2009 19:44:45 +0100 Subject: [Web-SIG] python3 wsgi. Re: WSGI 1 Changes [ianb's and my changes] Message-ID: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> On Fri, Sep 18, 2009 at 12:03 PM, Armin Ronacher wrote: >> It looks like python3 issues are being addressed in your changes anyway. > But it should be discussed separately and then be integrated. ?The > changes in the PEP currently reflect #1 of Graham's proposal. > yeah cool. Here's a new thread for the python3 related changes... Perhaps a good way to test that, is to make a smallish example wsgi program to port to python3, using the various proposals... or the proposal most liked. Then we could see how easy it would be to port to a given implementation that supports that proposal. I'm not sure which of the proposals Grahams mod_wsgi branch is for... or for the cherrypy branch... but those ones would be easier to test since they're already done. So is there a smallish tested wsgi example around to port? Maybe some of the cherrypy example programs would be a good one to do... if they aren't ported already (if they are ported already... great!) 
cheers, From armin.ronacher at active-4.com Fri Sep 18 20:51:59 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Fri, 18 Sep 2009 20:51:59 +0200 Subject: [Web-SIG] python3 wsgi. Re: WSGI 1 Changes [ianb's and my changes] In-Reply-To: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> References: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> Message-ID: <4AB3D6CF.1010008@active-4.com> Hi, Ren? Dudfield schrieb: > Perhaps a good way to test that, is to make a smallish example wsgi > program to port to python3, using the various proposals... or the > proposal most liked. Not a good idea. Because a small WSGI application directly on top of WSGI behaves completely different than a big WSGI application on top of an existing system. The interfaces the implementations (WebOb, Werkzeug, Django) expose would not change either way because they are already unicode aware. 2to3 would go the unicode way because that's what it was written for. But that is also the one that causes the most problems. > Then we could see how easy it would be to port to a given > implementation that supports that proposal. I'm not sure which of the > proposals Grahams mod_wsgi branch is for... or for the cherrypy > branch... but those ones would be easier to test since they're already > done. A WSGI Server that is byte only based on a simple one like wsgiref can be written in a couple of minutes. You just have to take the existing sources and make sure a b is in front of all strings that should be byte strings. Regards, Armin From pje at telecommunity.com Sat Sep 19 00:07:57 2009 From: pje at telecommunity.com (P.J. 
Eby)
Date: Fri, 18 Sep 2009 18:07:57 -0400
Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets
Message-ID: <20090918220753.82FDD3A4079@sparrow.telecommunity.com>

On his blog, Graham mentioned some skepticism about skipping WSGI 1.1 and going straight to 2.0, due to concern that people using write() would need to make major code changes to go to WSGI 2.0.

Now, if we ignore the part of the spec that says "New WSGI applications and frameworks *should not* use the write() callable if it is possible to avoid doing so," there does need to be some reasonable way for those people to make their apps work with the newer spec. On CPython at least, this can be implemented using greenlets, and on other Python implementations it could be done with threads.

Here's a quick and dirty, untested sketch (no error checking, no version handling) of how it could be done with greenlets:

    def two_to_one(app):
        def wrapper(environ):
            buffer = []
            header = []

            def start_response(status, headers):
                header.append(status)
                header.append(headers)
                return write

            def write(data):
                buffer.append(data)
                greenlet.getcurrent().parent.switch(None)

            child = greenlet(app)
            response = child.switch(environ, start_response)

            if not header:
                # XXX start_response wasn't called, error!

            if not buffer:
                # write wasn't called, so just pass it through
                return header[0], header[1], response

            def yield_all():
                response = None
                try:
                    while buffer:
                        yield buffer.pop(0)
                        response = child.switch()
                    # XXX check for response being non-empty
                finally:
                    if hasattr(response, 'close'):
                        response.close()

            return header[0], header[1], yield_all()

        return wrapper

As you can see, I've stuck in some XXX comments for where there should be more error checking or handling, and there would probably be some other additions as well. However, this adapter handles both write()-using and non-write()-using WSGI 1 apps, and converts them to the WSGI 2 calling convention, by making the write() function call perform a non-local return from the application.
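
For Python implementations without greenlets, the thread-based variant mentioned above could look roughly like the sketch below. It is an illustration only, not code from this thread: the names two_to_one_threaded and _DONE are made up, it starts one thread per request, uses the Python 3 spelling of the queue module, and does no error propagation or interruption handling. The queue is unbounded, so chunks are handed over as they are produced but the application is never blocked by a slow consumer.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the app's output


def two_to_one_threaded(app):
    """Wrap a WSGI 1 app so it can be called as app(environ) -> (status, headers, body)."""
    def wrapper(environ):
        chunks = queue.Queue()
        header = []
        started = threading.Event()

        def write(data):
            # Data passed to write() goes straight onto the queue.
            chunks.put(data)

        def start_response(status, headers, exc_info=None):
            header[:] = [status, headers]
            started.set()
            return write

        def run():
            response = None
            try:
                response = app(environ, start_response)
                for chunk in response:
                    chunks.put(chunk)
            finally:
                if hasattr(response, 'close'):
                    response.close()
                started.set()      # unblock the caller even if the app failed
                chunks.put(_DONE)

        threading.Thread(target=run, daemon=True).start()
        started.wait()
        if not header:
            raise RuntimeError("start_response was not called")

        def body():
            # Yield chunks in the order the app produced them.
            while True:
                chunk = chunks.get()
                if chunk is _DONE:
                    return
                yield chunk

        return header[0], header[1], body()
    return wrapper
```

A write()-using app wrapped this way returns its status and headers once start_response() has been called, and its body iterator yields the written chunks followed by whatever the app returned.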
Doing this with threads would be similar, but there are more design decisions to make, i.e., will you use a single worker thread that you send requests to, or just start a new thread for each request? In either case, the start_response() and write() in that thread would simply write data to a Queue.Queue that's read by the adapter. The code running in the other thread would handle closing the app's response (if need be), after piping all the app's output to the queue. You'd also need to decide if you're going to support interrupting the application (e.g. by returning an error from write(), or by calling throw() on a generator) if the wrapper is closed before its time. (Of course, none of these shenanigans are necessary for well-behaved apps and frameworks that don't use write(); the above adapter would lose its yield_all function and all the greenlet usage, substituting some error raise code for the body of write() in that case.) From ianb at colorstudy.com Sat Sep 19 01:58:22 2009 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 18 Sep 2009 18:58:22 -0500 Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets In-Reply-To: <20090918220753.82FDD3A4079@sparrow.telecommunity.com> References: <20090918220753.82FDD3A4079@sparrow.telecommunity.com> Message-ID: On Fri, Sep 18, 2009 at 5:07 PM, P.J. Eby wrote: > On his blog, Graham mentioned some skepticism about skipping WSGI 1.1 and > going straight to 2.0, due to concern that people using write() would need > to make major code changes to go to WSGI 2.0. I'm not entirely clear why this is such a big deal. 
Here's how I'd implement a WSGI 2 wrapper around a WSGI 1 app:

    import itertools

    def wsgi1to2(app):
        def new_app(environ):
            written = []
            status_headers = []

            def start_response(status, headers, exc_info=None):
                if exc_info is not None:
                    raise exc_info[0], exc_info[1], exc_info[2]
                status_headers[:] = [status, headers]
                return written.append

            app_iter = app(environ, start_response)
            if not status_headers:
                # start_response hasn't been called yet; pull the first
                # chunk so that the app gets a chance to call it
                app_iter = iter(app_iter)
                written.append(app_iter.next())
                assert status_headers
            if written:
                app_iter = itertools.chain(written, app_iter)
            return status_headers[0], status_headers[1], app_iter
        return new_app

What's wrong with this simpler approach to the conversion?

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From armin.ronacher at active-4.com Sat Sep 19 02:08:03 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 02:08:03 +0200
Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets
In-Reply-To: <20090918220753.82FDD3A4079@sparrow.telecommunity.com>
References: <20090918220753.82FDD3A4079@sparrow.telecommunity.com>
Message-ID: <4AB420E3.70502@active-4.com>

Hi,

P.J. Eby wrote:
> newer spec. On CPython at least, this can be implemented using
> greenlets, and on other Python implementations it could be done with
> threads. Here's a quick and dirty, untested sketch (no error
> checking, no version handling) of how it could be done with greenlets:

greenlets are one solution, but I don't think there are any applications out there using write() that are worth supporting in WSGI 2.0. Such applications should rather use an internal buffer and write to that.
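
The internal-buffer approach could be sketched as follows for a WSGI 1 application. This is a minimal illustration; the name buffered_app and the response text are invented for the example and do not come from any application discussed here.

```python
def buffered_app(environ, start_response):
    """Collect output in a local list instead of calling write()."""
    parts = []
    emit = parts.append          # stands in for the old write() calls

    emit(b"processing data\n")
    emit(b"data processed\n")

    body = b"".join(parts)
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    # The whole response is returned as a single chunk.
    return [body]
```

Nothing is sent until the application returns, so this only suits responses that do not need to be streamed.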
Regards,
Armin

From armin.ronacher at active-4.com Sat Sep 19 02:09:29 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 02:09:29 +0200
Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets
In-Reply-To:
References: <20090918220753.82FDD3A4079@sparrow.telecommunity.com>
Message-ID: <4AB42139.8020802@active-4.com>

Hi,

Ian Bicking wrote:
> What's wrong with this simpler approach to the conversion?

It buffers, you can no longer do this:

    request.write('processing data')
    request.flush()
    ...
    request.write('data processed')
    request.flush()

But that's not too common and people should rather rewrite their applications to use generators for these cases.

Regards,
Armin

From ianb at colorstudy.com Sat Sep 19 02:40:01 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Fri, 18 Sep 2009 19:40:01 -0500
Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets
In-Reply-To: <4AB42139.8020802@active-4.com>
References: <20090918220753.82FDD3A4079@sparrow.telecommunity.com> <4AB42139.8020802@active-4.com>
Message-ID:

On Fri, Sep 18, 2009 at 7:09 PM, Armin Ronacher wrote:
> Ian Bicking wrote:
>> What's wrong with this simpler approach to the conversion?
> It buffers, you can no longer do this:
>
>     request.write('processing data')
>     request.flush()
>     ...
>     request.write('data processed')
>     request.flush()
>
> But that's not too common and people should rather rewrite their
> applications to use generators for these cases.

Yes -- I don't think many (any?) people use this particular technique, though many people use the start_response writer simply because it was there and it seemed like a good idea. I even used it a few times because it was easier to code for some circumstances (e.g., paste.cgiapp) but not because I expected it would immediately be pushed to the client.
(appengine's webapp framework uses it a lot, not entirely sure why; not for streaming though -- maybe because it pushes the bytes out of the Python interpreter and into the parent process faster) So, I'm just saying we need to handle the start_response writer, because people have used it, but I'm not aware of people using it for its intended purpose. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker From pje at telecommunity.com Sat Sep 19 07:03:16 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 19 Sep 2009 01:03:16 -0400 Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets In-Reply-To: References: <20090918220753.82FDD3A4079@sparrow.telecommunity.com> Message-ID: <20090919050311.6DF723A403D@sparrow.telecommunity.com> At 06:58 PM 9/18/2009 -0500, Ian Bicking wrote: >What's wrong with this simpler approach to the conversion? It's not compliant with the WSGI 1 spec, which calls for write() to be unbuffered. On the one hand, you could say that anybody who gives a crap about the spec wouldn't use write() to begin with. But then, on the other, if we ignore the spec ourselves, we're hardly in a position to complain about their behavior. ;-) Anyway, Graham raised the difficulty of making a compliant adapter as an argument for having a WSGI 1.1 rather than jumping straight to 2.0, and I just wanted to show that it's not that difficult in principle to make a fully WSGI 1.0-compliant 2-to-1 adapter, at least if you cheat and use greenlets to handle the less well-behaved WSGI 1 apps. The hairiest bits of defining 2.0 have more to do with nailing down the whole bytes/unicode/native circus, the input stream API, etc... most of which I hope we can do in the errata for 1.0. From pje at telecommunity.com Sat Sep 19 07:06:13 2009 From: pje at telecommunity.com (P.J. 
Eby) Date: Sat, 19 Sep 2009 01:06:13 -0400 Subject: [Web-SIG] Sketching a WSGI 2-to-1 adapter with greenlets In-Reply-To: <4AB420E3.70502@active-4.com> References: <20090918220753.82FDD3A4079@sparrow.telecommunity.com> <4AB420E3.70502@active-4.com> Message-ID: <20090919050607.B7FD53A403D@sparrow.telecommunity.com> At 02:08 AM 9/19/2009 +0200, Armin Ronacher wrote: >greenlets are one solution, but I don't think there are any applications >out there using write() that are worth supporting in WSGI 2.0. Such >applications should rather use an internal buffer and write to that. If an internal buffer was suitable to their application, they shouldn't have been using write() in the first place; it would suffice to "return [buffer]". Unfortunately, many people seem to think that yield and write() are for returning pieces of a normal web page, rather than for doing server push and streaming files that are too big to go all-at-once. From pje at telecommunity.com Sat Sep 19 07:07:15 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 19 Sep 2009 01:07:15 -0400 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180306r231252ffs4ded0a82e7b516b4@mail.gmail.co m> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> <88e286470909180306r231252ffs4ded0a82e7b516b4@mail.gmail.com> Message-ID: <20090919050709.793253A403D@sparrow.telecommunity.com> At 08:06 PM 9/18/2009 +1000, Graham Dumpleton wrote: > > Sometimes clients calling the wsgi code will be > > buggy... and looking at the unquoted url is needed in those cases to > > work around buggy clients. > >Bugs in WSGI adapters aren't a good reason for why it is needed. If >the WSGI adapters are broken, fix the WSGI adapters. "client" = "HTTP client" = browser/web spider/other script. 
From mdipierro at cs.depaul.edu Sat Sep 19 10:37:22 2009 From: mdipierro at cs.depaul.edu (Massimo Di Pierro) Date: Sat, 19 Sep 2009 03:37:22 -0500 Subject: [Web-SIG] python3 wsgi. Re: WSGI 1 Changes [ianb's and my changes] In-Reply-To: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> References: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> Message-ID: <4586E9BC-2709-4697-A94B-CE5C787257C3@cs.depaul.edu> I liked your idea very much Rene' , so I made this: http://web2py.com/examples/static/sneaky.py and a Python 3.0 version: http://web2py.com/examples/static/sneaky3.py They both may need some testing more testing but I tried the former with web2py and it works well, including streaming. Massimo On Sep 18, 2009, at 1:44 PM, Ren? Dudfield wrote: > On Fri, Sep 18, 2009 at 12:03 PM, Armin Ronacher > wrote: >>> It looks like python3 issues are being addressed in your changes >>> anyway. >> But it should be discussed separately and then be integrated. The >> changes in the PEP currently reflect #1 of Graham's proposal. >> > > yeah cool. Here's a new thread for the python3 related changes... > > Perhaps a good way to test that, is to make a smallish example wsgi > program to port to python3, using the various proposals... or the > proposal most liked. > > Then we could see how easy it would be to port to a given > implementation that supports that proposal. I'm not sure which of the > proposals Grahams mod_wsgi branch is for... or for the cherrypy > branch... but those ones would be easier to test since they're already > done. > > So is there a smallish tested wsgi example around to port? Maybe some > of the cherrypy example programs would be a good one to do... if they > aren't ported already (if they are ported already... great!) 
> > > cheers, > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/mdipierro%40cti.depaul.edu From armin.ronacher at active-4.com Sat Sep 19 10:55:37 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sat, 19 Sep 2009 10:55:37 +0200 Subject: [Web-SIG] python3 wsgi. Re: WSGI 1 Changes [ianb's and my changes] In-Reply-To: <4586E9BC-2709-4697-A94B-CE5C787257C3@cs.depaul.edu> References: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> <4586E9BC-2709-4697-A94B-CE5C787257C3@cs.depaul.edu> Message-ID: <4AB49C89.5070608@active-4.com> Hi, Massimo Di Pierro schrieb: > I liked your idea very much Rene' , so I made this Can you please stop that before you do any more damage? Your code is not even anywhere close to what was discussed and has tons of errors and ugly bits and pieces in there. Again. An example does not bring us anything because we already know the implications of each proposal. Regards, Armin From renesd at gmail.com Sat Sep 19 11:10:15 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Sat, 19 Sep 2009 10:10:15 +0100 Subject: [Web-SIG] python3 wsgi. Re: WSGI 1 Changes [ianb's and my changes] In-Reply-To: <4AB49C89.5070608@active-4.com> References: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> <4586E9BC-2709-4697-A94B-CE5C787257C3@cs.depaul.edu> <4AB49C89.5070608@active-4.com> Message-ID: <64ddb72c0909190210i1cb71c77p78a650d7b960d51e@mail.gmail.com> On Sat, Sep 19, 2009 at 9:55 AM, Armin Ronacher wrote: > Hi, > > Massimo Di Pierro schrieb: >> I liked your idea very much Rene' , so I made this > Can you please stop that before you do any more damage? ?Your code is > not even anywhere close to what was discussed and has tons of errors and > ugly bits and pieces in there. > > Again. 
?An example does not bring us anything because we already know > the implications of each proposal. > Hi, I'm not sure 'we' in this case is correct. Not everyone understands *all* the implications of *all* the proposals... otherwise we would have already decided what to do. Well, at least I don't understand things yet... and am interested in knowing more, if you'd indulge. Concrete examples let people understand things more easily, and let us talk about specific things rather than abstractly. I think P.J Ebys sketch in the other thread is a good example of showing how things could work. cheers, From renesd at gmail.com Sat Sep 19 12:33:06 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Sat, 19 Sep 2009 11:33:06 +0100 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> References: <4AB2634C.2070009@active-4.com> <88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com> <64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com> <01AADF93-8D07-46E1-81EC-32B51657538A@couch.it> <88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com> <64ddb72c0909180445w788cce7eqbb5f12d893290b7d@mail.gmail.com> <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> Message-ID: <64ddb72c0909190333w15e26b59xde0f201fb15d5141@mail.gmail.com> On Fri, Sep 18, 2009 at 12:55 PM, Graham Dumpleton wrote: > What I would like to hear is PJE (who tends towards #3) and Robert > Brewer (who tends towards #4). Can you guys give counter explanations > as to why there arguments for bytes isn't valid. Ian, I don't think > you have yet expressed your leaning, but would like to here your point > as well. > > On top of the issues above, Armin believes 2to3 gives better results > where bytes everywhere interpretation is used. Has anyone else > actually tried 2to3 and have the experience with it? > > Graham Here's a small wsgi server converted in the other thread. 
I've also applied 2to3 to it so you can see what it does. Below are links to diffs as well. Note, this doesn't show the following things converted with 2to3: - wsgi application. - an application from a framework layered on top(eg cherrypy). - wsgi middleware. sneaky.py - original python 2.x wsgi 1.0 server. http://pastebin.com/f5c2cdd3b sneaky3.py - conversion done by hand. http://pastebin.com/f7ae33d81 sneaky3_from2to3.py - conversion from 2to3 (python 3.1 version of the script) http://pastebin.com/f62a7d83a (diffs for your comparison). sneaky_2to3.diff - a diff from sneaky.py and the 2to3 tool applied. http://pastebin.com/f6d0430fa sneaky_sneaky3.diff - a diff from sneaky.py and sneak3.py http://pastebin.com/f23cadbb0 cu, From armin.ronacher at active-4.com Sat Sep 19 12:40:48 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sat, 19 Sep 2009 12:40:48 +0200 Subject: [Web-SIG] Unicode in Python 3 Message-ID: <4AB4B530.7080000@active-4.com> Hi, I spent the last few hours now figuring out what decisions Python took in the standard library to get a better understanding of unicode in Python 3 and how it affects web applications. Let's sum up the current state of encodings in the web world: RFC 2616 specifies the header encoding as "latin1" (or iso-8859-1). The majority of header values is ASCII only, the only exception except for custom headers and stuff like the server name, are the cookie headers. Cookie headers are problematic for other reasons as well because some browsers (IE for example) have different ideas of cookies than others. I've seen many people using utf-8 encoded cookie values, so it's pretty common to have headers with values outside the latin1 range. However to remind everybody: latin1 can carry invalid encoded utf-8 without loss of precision if you do the encode/decode dance. For URIs/IRIs there is a bit of a problem as well. URLs are encodingless but limited to ASCII. 
Values outside of the ASCII range have to be %-encoded, but nowhere is the charset specified. Browsers changed the URL encoding behavior to utf-8 a few years ago (I think with Firefox 1.5 or Firefox 2, Mozilla changed it). They are still trying latin1 as well if they are totally clueless and get a 404 or something. I'm not exactly sure how that is supposed to work.

The new thing is IRIs. They can contain any non-ASCII characters and are considered to be UTF-8. It is possible to quote utf-8 encoded code points with %-encoding. IRIs might also contain unicode identifiers for the hostname; for URIs this appears to be idna/punycode encoded. E.g.:

    IRI: http://üser:pässword@☃.net/påth
    URI: http://%C3%BCser:p%C3%A4ssword@xn--n3h.net/p%C3%A5th

There are already Python implementations to convert between URIs and IRIs (for example in Werkzeug 0.6).

Form data: Form data is encoded by all browsers in the charset of the page that renders the form. However, for missing encoding declarations in the HTTP header, the browser runs a character set guessing algorithm. This algorithm is currently browser dependent but might be specified as part of HTML5. At least there is a section in the draft currently.

This is a lot of charsets. So for most applications the charsets look like this:

    page encoding:    utf-8
    headers:          invalid latin1 with utf-8 payload
    form submissions: utf-8
    urls:             utf-8

This is also the only configuration that looks reasonable; all the others fall back to utf-8 on modern browsers every once in a while (for example if an IRI is used in an HTML document on an external resource, the browser will try utf-8 for the URL, even if that URL is in fact latin1).

For Python 3, the standard library took the safe path and chose utf-8 as the standard encoding for URLs. The biggest grief I have with this is that URLs have to be 'str' in Python 3 (remember, that's unicode).
This works and is probably a step in a better direction, but I would welcome the addition of an IRI module and advertise the use of IRIs internally. (For the 'bytes' problems see further below.)

Another situation where the standard library decided to go with unicode instead of bytes is the HTTP server and clients. There Python assumes latin1 for headers (which is correct on paper).

Unfortunately that complicates things a lot. Graham is right about mentioning that operating on bytes in Python 3 is a lot harder than it was in Python 2. And I'm not even talking about the missing implicit conversion, but missing functionality on the bytes. Here are some common idioms found in low-level WSGI code that no longer work:

String formatting:

    >>> b"%d %s" % (200, "OK")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'

Integer to ASCII:

    >>> bytes(8)
    b'\x00\x00\x00\x00\x00\x00\x00\x00'
    >>> bytes(str(8))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: string argument without an encoding
    >>> str(8).encode("ascii")
    b'8'

urllib.parse appears to be buggy with bytestrings:

    >>> parse.quote_plus('föö'.encode('utf-8'))
    'f%C3%B6%C3%B6'
    >>> parse.unquote_plus('f%C3%B6%C3%B6')
    'föö'
    >>> parse.unquote_plus(b'f%C3%B6%C3%B6')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python31\lib\urllib\parse.py", line 404, in unquote_plus
        string = string.replace('+', ' ')
    TypeError: expected an object with the buffer interface

I'm pretty sure the latter is a bug and I will file one, however if there is broken behavior with bytestrings in Python 3.1 that's another thing we have to keep in mind.

Form data handling in Python 3 based on cgi.FieldStorage currently also assumes unicode strings and from what I've read so far, it doesn't work in Python 3.1, but I have not confirmed that.
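
Two of the points above can be illustrated with a short Python 3 sketch: formatting as str and encoding at the edge instead of formatting bytes directly, and the latin1 "encode/decode dance" that carries utf-8 bytes through a unicode string without loss. The helper name status_line is made up for this example, and the approach is only one possible workaround, not something proposed in the thread.

```python
def status_line(code, reason):
    """Format as text first, then encode; avoids %-formatting on bytes."""
    return ("%d %s" % (code, reason)).encode("ascii")


# utf-8 bytes as they might arrive in a header or URL
raw = "föö".encode("utf-8")

# Decoding as latin1 never fails and maps each byte to one code point ...
as_native = raw.decode("latin-1")

# ... so encoding back as latin1 recovers the original bytes exactly,
round_tripped = as_native.encode("latin-1")

# and the application can then apply the encoding it actually expects.
text = round_tripped.decode("utf-8")
```

The intermediate string as_native is not meaningful text; it is only a lossless container for the bytes.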
In my opinion it was a mistake to force the unicode behavior on these parts in the standard library, but now it happened and that affects the WSGI specification as well now. Based on what I've read in the code so far, I'm pretty sure we have to find some statistics about how many non utf-8 applications still exist in the wild and if we have use cases where the raw bytes are necessary.

Unfortunately the bytes approach does not sound that easy to implement any more, based on the fact that the standard library no longer supports bytes for many lower level operations and that the bytes object does not provide any sort of string formatting.

However, that does not make the unicode approach any less evil. Unless we have found a way that properly supports unicode in a way that we're not losing information and that makes ports of applications possible I'm strongly against it.

Regards,
Armin

From armin.ronacher at active-4.com Sat Sep 19 12:59:49 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 12:59:49 +0200
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <4AB4B530.7080000@active-4.com>
References: <4AB4B530.7080000@active-4.com>
Message-ID: <4AB4B9A5.102@active-4.com>

Hi,

Armin Ronacher wrote:
> urllib.parse appears to be buggy with bytestrings:
>
> I'm pretty sure the latter is a bug and I will file one, however if
> there is broken behavior with bytestrings in Python 3.1 that's another
> thing we have to keep in mind.

I have to correct myself, there are separate functions for byte quoting. (parse.unquote_to_bytes, parse.quote_from_bytes).
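
A short sketch of those two functions, as they exist in Python 3's urllib.parse. The sample value is the same 'föö' used earlier; the variable names are only for illustration.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = "föö".encode("utf-8")

# bytes in, ASCII-only str out
quoted = quote_from_bytes(raw)

# str (or bytes) in, bytes out; no charset is assumed for the result
restored = unquote_to_bytes(quoted)
```

Unlike quote_plus/unquote_plus, these two do not translate between '+' and spaces.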
Regards,
Armin

From armin.ronacher at active-4.com Sat Sep 19 13:11:44 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 13:11:44 +0200
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <4AB4B530.7080000@active-4.com>
References: <4AB4B530.7080000@active-4.com>
Message-ID: <4AB4BC70.6020105@active-4.com>

Hi,

Another observation from the HTTP server that comes with the Python 3 Standard Library: it does not support non-ASCII headers:

    def send_header(self, keyword, value):
        """Send a MIME header."""
        if self.request_version != 'HTTP/0.9':
            self.wfile.write(("%s: %s\r\n" % (keyword, value)).encode('ASCII', 'strict'))

Here is an implementation that shows how ridiculous a byte based implementation on top of BaseHTTPServer currently is: http://paste.pocoo.org/show/140501/

Regards,
Armin

From armin.ronacher at active-4.com Sat Sep 19 13:12:12 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 13:12:12 +0200
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <4AB4B9A5.102@active-4.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
Message-ID: <4AB4BC8C.70709@active-4.com>

Hi,

Armin Ronacher wrote:
> I have to correct myself, there are separate functions for byte quoting.
> (parse.unquote_to_bytes, parse.quote_from_bytes).

However, urlencode and urldecode are string only.
Regards, Armin From armin.ronacher at active-4.com Sat Sep 19 13:14:15 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sat, 19 Sep 2009 13:14:15 +0200 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <4AB4BC70.6020105@active-4.com> References: <4AB4B530.7080000@active-4.com> <4AB4BC70.6020105@active-4.com> Message-ID: <4AB4BD07.4090202@active-4.com> Hi, Armin Ronacher schrieb: > http://paste.pocoo.org/show/140501/ Corrected version without Werkzeug leftovers: http://paste.pocoo.org/show/140502/ Regards, Armin From renesd at gmail.com Sat Sep 19 14:26:13 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Sat, 19 Sep 2009 13:26:13 +0100 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <4AB4B9A5.102@active-4.com> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> Message-ID: <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> On Sat, Sep 19, 2009 at 11:59 AM, Armin Ronacher wrote: > Hi, > > Armin Ronacher schrieb: >> urllib.parse appears to be buggy with bytestrings: >> >> I'm pretty sure the latter is a bug and I will file one, however if >> there is broken behavior with bytestrings in Python 3.1 that's another >> thing we have to keep in mind. > I have to correct myself, there are separate functions for byte quoting. > (parse.unquote_to_bytes, parse.quote_from_bytes). > > Hi, I think that shows that they are being handled differently depending on type. Which is against polymorphism... but some people prefer to have separate functions for different types(in and out). I don't think other python functions do this though. So maybe this is a one off, and could be considered a bug... I'm not sure why they did it this way. 
Here is a snippet from the compat.py we used to port pygame to support python2.3 through 3.1:

    try:
        unicode_ = unicode
    except NameError:
        unicode_ = str

You can see that this then allows you to do this:

    >>> print( unicode_(b'sdf %s %s') % ('sdf', 'ef'))
    b'sdf sdf ef'
    >>> ord(unicode_('ÿ'))
    255

This allows your code to have (somewhat) the same behavior for unicode on both 2.x and 3.x. Using b'' in your code makes it impossible to share the same code base with 2.x and 3.x.

In summary of the arguments (please add if I've missed something):

Arguments against using bytes (and using unicode instead).
==============================================

So I'm -1 on using b'' all over the place since it's not in both versions of python, and makes it impossible for code bases to share the same code for multiple versions of python.

Armin's code example shows how ugly it is to convert code with b'' all over the place, and how it doesn't support many operations that strings do in python2.x. -1 for that reason. Also I think the sneaky version shows the same thing with regards to b''.

Since 2to3 also uses unicode instead of bytes I'm -1 on using b''.

The python standard library also uses unicode in its API as Armin has shown, and not bytes. So another reason for -1 on b''.

Argument for using bytes:
====================

socket methods return bytes in py3k... Well, they do with recvfrom etc... but not recvfrom_into. recvfrom_into and friends put the bytes into a given buffer. ((As an off topic, we should be designing for these functions as they allow a zero-copy, and zero-memory-allocation method of web server creation in python.))

'socket.recvfrom_into(buffer[, nbytes[, flags]])' this is new from python2.5.

A work around - and suggested solution.
===============================

Use unicode by default, but make another key available with raw data.

So to work around the problem of (rarely/occasionally) needing the raw bytes why don't we just have raw buffer keys in the environ?
This solves the case where it is needed in rare situations, and also makes the common situation (using correctly decoded unicode strings) possible? I would suggest not using bytes as the raw key, but instead a raw `buffer` object. This makes it possible to use the zero-copy, zero-memory-allocation methods. array.array is suitable here, more suitable than python3 bytes - since it is supported in older versions of python as well. Or other forms of buffer should be usable too... eg, an mmap, or a special apache buffer type, or numpy array, pygame surface buffer, PIL buffer etc. This solution optimises for: - compatibility with older pythons when using the same code base. - compatibility with older wsgi applications. - and also with the 2to3 tool trans - ease of use in the most common cases. - similarity to other python API web stuff using unicode in python 3.1. - similarity to higher level frameworks like django, webobj etc that expose unicode. - possibility to access raw data when needed (in rare situations) - possibility to write more performant code if required (with new functions introduced since python2.3 and wsgi 1.0 were introduced). From armin.ronacher at active-4.com Sat Sep 19 14:34:06 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sat, 19 Sep 2009 14:34:06 +0200 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> Message-ID: <4AB4CFBE.70604@active-4.com> Hi, Ren? Dudfield schrieb: > I think that shows that they are being handled differently depending > on type. Which is against polymorphism... but some people prefer to > have separate functions for different types(in and out). I don't > think other python functions do this though. So maybe this is a one > off, and could be considered a bug... I'm not sure why they did it > this way. 
The fact that urldecode and urlparse do not provide a byte-only
implementation is something I would consider a bug. After all that
module is called "urlparse" and not "iriparse".

> Here is a snippet from the compat.py we used to port pygame to support
> python2.3 through 3.1
How is that related?

> Arguments against using bytes (and using unicode instead).
>
> So I'm -1 on using b'' all over the place since it's not in both
> versions of python, and makes it impossible for code bases to share
> the same code for multiple versions of python.
That would not matter much because the high-level applications never see
what's under the hood. Besides web2py all frameworks and libraries I
know about are using unicode internally anyways.

> Argument for using bytes:
There are many more. It's supposed to be byte based everywhere because
that's how these protocols work. There is no magic unicode layer in
HTTP that solves all of our problems.

- URLs are byte based, URLs are untrusted
- WSGI 1.0 was byte based, API wise that means the smallest change
- Frameworks don't have to be totally rewritten because they already
  have their own unicode conversion functions.
- Except the application, nothing knows about the real encoding
  information.

Graham's suggestion for URL encodings means that the URL encoding would
have to be passed to the WSGI server from outside (he proposed the
apache config as an example). This means that the application behavior
will change based on the server configuration, causing even more confusion.

Let us ignore 2to3 and syntax problems for a minute. These are a lot
less complex than the actual encoding problems. Also it is very, very
unlikely that applications will be able to go through 2to3 and continue
to work because there is just too much stuff that changes. b'' vs '' is
really the smallest issue we have with WSGI currently. The changed
behavior of the bytes object and a semi-unicode aware standard library
are the biggest problems in my opinion.
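To make the point about the changed behavior of the bytes object a little more concrete, here is a minimal sketch, assuming Python 3. The name raw_path and its value are made up for illustration and are not taken from any of the code discussed in the thread; the sketch only shows a few of the differences a byte based WSGI application would run into.

```python
# Assumes Python 3; raw_path is a hypothetical example value.
raw_path = b'/caf%C3%A9'

# Indexing bytes yields an int in Python 3 (in Python 2 it was a 1-char str).
first = raw_path[0]

# Slicing still yields bytes.
first_slice = raw_path[:1]

# bytes and str can no longer be mixed implicitly.
try:
    raw_path + '/index'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Conversion has to be spelled out with an explicit encoding.
text_path = raw_path.decode('ascii')
```

Code written against Python 2 str semantics tends to trip over exactly these spots, which is independent of whether the literals are spelled b'' or ''.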
Regards,
Armin

From graham.dumpleton at gmail.com  Sat Sep 19 14:54:10 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Sat, 19 Sep 2009 22:54:10 +1000
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <4AB4CFBE.70604@active-4.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
	<64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com>
	<4AB4CFBE.70604@active-4.com>
Message-ID: <88e286470909190554q3c9228edx5e51e9b2c14bcc55@mail.gmail.com>

2009/9/19 Armin Ronacher :
> Graham's suggestion for URL encodings means that the URL encoding would
> have to be passed to the WSGI server from outside (he proposed the
> apache config as an example). This means that the application behavior
> will change based on the server configuration, causing even more confusion.

No it doesn't and you could still have things work without needing to
override the default encodings applied.

The default rule inside of the WSGI adapter would be:

  try:
    script_name = raw_script_name.decode('utf-8')
    path_info = raw_path_info.decode('utf-8')
    query_string = raw_query_string.decode('utf-8')
    uri_encoding = 'utf-8'
  except:
    script_name = raw_script_name.decode('iso-8859-1')
    path_info = raw_path_info.decode('iso-8859-1')
    query_string = raw_query_string.decode('iso-8859-1')
    uri_encoding = 'iso-8859-1'
  finally:
    environ['SCRIPT_NAME'] = script_name
    environ['PATH_INFO'] = path_info
    environ['QUERY_STRING'] = query_string
    environ['wsgi.uri_encoding'] = uri_encoding

At the WSGI application level, if it provides for use of an alternate
URI encoding, I saw that all it would need to do (ignoring encoding
name equivalence issues for now) is:

  if application_uri_encoding != environ['wsgi.uri_encoding']:
    raw_script_name = environ['SCRIPT_NAME'].encode(environ['wsgi.uri_encoding'])
    raw_path_info = environ['PATH_INFO'].encode(environ['wsgi.uri_encoding'])
    raw_query_string = environ['QUERY_STRING'].encode(environ['wsgi.uri_encoding'])

    script_name = raw_script_name.decode(application_uri_encoding)
    path_info = raw_path_info.decode(application_uri_encoding)
    query_string = raw_query_string.decode(application_uri_encoding)

  else:
    script_name = environ['SCRIPT_NAME']
    path_info = environ['PATH_INFO']
    query_string = environ['QUERY_STRING']

So, no strict need to make the WSGI adapter do it differently. You may
want to only do that if concerned about overhead of transcoding.

Transcoding just these is most probably going to be less overhead than
the WSGI adapter having to set up both unicode and raw values in a
dictionary for everything.

Even with your iso-8859-4 example, I can't see how you can, without
knowing, lose what the original characters are, as wsgi.uri_encoding
being provided always allows you to transcode to what you needed it to
be when what was supplied didn't match.

As to the separate argument about repeating slashes and percent
encoding of slashes and losing the distinction, the definition using
wsgi.uri_encoding also provided REQUEST_URI as bytes anyway, so you
can get it directly from that, as you wanted in the bytes everywhere
solution anyway.

Now you can go back to monologue, as definitely sleeping now. ;-)

Graham

From renesd at gmail.com  Sat Sep 19 15:10:29 2009
From: renesd at gmail.com (René Dudfield)
Date: Sat, 19 Sep 2009 14:10:29 +0100
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <4AB4CFBE.70604@active-4.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
	<64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com>
	<4AB4CFBE.70604@active-4.com>
Message-ID: <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com>

On Sat, Sep 19, 2009 at 1:34 PM, Armin Ronacher wrote:
> Hi,
>
> René Dudfield schrieb:
>> I think that shows that they are being handled differently depending
>> on type. Which is against polymorphism... but some people prefer to
>> have separate functions for different types(in and out).
?I don't >> think other python functions do this though. ?So maybe this is a one >> off, and could be considered a bug... I'm not sure why they did it >> this way. > The fact that urldecode and urlparse does not provide a byte-only > implementation is something I would consider a bug. ?After all that > module is called "urlparse" and not "iriparse". > I think they should work on buffers too. Since that's one of the types sockets support. >> Here is a snippet from the compat.py we used to port pygame to support >> python2.3 through 3.1 > How is that related? > Rather than using a 2to3 tool - which then makes you have two versions of your code, making the code work in python 2.x and 3.x. 2to3 outputs python2.x incompatible code - when it doesn't have to. >> Arguments against using bytes (and using unicode instead). >> >> So I'm -1 on using b'' all over the place since it's not in both >> versions of python, and makes it impossible for code bases to share >> the same code for multiple versions of python. > That would not matter much because the high-level applications never see > what's under the hood. ?Besides web2py all frameworks and libraries I > know about are using unicode internally anyways. > It would mean code bases need to support b'' - which is not compatible with python2. This makes it harder to port, as it restricts people to having separate code bases for each language. This is not possible for some code bases since it double the maintenance burden. Convincing people to port to python3 is already hard enough. >> Argument for using bytes: > There are many more. ?It's suppose to be byte based everywhere because > that's how these protocols work. ?There is no magic unicode layer in > HTTP that solves all of our problems. > > - URLs are byte based, URLs are untrusted > - WSGI 1.0 was byte based, API wise that means the smallest change > - Frameworks don't have to be totally rewritten because they already > ?have their own unicode conversion functions. 
> - Except the application, nothing knows about the real encoding
>   information.

I'm advocating having two keys... one unicode and a raw buffer version of keys.
- unicode because everyone is using unicode these days anyway (the
  web browsers, and most upper layer frameworks)
- buffer for raw data as you need it sometimes and writing performant
  wsgi apps becomes a lot more possible. This raw buffer can be marked
  with any relevant encoding if needed (eg, what the browser suggests
  it is, and what the server suggests it is).

>
> Graham's suggestion for URL encodings means that the URL encoding would
> have to be passed to the WSGI server from outside (he proposed the
> apache config as an example). This means that the application behavior
> will change based on the server configuration, causing even more confusion.
>

I'm not sure what this particular suggestion is? Having wsgi apps
behave the same with different servers is one of its main points - so
if that's the case that's not a good idea.

> Let us ignore 2to3 and syntax problem for a minute. These are a lot
> less complex than the actual encoding problems. Also it is very, very
> unlikely that applications will be able to go through 2to3 and continue
> to work because there is just too much stuff that changes. b'' vs '' is
> really the smallest issue we have with WSGI currently. Change behavior
> of the bytes object and a semi-unicode aware standard library are the
> biggest problems in my opinion.
>

Well, this thread is about python3 issues. I think there's enough
people who want to consider the python3 issues to not ignore it.
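The two-key idea described earlier in this message (a decoded unicode value plus a raw buffer) could look roughly like the sketch below. It assumes Python 3, and both the helper name add_path_keys and the key name 'wsgi.raw.PATH_INFO' are hypothetical, invented here for illustration; neither appears in any WSGI specification or in the thread.

```python
import array

def add_path_keys(environ, raw_path_info, encoding='utf-8'):
    """Store PATH_INFO twice: decoded text and an untouched raw buffer.

    'wsgi.raw.PATH_INFO' is a made-up key name used only for this sketch.
    """
    environ['PATH_INFO'] = raw_path_info.decode(encoding)
    # array.array('B') is available on old and new Pythons and supports
    # the buffer protocol, which is why it was suggested over bytes.
    environ['wsgi.raw.PATH_INFO'] = array.array('B', raw_path_info)
    return environ

environ = add_path_keys({}, b'/caf\xc3\xa9')
decoded = environ['PATH_INFO']
raw_again = environ['wsgi.raw.PATH_INFO'].tobytes()
```

An application that only needs text reads PATH_INFO; one that has to re-decode or inspect the original bytes reads the raw key and ignores the decoded value.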
From g.brandl at gmx.net Sat Sep 19 15:26:39 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 19 Sep 2009 15:26:39 +0200 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> Message-ID: Ren? Dudfield schrieb: >>> Here is a snippet from the compat.py we used to port pygame to support >>> python2.3 through 3.1 >> How is that related? >> > > Rather than using a 2to3 tool - which then makes you have two versions > of your code, making the code work in python 2.x and 3.x. 2to3 > outputs python2.x incompatible code - when it doesn't have to. Sorry, but I think you do not express the intent of 2to3 correctly here. It is not meant to provide a one-time conversion, so that you then have to maintain two codebases, it is meant to be run over your 2.x code every time you want to distribute a version for Python 3, or even transparently in the distutils build process. This of course means that the 2.x code needs to be written with 3.x and the conversion in mind. Writing code that runs unchanged on 2.x (where x < 6) and 3.x may seem nice, but forces you to do unnecessary workarounds, e.g. in exception handlers. >>> Arguments against using bytes (and using unicode instead). >>> >>> So I'm -1 on using b'' all over the place since it's not in both >>> versions of python, and makes it impossible for code bases to share >>> the same code for multiple versions of python. >> That would not matter much because the high-level applications never see >> what's under the hood. Besides web2py all frameworks and libraries I >> know about are using unicode internally anyways. >> > > It would mean code bases need to support b'' - which is not compatible > with python2. b'' is supported as of Python 2.6. 
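As a small illustration of that last point, the following sketch uses a bytes literal in a way that should run unchanged on Python 2.6+ and on 3.x. The request line is an arbitrary example value, not something from the thread.

```python
# b'' literals are accepted from Python 2.6 on; there 'bytes' is simply
# an alias for str, while on 3.x it is the real bytes type.
request_line = b'GET /index.html HTTP/1.0'

is_bytes = isinstance(request_line, bytes)

# split() with a bytes separator behaves the same way on both lines.
method = request_line.split(b' ')[0]
```

This does not help on 2.5 and earlier, where the b'' prefix is a syntax error.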
Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

From renesd at gmail.com  Sat Sep 19 15:51:06 2009
From: renesd at gmail.com (René Dudfield)
Date: Sat, 19 Sep 2009 14:51:06 +0100
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <88e286470909190554q3c9228edx5e51e9b2c14bcc55@mail.gmail.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
	<64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com>
	<4AB4CFBE.70604@active-4.com>
	<88e286470909190554q3c9228edx5e51e9b2c14bcc55@mail.gmail.com>
Message-ID: <64ddb72c0909190651r13609c39ha28356927240062e@mail.gmail.com>

On Sat, Sep 19, 2009 at 1:54 PM, Graham Dumpleton wrote:
> 2009/9/19 Armin Ronacher :
>> Graham's suggestion for URL encodings means that the URL encoding would
>> have to be passed to the WSGI server from outside (he proposed the
>> apache config as an example). This means that the application behavior
>> will change based on the server configuration, causing even more confusion.
>
> No it doesn't and you could still have things work without needing to
> override the default encodings applied.
>
> The default rule inside of the WSGI adapter would be:
>
>   try:
>     script_name = raw_script_name.decode('utf-8')
>     path_info = raw_path_info.decode('utf-8')
>     query_string = raw_query_string.decode('utf-8')
>     uri_encoding = 'utf-8'
>   except:
>     script_name = raw_script_name.decode('iso-8859-1')
>     path_info = raw_path_info.decode('iso-8859-1')
>     query_string = raw_query_string.decode('iso-8859-1')
>     uri_encoding = 'iso-8859-1'
>   finally:
>     environ['SCRIPT_NAME'] = script_name
>     environ['PATH_INFO'] = path_info
>     environ['QUERY_STRING'] = query_string
>     environ['wsgi.uri_encoding'] = uri_encoding
>
> At the WSGI application level, if it provides for use of an alternate
> URI encoding, I saw that all it would need to do (ignoring encoding
> name equivalence issues for now) is:
>
>   if application_uri_encoding != environ['wsgi.uri_encoding']:
>     raw_script_name =
> environ['SCRIPT_NAME'].encode(environ['wsgi.uri_encoding'])
>     raw_path_info = environ['PATH_INFO'].encode(environ['wsgi.uri_encoding'])
>     raw_query_string =
> environ['QUERY_STRING'].encode(environ['wsgi.uri_encoding'])
>
>     script_name = raw_script_name.decode(application_uri_encoding)
>     path_info = raw_path_info.decode(application_uri_encoding)
>     query_string = raw_query_string.decode(application_uri_encoding)
>
>   else:
>     script_name = environ['SCRIPT_NAME']
>     path_info = environ['PATH_INFO']
>     query_string = environ['QUERY_STRING']
>
> So, no strict need to make the WSGI adapter do it differently. You may
> want to only do that if concerned about overhead of transcoding.
>
> Transcoding just these is most probably going to be less overhead than
> the WSGI adapter having to set up both unicode and raw values in a
> dictionary for everything.
>

Can these be lazily transcoded? I think they can if they are turned
into callables. Since the environ has to be of a dict type, and not
some other type (unless that design should also be changed).

So the current ones stay as is... to reflect current usage, and new
ones use callables. The callables return the type you ask for. This
way it's possible to not do any encoding/decoding unless needed, and
only when needed.

For applications using the new way:

# we can pass in the encoding we want.
script_name = environ['SCRIPT_NAME_'](application_uri_encoding)
script_name_utf8 = environ['SCRIPT_NAME_']('utf-8')
script_name_iso_8859_1 = environ['SCRIPT_NAME_']('iso-8859-1')

# we can get it as a buffer.
script_name_buffer = environ['SCRIPT_NAME_'](as_buffer = True)

# we can get it as whatever the raw native type is.
script_name_native = environ['SCRIPT_NAME_'](native_type = True)

# here we get the default encoding and type - which could be unicode or bytes.
script_name_default_type = environ['SCRIPT_NAME_']()

For servers:

Servers store just the native raw version in the environ (as buffer,
or whatever their native type and encoding is), and callables to do
any transcoding as needed. If the application does not use it, then
the server doesn't use any resources transcoding.

From armin.ronacher at active-4.com  Sat Sep 19 15:56:21 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 15:56:21 +0200
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <88e286470909190554q3c9228edx5e51e9b2c14bcc55@mail.gmail.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
	<64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com>
	<4AB4CFBE.70604@active-4.com>
	<88e286470909190554q3c9228edx5e51e9b2c14bcc55@mail.gmail.com>
Message-ID: <4AB4E305.7090504@active-4.com>

Hi,

Graham Dumpleton schrieb:
> So, no strict need to make the WSGI adapter do it differently. You may
> want to only do that if concerned about overhead of transcoding.
>
> Transcoding just these is most probably going to be less overhead than
> the WSGI adapter having to set up both unicode and raw values in a
> dictionary for everything.
So if I understand you correctly the wsgi.uri_encoding would be used
*only* as information about what the URI encoding was, the application
however should use the internal encoding it wants? That sounds right,
but then let's make that should a MUST.

Your query_string example is flawed as the query string is always quoted
and encoding/decoding an ASCII only string will not change much if the
encoding is a superset of ASCII which is required anyways for various
reasons.
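The scheme under discussion here can be sketched as follows, assuming Python 3. The helpers decode_uri_part and transcode are hypothetical names made up for this illustration, and the sketch decodes one value at a time, whereas the rule quoted above picks a single encoding for SCRIPT_NAME, PATH_INFO and QUERY_STRING together.

```python
def decode_uri_part(raw):
    """Decode as UTF-8, falling back to ISO-8859-1 (which never fails)."""
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1'), 'iso-8859-1'

def transcode(value, uri_encoding, app_encoding):
    """Re-interpret a server-decoded value in the application's encoding."""
    if uri_encoding == app_encoding:
        return value
    return value.encode(uri_encoding).decode(app_encoding)

# A path sent in ISO-8859-7: byte 0xE1 is GREEK SMALL LETTER ALPHA there.
raw_path = b'/\xe1'
path, path_enc = decode_uri_part(raw_path)      # invalid UTF-8, so latin-1
app_path = transcode(path, path_enc, 'iso-8859-7')

# A percent-quoted query string is pure ASCII, so transcoding changes nothing.
raw_query = b'q=%CE%B1'
query, query_enc = decode_uri_part(raw_query)
same_query = transcode(query, query_enc, 'iso-8859-1')
```

Because every byte maps to exactly one character in ISO-8859-1, encoding the fallback value back to ISO-8859-1 recovers the original bytes, which is what lets the application re-decode them with its own encoding.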
I would go with this wording for the spec then:

  wsgi.uri_encoding holds the encoding of the URI that was used to
  decode the SCRIPT_NAME and PATH_INFO. If the application decodes
  the query string it MUST obey the encoding here. If REQUEST_URI
  is available, the server will use the URI encoding to decode this
  value as well. However for encoding of URIs it MUST NOT use the
  wsgi.uri_encoding information but MUST use UTF-8 to encode the URI.

  Backwards compatibility for URIs: If the application depends on
  non UTF-8 URIs and the fallback encoding is NOT latin1 the
  application will have to check the wsgi.uri_encoding for latin1
  and if it detects it, it has to encode back to latin1 and decode
  from the fallback encoding (eg: iso-8859-7). WSGI 2.0 however
  requires the application to use UTF-8 for generated URIs.

I checked the browser implementations now and for arbitrary URIs (not
generated URIs in a page) the browser will always try UTF-8. RFC 3987
also recommends UTF-8 for URIs.

> Even with your iso-8859-4 example, can't see how you can without
> knowing lose what original characters are, as wsgi.uri_encoding being
> provided always allows you to transcode to what you needed it to be
> when what was supplied didn't match.
Assuming the only possible values for wsgi.uri_encoding are
latin1/iso-8859-1 and utf-8 when the application is invoked, I'm totally
fine with that. Because if the application's fallback URI encoding is
something like iso-8859-4, the application can itself check for latin1
and reencode the data. I could live with that.

What I don't want to see in WSGI is that the fallback encoding (latin1)
could be changed in the server configuration.

> Now you can go back to monologue, as definitely sleeping now.
;-)

\o/

Regards,
Armin

From armin.ronacher at active-4.com  Sat Sep 19 16:00:49 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 19 Sep 2009 16:00:49 +0200
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
	<64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com>
	<4AB4CFBE.70604@active-4.com>
	<64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com>
Message-ID: <4AB4E411.5000106@active-4.com>

Hi,

René Dudfield schrieb:
> Rather than using a 2to3 tool - which then makes you have two versions
> of your code, making the code work in python 2.x and 3.x. 2to3
> outputs python2.x incompatible code - when it doesn't have to.
2to3 is intended to be run automatically for each release. You would
not maintain two versions.

> It would mean code bases need to support b'' - which is not compatible
> with python2. This makes it harder to port, as it restricts people to
> having separate code bases for each language. This is not possible
> for some code bases since it double the maintenance burden.
> Convincing people to port to python3 is already hard enough.
Byte literals are available in Python 2.6. As far as I'm concerned I
don't see a real reason to port to Python 3 at the moment. We should
rather get our stuff ready so that once Python 2.6 is the standard the
porting becomes as simple as possible. Supporting Python 2.4, 2.5, 2.6
and 3.x is a very complex task that does not work for every library
(due to changed APIs for example).

> Well, this thread is about python3 issues. I think there's enough
> people who want to consider the python3 issues to not ignore it.
We cannot fight on too many fronts at the same time. This thread is
about unicode and encodings, not about Python 3 syntax. 2to3 tackles
the latter; if it does not work for you, consider writing about that to
the porting mailing list.
Regards, Armin From renesd at gmail.com Sat Sep 19 16:01:38 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Sat, 19 Sep 2009 15:01:38 +0100 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> Message-ID: <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> On Sat, Sep 19, 2009 at 2:26 PM, Georg Brandl wrote: > Ren? Dudfield schrieb: > >>>> Here is a snippet from the compat.py we used to port pygame to support >>>> python2.3 through 3.1 >>> How is that related? >>> >> >> Rather than using a 2to3 tool - which then makes you have two versions >> of your code, making the code work in python 2.x and 3.x. ?2to3 >> outputs python2.x incompatible code - when it doesn't have to. > > Sorry, but I think you do not express the intent of 2to3 correctly here. > It is not meant to provide a one-time conversion, so that you then > have to maintain two codebases, it is meant to be run over your 2.x code > every time you want to distribute a version for Python 3, or even > transparently in the distutils build process. ?This of course means that > the 2.x code needs to be written with 3.x and the conversion in mind. > My point is: using b'' stops those that choose to have one code base. Not everyone can use 2to3, but for those that can: great! There is no 2to3 for extension modules. There is no 2to3 distutils mod to run 2to3 automatically at this time(correct me if I'm wrong). People are creating separate branches for py3k... and those projects that do that seem to let the py3k version rot. You still need to debug, and support multiple versions of code... since 2to3 generates multiple versions. If someone sends you a patch for the 3.0 version you need to either reverse it yourself or find someone to do it for you... 
same thing with bug reports and tracebacks.

Those are some of the reasons why 2to3 is not ok for every project.

> Writing code that runs unchanged on 2.x (where x < 6) and 3.x may seem
> nice, but forces you to do unnecessary workarounds, e.g. in exception
> handlers.

Well, I'm sure there are cases where it would cause unnecessary
workarounds... however with the right compat.py and compat.h setup it
hasn't been too hard in my experience in porting this way.

There is an easy workaround for the exceptions changes...

# define geterror in your compatibility module.
import sys

def geterror():
    # return the exception currently being handled.
    return sys.exc_info()[1]

Now you can write:
    except ImportError:
        e = geterror()

Instead of these:
    #py2
    except ImportError, e:
        pass
    #py3k
    except ImportError as e:
        pass

>
>>>> Arguments against using bytes (and using unicode instead).
>>>>
>>>> So I'm -1 on using b'' all over the place since it's not in both
>>>> versions of python, and makes it impossible for code bases to share
>>>> the same code for multiple versions of python.
>>> That would not matter much because the high-level applications never see
>>> what's under the hood. Besides web2py all frameworks and libraries I
>>> know about are using unicode internally anyways.
>>>
>>
>> It would mean code bases need to support b'' - which is not compatible
>> with python2.
>
> b'' is supported as of Python 2.6.
>
> Georg

ah yes. I guess I meant the python2 series. Python2.5 is still the
most popular python... with 2.6 catching up (or passing it) in
popularity.

So that should be changed to:
"""It would mean code bases need to support b'' - which is not compatible
with <= python2.5.4"""

... anyway, just something to consider.

ps. my facebook account on this email address was just banned. I
swear I didn't rant about how tornado sucks!
From g.brandl at gmx.net Sat Sep 19 16:37:48 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 19 Sep 2009 16:37:48 +0200 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> Message-ID: Ren? Dudfield schrieb: > On Sat, Sep 19, 2009 at 2:26 PM, Georg Brandl wrote: >> Ren? Dudfield schrieb: >> >>>>> Here is a snippet from the compat.py we used to port pygame to support >>>>> python2.3 through 3.1 >>>> How is that related? >>>> >>> >>> Rather than using a 2to3 tool - which then makes you have two versions >>> of your code, making the code work in python 2.x and 3.x. 2to3 >>> outputs python2.x incompatible code - when it doesn't have to. >> >> Sorry, but I think you do not express the intent of 2to3 correctly here. >> It is not meant to provide a one-time conversion, so that you then >> have to maintain two codebases, it is meant to be run over your 2.x code >> every time you want to distribute a version for Python 3, or even >> transparently in the distutils build process. This of course means that >> the 2.x code needs to be written with 3.x and the conversion in mind. >> > > My point is: using b'' stops those that choose to have one code base. > Not everyone can use 2to3, but for those that can: great! > > There is no 2to3 for extension modules. Of course not, since it is for Python code. How you handle your C modules (whose API has not changed very dramatically) is not relevant to how you handle your Python modules. > There is no 2to3 distutils mod to run 2to3 automatically at this time > (correct me if I'm wrong). Yes, there is. It's what e.g. docutils and pygments are using. 
> People are creating separate branches for py3k... and those projects
> that do that seem to let the py3k version rot.

That's because they don't get enough advice about porting, despite there
being a mailing list, or frown upon using 2to3 for whatever reasons.

> You still need to debug, and support multiple versions of code...
> since 2to3 generates multiple versions.

This is what a test suite is made for.

> If someone sends you a patch for the 3.0 version
> you need to either reverse it yourself or find someone to do it for
> you... same thing with bug reports and tracebacks.

Yes, but the version that 2to3 generates is usually not far from the
original, so you should have no problem adapting the patch. When someone
sends you a patch for a single-source project, he will very probably not
know that it is a single-source project, and use either 2.x or 3.x
specific things in it, and you have to adapt it as well.

Certainly, maintaining a project for 2.x and 3.x compatibility *is* going
to be more work, no matter how you choose to do it. But saying that 2to3
is bad because you then have two source trees to maintain is plain wrong.

>> Writing code that runs unchanged on 2.x (where x < 6) and 3.x may seem
>> nice, but forces you to do unnecessary workarounds, e.g. in exception
>> handlers.
>
> Well, I'm sure there are cases where it would cause unnecessary
> workarounds... however with the right compat.py and compat.h setup it
> hasn't been too hard in my experience in porting this way.
>
> There is an easy workaround for the exceptions changes...
> # define geterror in your compatibility module.
> def geterror ():
>     return sys.exc_info()[1]
>
> Now you can write:
> except ImportError:
>     e = geterror()

This is exactly what I mean by unnecessary workarounds.

Anyway, this thread is not supposed to be about 2to3 or syntax differences
between 2.x and 3.x, so I'll stop here; more about this should go to
python-porting.
Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

From renesd at gmail.com  Sat Sep 19 18:00:04 2009
From: renesd at gmail.com (René Dudfield)
Date: Sat, 19 Sep 2009 17:00:04 +0100
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: 
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com>
	<64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com>
	<4AB4CFBE.70604@active-4.com>
	<64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com>
	<64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com>
Message-ID: <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com>

Hello again,

ok, getting back on topic... away from py3k porting methods...

Using an API where the user can request the type wanted solves a lot
of encoding issues. This is similar to Graham's suggestion, but
instead allowing a user to request which encoding they want, and also
get access to the raw data if needed.

What is proposed:
1. Default utf-8 to be used.
2. A buffer to be used for raw data.
3. New keys which are callables to request the encoding you want.
4. Encoding keys are specified.
  4.a URI encoding key 'wsgi.uri_encoding'
  4.b Form data encoding key 'wsgi.form_encoding'
  4.c Page encoding key 'wsgi.page_encoding'
  4.d Header encoding key 'wsgi.header_encoding'
5. For next version of wsgi (1.1 or 2.0), using an adapter for
backwards compat for wsgi 1.0 apps on wsgi2 server.

This allows or this is good because:
1. utf-8 is most common for frameworks and web browsers.
2.a Raw values to be accessed in the rare cases they are needed.
2.b More performant wsgi servers (zero-copy and zero-allocation become
possible with buffers)
2.c Avoiding bytes type and syntax for compatibility with <= python
2.5.4 (buffer, and unicode)
3. Transcoding to only happen if needed.
4. URI encoding can be explicitly stated in a URI key
5. Backwards compat for wsgi 1.0 apps on wsgi 2 server. Also wsgi 2.0
apps on wsgi 1.0 server with an adapter.

How applications use this proposal:

# here we get the default encoding and type - unicode utf-8, and it's urldecoded.
script_name_default_type = environ['SCRIPT_NAME']()

# we can pass in the encoding we want.
script_name = environ['SCRIPT_NAME'](application_uri_encoding)
script_name_utf8 = environ['SCRIPT_NAME']('utf-8')
script_name_iso_8859_1 = environ['SCRIPT_NAME']('iso-8859-1')

# we can get it as a buffer with raw bytes.
script_name_buffer = environ['SCRIPT_NAME'](as_buffer = True,
                                            no_urldecoding = True)

# we can get it as whatever the raw native type is.
script_name_native = environ['SCRIPT_NAME'](native_type = True,
                                            no_urldecoding = True)

For servers:

Servers store only the native raw version in the environ (as buffer,
or whatever their native type and encoding is), and callables to do
any transcoding as needed. If the application does not use it, then
the server doesn't use any resources transcoding or storing different
transcoded versions.

Adapters:

To make it easier for backwards compatibility wsgiref should have
adapters for old servers and clients.

For wsgi 1.0 apps on wsgi 2.0 servers:
An adapter would be written to return a wsgi1 key suitable environ.

For wsgi 1.0 servers running wsgi 2.0 apps:
An adapter should be available to let wsgi 2.0 apps run on wsgi 1.0 servers.

Issues with proposal? Things this proposal did not consider?

- maybe we could be explicit about what the http server, http client,
wsgi client, and application think the encodings are. This might
allow 'fail fast', and sanity checking so things aren't messed up
silently.
If the webserver, web client and application developer all specify what they are expecting... then checks could be done, otherwise if one of them can't specify for some reason, then it's the situation we are in now. Haven't thought this through much.

From MDiPierro at cs.depaul.edu Sat Sep 19 18:01:18 2009
From: MDiPierro at cs.depaul.edu (Massimo Di Pierro)
Date: Sat, 19 Sep 2009 11:01:18 -0500
Subject: [Web-SIG] python3 wsgi. Re: WSGI 1 Changes [ianb's and my changes]
In-Reply-To: <4AB49C89.5070608@active-4.com>
References: <64ddb72c0909181144h599aff1die0f56d113d2a01f4@mail.gmail.com> <4586E9BC-2709-4697-A94B-CE5C787257C3@cs.depaul.edu> <4AB49C89.5070608@active-4.com>
Message-ID: <3A243C67-6A86-4722-9116-36E75BCF801A@cs.depaul.edu>

Charming as ever. ;-)

The code had a typo, sorry, I fixed it.

http://web2py.com/examples/static/sneaky.py
http://web2py.com/examples/static/sneaky3.py

I am not making a statement about any of the proposals. I am just saying: here is a multithreaded web server that works with python 3.0 in less than 300 lines of code. It can be used for testing ideas.

I acknowledge you are the wsgi experts, that is why I sent it to you. If you have any specific critique that will make it better, please let me know, or just ignore me. It is a work in progress (its features and speed are discussed in the doc string) and I hope it can be useful.

Massimo

On Sep 19, 2009, at 3:55 AM, Armin Ronacher wrote:

> Hi,
>
> Massimo Di Pierro wrote:
>> I liked your idea very much Rene' , so I made this
> Can you please stop that before you do any more damage? Your code is
> not even anywhere close to what was discussed and has tons of errors and
> ugly bits and pieces in there.
>
> Again. An example does not bring us anything because we already know
> the implications of each proposal.
>
>
> Regards,
> Armin

From mdipierro at cs.depaul.edu Sat Sep 19 18:06:33 2009
From: mdipierro at cs.depaul.edu (Massimo Di Pierro)
Date: Sat, 19 Sep 2009 11:06:33 -0500
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com>
Message-ID: <2D30E9FA-7EC9-4E0A-B50D-DC54D38D83B4@cs.depaul.edu>

I agree I was forced to write two files. The problems were:

1) b'....' vs '....' (could be solved using eval('b"...."') if python3 else eval("....") but ugly)
2) try:...except Exception, e: vs try:...except Exception as e: (no way around it)
3) it would have required a lot of if statements to convert str<->bytes because bytes do have .find. (it is not a technical problem but makes code way less readable).

On Sep 19, 2009, at 9:01 AM, René Dudfield wrote:

> On Sat, Sep 19, 2009 at 2:26 PM, Georg Brandl wrote:
>> René Dudfield wrote:
>>
>>>>> Here is a snippet from the compat.py we used to port pygame to support
>>>>> python2.3 through 3.1
>>>> How is that related?
>>>>
>>>
>>> Rather than using a 2to3 tool - which then makes you have two versions
>>> of your code, making the code work in python 2.x and 3.x. 2to3
>>> outputs python2.x incompatible code - when it doesn't have to.
>>
>> Sorry, but I think you do not express the intent of 2to3 correctly here.
>> It is not meant to provide a one-time conversion, so that you then
>> have to maintain two codebases, it is meant to be run over your 2.x code
>> every time you want to distribute a version for Python 3, or even
>> transparently in the distutils build process. This of course means that
>> the 2.x code needs to be written with 3.x and the conversion in mind.
>>
>
> My point is: using b'' stops those that choose to have one code base.
> Not everyone can use 2to3, but for those that can: great!
>
> There is no 2to3 for extension modules. There is no 2to3 distutils
> mod to run 2to3 automatically at this time (correct me if I'm wrong).
> People are creating separate branches for py3k... and those projects
> that do that seem to let the py3k version rot. You still need to
> debug, and support multiple versions of code... since 2to3 generates
> multiple versions. If someone sends you a patch for the 3.0 version
> you need to either reverse it yourself or find someone to do it for
> you... same thing with bug reports and tracebacks.
>
> There's some points for why 2to3 is not ok for every project.
>
>> Writing code that runs unchanged on 2.x (where x < 6) and 3.x may seem
>> nice, but forces you to do unnecessary workarounds, e.g. in exception
>> handlers.
>
> Well, I'm sure there are cases where it would cause unnecessary
> workarounds... however with the right compat.py and compat.h setup it
> hasn't been too hard in my experience in porting this way.
>
> There is an easy workaround for the exceptions changes...
>
>     # define geterror in your compatibility module.
>     def geterror ():
>         return sys.exc_info()[1]
>
> Now you can write:
>
>     except ImportError:
>         e = geterror()
>
> Instead of these:
>
>     #py2
>     except ImportError, e:
>         pass
>     #py3k
>     except ImportError as e:
>         pass
>
>
>
>>
>>>>> Arguments against using bytes (and using unicode instead).
>>>>>
>>>>> So I'm -1 on using b'' all over the place since it's not in both
>>>>> versions of python, and makes it impossible for code bases to share
>>>>> the same code for multiple versions of python.
>>>> That would not matter much because the high-level applications never see
>>>> what's under the hood. Besides web2py all frameworks and libraries I
>>>> know about are using unicode internally anyways.
>>>> >>> >>> It would mean code bases need to support b'' - which is not >>> compatible >>> with python2. >> >> b'' is supported as of Python 2.6. >> >> Georg > > ah yes. I guess I meant the python2 series. Python2.5 is still the > most popular python... with 2.6 catching up(or passing it) in > popularity. So that should be changed to: > > """It would mean code bases need to support b'' - which is not > compatible with <= python2.5.4""" > > ... anyway, just something to consider. > > > ps. my facebook account on this email address was just banned. I > swear I didn't rant about how tornado sucks! > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/mdipierro%40cti.depaul.edu From pje at telecommunity.com Sat Sep 19 18:07:44 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 19 Sep 2009 12:07:44 -0400 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.co m> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com> Message-ID: <20090919160739.03F043A4069@sparrow.telecommunity.com> At 05:00 PM 9/19/2009 +0100, Ren? Dudfield wrote: >Issues with proposal? Things this proposal did not consider? I'm wary of anything that makes correct middleware harder to write. How should an application that does path traversal work? More precisely, take a look at the wsgiref.util.shift_path_info() function; how would it work under your proposal? 
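
The question above can be made concrete with a small sketch. It assumes Python 3 and the standard library's `wsgiref.util.shift_path_info()`; the `callable_environ` dict is a hypothetical rendering of the callable-valued keys from the proposal, not something any server provides. It only illustrates that the existing helper reads and writes plain strings in the environ, so callable values would need an adapter or a rewritten helper.

```python
from wsgiref.util import shift_path_info

# WSGI 1.0 style: environ values are plain strings.
environ = {"SCRIPT_NAME": "", "PATH_INFO": "/foo/bar"}

# Each call moves one path segment from PATH_INFO to SCRIPT_NAME.
first = shift_path_info(environ)
after_first = dict(environ)
second = shift_path_info(environ)
after_second = dict(environ)

# Hypothetical environ in the style of the proposal: values are callables
# that return the string on request. This shape is an assumption made for
# illustration only.
callable_environ = {
    "SCRIPT_NAME": lambda encoding="utf-8": "",
    "PATH_INFO": lambda encoding="utf-8": "/foo/bar",
}

try:
    shift_path_info(callable_environ)
    callable_ok = True
except AttributeError:
    # The helper calls .split('/') on the PATH_INFO value, which a
    # function object does not have.
    callable_ok = False

print(first, after_first)
print(second, after_second)
print("works with callable values:", callable_ok)
```

Any middleware that does path traversal the same way would hit the same problem, which is presumably why the question was raised.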
From renesd at gmail.com Sat Sep 19 18:12:49 2009
From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=)
Date: Sat, 19 Sep 2009 17:12:49 +0100
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <20090919160739.03F043A4069@sparrow.telecommunity.com>
References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com> <20090919160739.03F043A4069@sparrow.telecommunity.com>
Message-ID: <64ddb72c0909190912vef84429g2b26786c5a8bc4e6@mail.gmail.com>

On Sat, Sep 19, 2009 at 5:07 PM, P.J. Eby wrote:
> At 05:00 PM 9/19/2009 +0100, René Dudfield wrote:
>>
>> Issues with proposal? Things this proposal did not consider?
>
> I'm wary of anything that makes correct middleware harder to write. How
> should an application that does path traversal work?
>
> More precisely, take a look at the wsgiref.util.shift_path_info() function;
> how would it work under your proposal?
>

hi,

Not sure. Will look into it.

Another thing I just realised should be in there... there should be an optional way to specify a buffer for the various functions to write into. Otherwise you are forced to allocate memory after all.

Will send an update addressing those two things...
cu, From armin.ronacher at active-4.com Sat Sep 19 19:00:03 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sat, 19 Sep 2009 19:00:03 +0200 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com> Message-ID: <4AB50E13.9040706@active-4.com> Hi, Ren? Dudfield schrieb: > What is proposed: Where was that proposed? > 1. Default utf-8 to be used. That's a possibility yes, but it has to be carefully be considered. > 2. A buffer to be used for raw data. What is raw data? If you mean we keep the unencoded data around, I would strongly argue against that. Otherwise it makes middlewares even harder to write. > 3. New keys which are callables to request the encoding you want. Did I miss something? Why are we requesting encodings now? > 4. Encoding keys are specified. > 4.a URI encoding key 'wsgi.uri_encoding' > 4.b Form data encoding key 'wsgi.form_encoding' > 4.c Page encoding key 'wsgi.page_encoding' > 4.d Header encoding key 'wsgi.header_encoding' I don't know where you are getting that from. The only WSGI key would be `wsgi.uri_encoding` and that is only set by the server and only used for legacy non UTF-8 URLs. > 5. For next version of wsgi (1.1 or 2.0), using an adapter for > backwards compat for wsgi 1.0 apps on wsgi2 server. No decision about WSGI versioning was made so far. If WSGI in Python 3 is based on unicode, then the version is raised to 1.1, 2.0 is not yet discussed as far as I'm concerned. 
> 2.c Avoiding bytes type and syntax for compatibility with <= > python 2.5.4 (buffer, and unicode) If WSGI for Python 3 is based on Unicode it will use '' for textual context and b'' for bytes. If it's based on bytes it will obviously use the byte literals. > 3. Transcoding to only happen if needed. I can't see how that would work if it's based on unicode, if it's based on bytes that's already what happens in WSGI 1. > 4. URI encoding can be explicitly stated in a URI key This value is only *set* by the server on decode, the value is to be ignored by the actual application or middleware except for QUERY_STRING and REQUEST_URI decoding. Everything else makes things a lot more complicated without improving anything. > 5. Backwards compat for wsgi 1.0 apps on wsgi 2 server. Also wsgi > 2.0 apps on wsgi 1.0 server with an adapter. Again, WSGI 2.0 is something that has to be discussed separately, otherwise we totally lose track. > Issues with proposal? Things this proposal did not consider? Yes you did: - it has no real world advantage over either WSGI based on unicode that is utf-8 with latin1 fallback or a WSGI based on bytes. - it's backwards incompatible in every way, even to CGI. - it is slow because every dict access would also cause a function call. Furthermore middlewares would most likely start causing circular dependencies when they replace the callable with a new callable and they do not alias the value as a local in the frame that created it. 
Regards, Armin From fumanchu at aminus.org Sat Sep 19 19:20:59 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Sat, 19 Sep 2009 10:20:59 -0700 Subject: [Web-SIG] String Types in WSGI [Graham's WSGI for py3] In-Reply-To: <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> References: <4AB2634C.2070009@active-4.com><88e286470909180056y7b306e0er94b5d3519e88455c@mail.gmail.com><64ddb72c0909180112h35ad1f23qe643f980f46cc267@mail.gmail.com><01AADF93-8D07-46E1-81EC-32B51657538A@couch.it><88e286470909180321u5f7115a5u877324ee562468bf@mail.gmail.com><64ddb72c0909180445w788cce7eqbb5f12d893290b7d@mail.gmail.com> <88e286470909180455r5877b687waff6609fd864af9b@mail.gmail.com> Message-ID: Ren? Dudfield wrote: > No, slash encoding and normalising are not the only issues. > As mentioned before sometimes you need the exact bytes. > > 1. buggy clients. ?If a client sends something that doesn't work > correctly, you can still sometimes make sense of it in the raw version > of the url. > 2. client APIs that require the server to know the exact url. > 3. buggy servers that don't do their job properly. > 4. extensibility. ?A url scheme changes a tiny bit, and you want to > support the change. ?Having the raw url allows you do to support it > on old servers. > > In all APIs it's handy to go to lower levels when the higher levels > don't work right. ?Especially when wsgi only handles one side of > things, and urls are can be generated by anything. and Graham Dumpleton replied: > This is where it all comes down to me not have the real world > experience in writing web applications to know best. > > What I would like to hear is PJE (who tends towards #3) and Robert > Brewer (who tends towards #4). Can you guys give counter explanations > as to why there arguments for bytes isn't valid. Ian, I don't think > you have yet expressed your leaning, but would like to here your point > as well. No; in fact, I agree that REQUEST_URI should be mandated as bytes. 
IIRC, I'm the one who proposed it ;) Robert Brewer fumanchu at aminus.org From armin.ronacher at active-4.com Sat Sep 19 20:13:57 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sat, 19 Sep 2009 20:13:57 +0200 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated Message-ID: <4AB51F65.2010206@active-4.com> Hi, I know I pretty much SPAM the list here now which is why I added all the changes of WSGI 1.0 and what could become WSGI 1.1 into a repo on bitbucket as two PEPS: http://bitbucket.org/ianb/wsgi-peps/src/ pep-0333.txt This is basically just a new revision for PEP 333 changing the following things: - removing Jython and Python 2.2 compatibility. Jython is close enough to modern Python versions now that this does not make any difference. - fixing wsgi.input by adding a proper readline(). The current version still requires the user to care about not reading past the content length, but if all server implementors agree that could be changed so that the stream provides an end of line marker. - mentioning that WSGI 1.0 is not supported by Python 3. - made WSGI 1.0 depend on bytes. - fixed example code - servers may no longer add a date or server header if that header is already present. (This MUST may become a SHOULD for the server header as it's probably hard to control for things like mod_wsgi) - weakened the rules for buffering and streaming. Everybody does it, so it should be allowed. - added middleware warning for `wsgi.file_wrapper` pep-XXXX.txt This specifies WSGI 1.1 based on #3/#4 in Graham Dumpletons Blog post. The differences to his proposal: - the application iterator must by byte based. I would really require that, so that people explicitly encode their stuff as utf-8 instead of yielding latin1. If we want to allow unicode return values I strongly encourage using utf-8 for the return value because we already require UTF-8 URLs. 
- clarified wsgi.uri_encoding, that algorithm should not be the default but the only one to make it easier for applications to reencode URIs. - Stick to `start_response` and `exc_info` but add deprecation warnings for `exc_info` and `write()`. This should make it easier to port applications over. Breaking too many APIs at the same time is probably not the best idea. If we really want to get rid of `start_response` at the same time, I would suggest using ``(appiter, status, headers)`` instead of ``(status, headers, appiter)``. The former is the current common signature of response objects which would make it possible to convert from a WSGI application response to a response object by doing something like this: response = Response(*wsgi_app(request.environ)) The XXXX PEP is currently missing any copyright information and headers and should only be considered as a draft. Regards, Armin From renesd at gmail.com Sat Sep 19 20:14:23 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Sat, 19 Sep 2009 19:14:23 +0100 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <4AB50E13.9040706@active-4.com> References: <4AB4B530.7080000@active-4.com> <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com> <4AB50E13.9040706@active-4.com> Message-ID: <64ddb72c0909191114s4859a77dg618216c0e8e6e284@mail.gmail.com> On Sat, Sep 19, 2009 at 6:00 PM, Armin Ronacher wrote: > Hi, > > Ren? Dudfield schrieb: >> What is proposed: > Where was that proposed? > >> ? ? 1. Default utf-8 to be used. > That's a possibility yes, but it has to be carefully be considered. > >> ? ? 2. A buffer to be used for raw data. > What is raw data? ?If you mean we keep the unencoded data around, I > would strongly argue against that. 
Otherwise it makes middlewares even
> harder to write.
>

raw data in this case is whatever the data from the server is. The idea is to convert it on demand.

>>     3. New keys which are callables to request the encoding you want.
> Did I miss something? Why are we requesting encodings now?
>

You can request encodings. The idea is to make it explicit about which encoding you want. This also allows no conversion to take place if it isn't needed.

Converting strings is a waste of time if it's not needed.

    >>> b = "a" * 4096
    >>> %timeit b.decode('utf-8')
    100000 loops, best of 3: 15 µs per loop

Even length 1 strings take a while too.

    >>> b = "a"
    >>> %timeit b.decode('utf-8')
    1000000 loops, best of 3: 1.84 µs per loop

Note, that you need a method call with the decode anyway. In comparison a method call is a tiny amount of time.

    >>> a = {'SCRIPT_INFO':"asdfasdf", 'SCRIPT_INFO2': lambda : 'asdfasdf2'}
    >>> %timeit a['SCRIPT_INFO2']()
    1000000 loops, best of 3: 267 ns per loop
    >>> %timeit a['SCRIPT_INFO']
    10000000 loops, best of 3: 122 ns per loop

This is why avoiding encode/decode work is better.

If environ was allowed to be a non dict... and a real object then it would be possible to avoid the dict key lookup and the method call.

>>     4. Encoding keys are specified.
>>     4.a URI encoding key 'wsgi.uri_encoding'
>>     4.b Form data encoding key 'wsgi.form_encoding'
>>     4.c Page encoding key 'wsgi.page_encoding'
>>     4.d Header encoding key 'wsgi.header_encoding'
> I don't know where you are getting that from. The only WSGI key would
> be `wsgi.uri_encoding` and that is only set by the server and only used
> for legacy non UTF-8 URLs.
>

I got that from your list of things with different encodings. Why not use it for the other parts as well? Some header keys use different encodings, as does form data, and page encodings.

>>     5. For next version of wsgi (1.1 or 2.0), using an adapter for
>> backwards compat for wsgi 1.0 apps on wsgi2 server.
> No decision about WSGI versioning was made so far. ?If WSGI in Python 3 > is based on unicode, then the version is raised to 1.1, ?2.0 is not yet > discussed as far as I'm concerned. > Sure, it's a separate issue. However I'm addressing it here, . WSGI 2.0 has been discussed in various emails recently, and in grahams blog post. Also here is a wsgi 2.0 wiki page on wsgi.org. >> ? ? 2.c Avoiding bytes type and syntax for compatibility with <= >> python 2.5.4 (buffer, and unicode) > If WSGI for Python 3 is based on Unicode it will use '' for textual > context and b'' for bytes. ?If it's based on bytes it will obviously use > the byte literals. Again, using bytes doesn't seem as nice as using buffers along with unicode. Since buffers can be faster(not immutable so you can avoid memory allocation, and make use of zero copy networking), and buffers are available in more versions of python. > >> ? ? 3. Transcoding to only happen if needed. > I can't see how that would work if it's based on unicode, if it's based > on bytes that's already what happens in WSGI 1. > Since you can request different encodings, if an encoding is available it can be given... if it's not available the conversion can be made. If you don't need the conversion to be done... the conversion can be avoided completely. >> ? ? 4. URI encoding can be explicitly stated in a URI key > This value is only *set* by the server on decode, the value is to be > ignored by the actual application or middleware except for QUERY_STRING > and REQUEST_URI decoding. ?Everything else makes things a lot more > complicated without improving anything. > yeah, the server states what is happening. As the application requests what it wants, it doesn't need to query those keys. >> ? ? 5. Backwards compat for wsgi 1.0 apps on wsgi 2 server. ?Also wsgi >> 2.0 apps on wsgi 1.0 server with an adapter. > Again, WSGI 2.0 is something that has to be discussed separately, > otherwise we totally lose track. > >> Issues with proposal? 
?Things this proposal did not consider? > Yes you did: > > - ?it has no real world advantage over either WSGI based on unicode > ? that is utf-8 with latin1 fallback or a WSGI based on bytes. I listed all the advantages in the 'This allows or this is good because:' section. Can you explain why they are not real? > - ?it's backwards incompatible in every way, even to CGI. why is it? wsgi apps can use an adapter to use it. wsgi 1.0 servers can also use an adapter. > - ?it is slow because every dict access would also cause a function > ? call. As explained above, the transcoding cost can be avoided or reduced, function calls need to be made anyway (the decode() calls), and there's also the possibility of using buffers to avoid memory allocation and allow zero copy networking. > Furthermore middlewares would most likely start causing > ? circular dependencies when they replace the callable with a new > ? callable and they do not alias the value as a local in the frame > ? that created it. > Yes, I think the callables will need a set method... rather than letting the middleware replace callables. I think this could be used for middleware: environ['SCRIPT_NAME'](set = "/bla/", urldecoding = False, encoding ='utf-8') but then this(one callable) would probably be better ;) environ(what='SCRIPT_NAME', set = "/bla/", urldecoding = False, encoding ='utf-8') Since changing the middleware could potentially trigger the rest of the decoding. In some situations you would want to avoid reading from the socket at all. So middleware changing stuff would mean you would need to read from the socket(obviously you need to read stuff before changing it). Why would you not want to read from the socket at all? (wsgi 1.0 makes these impossible) - to block certain hosts by looking at their ip. - you might just care about a connection, like any connection triggers an action. - for load balancing - to look at the port number, eg, to check if port 443 is used. 
- if you are overloaded(dos), you want to drop the connection right away. - ... others. So allowing the server to avoid most processing before the application requests certain data could be a good thing. So with middleware changing the environ, it means that all those callables need to be linked to allow the rest of them to know something has been changed. So that when one thing is changed, it drops back to wsgi 1.0 behaviour - that is, some of the encoding is done just before any change is allowed. Or maybe middleware has to call a environ['changing']() callable. Which could then trigger the callables internal transcoding and socket reading etc. I'm not sure if it will make middleware harder to use or not still. I'm working through the function Philip sent to see how it turns out, and will send an updated proposal after that. From henry at precheur.org Sat Sep 19 22:05:58 2009 From: henry at precheur.org (Henry Precheur) Date: Sat, 19 Sep 2009 13:05:58 -0700 Subject: [Web-SIG] Unicode in Python 3 In-Reply-To: <64ddb72c0909191114s4859a77dg618216c0e8e6e284@mail.gmail.com> References: <4AB4B9A5.102@active-4.com> <64ddb72c0909190526n607747fbod923588b0ecbbeab@mail.gmail.com> <4AB4CFBE.70604@active-4.com> <64ddb72c0909190610j1e82dce4s7e761f1cc57ecafe@mail.gmail.com> <64ddb72c0909190701p6853d092m45d77f2c6f40a59c@mail.gmail.com> <64ddb72c0909190900m3fb30426i4812e7f00db4d998@mail.gmail.com> <4AB50E13.9040706@active-4.com> <64ddb72c0909191114s4859a77dg618216c0e8e6e284@mail.gmail.com> Message-ID: <20090919200558.GA8642@banane.novuscom.net> On Sat, Sep 19, 2009 at 07:14:23PM +0100, Ren? Dudfield wrote: > Yes, I think the callables will need a set method... rather than > letting the middleware replace callables. 
> > I think this could be used for middleware: > environ['SCRIPT_NAME'](set = "/bla/", urldecoding = False, encoding ='utf-8') > > but then this(one callable) would probably be better ;) > environ(what='SCRIPT_NAME', set = "/bla/", urldecoding = False, > encoding ='utf-8') > > Since changing the middleware could potentially trigger the rest of > the decoding. In some situations you would want to avoid reading from > the socket at all. So middleware changing stuff would mean you would > need to read from the socket(obviously you need to read stuff before > changing it). > > Why would you not want to read from the socket at all? (wsgi 1.0 > makes these impossible) > - to block certain hosts by looking at their ip. > - you might just care about a connection, like any connection > triggers an action. > - for load balancing > - to look at the port number, eg, to check if port 443 is used. > - if you are overloaded(dos), you want to drop the connection right away. > - ... others. > > So allowing the server to avoid most processing before the application > requests certain data could be a good thing. > > So with middleware changing the environ, it means that all those > callables need to be linked to allow the rest of them to know > something has been changed. So that when one thing is changed, it > drops back to wsgi 1.0 behaviour - that is, some of the encoding is > done just before any change is allowed. > > Or maybe middleware has to call a environ['changing']() callable. > Which could then trigger the callables internal transcoding and socket > reading etc. > > > I'm not sure if it will make middleware harder to use or not still. > I'm working through the function Philip sent to see how it turns out, > and will send an updated proposal after that. All this is a lot of 'added functionality'. What made WSGI nice is that it was using only basic Python types. We should try to simplify the interface, not try to make it 'smarter' by adding clever features. 
Let the frameworks do that.

--
Henry Precheur

From ianb at colorstudy.com Sun Sep 20 06:46:34 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Sat, 19 Sep 2009 23:46:34 -0500
Subject: [Web-SIG] Unicode in Python 3
In-Reply-To: <4AB4B530.7080000@active-4.com>
References: <4AB4B530.7080000@active-4.com>
Message-ID:

I can't read all this thread carefully, too much stuff.

I will note however that people are STILL ignoring surrogateescape (http://www.python.org/dev/peps/pep-0383/). This is like the third or fourth time I've brought it up. It was added to Python 3.1 for some of the exact issues we are encountering.

Particularly, imagine someone requests /foo%efbar (which is not valid UTF-8).

    >>> SCRIPT_NAME = b'/foo\xefbar'  # after url unquoting (urllib.request.unquote doesn't work for this currently)
    >>> s = SCRIPT_NAME.decode('utf8', 'surrogateescape')
    >>> s
    '/foo\udcefbar'
    >>> s.encode('utf8', 'surrogateescape')
    b'/foo\xefbar'

So we can have unicode values that can be safely and correctly transcoded to other encodings (or handled in their raw form). The constraints on surrogateescape are:

* You have to use 'surrogateescape' during decoding and encoding (I think for decoding it should be part of the spec)
* You have to know the encoding; doing s.encode('latin1', 'surrogateescape') wouldn't necessarily preserve the correct bytes (it does for this example, but wouldn't if there was a mix of valid UTF-8 and invalid bytes)

And there's a bit of an annoyance to the fact that SCRIPT_NAME/PATH_INFO should always be treated as UTF-8 (which might sometimes be wrong, but for any modern app/browser will be right), but maybe other parts (HTTP_COOKIE?) are in "native" encoding.

Well, besides HTTP_COOKIE, I don't know what else would be in a different encoding. Atompub adds Slug, but it's a URL/IRI, so it should be ASCII. I have seen proposals for a Title header (e.g., when PUTting an image and giving it a title), and that could be unicode.
But in all those cases it'll be a modern app and modern clients, and in those cases people just use UTF-8. Frankly I'm open to UTF-8-everywhere. People mentioned Jack and Rack, and to what degree that works, it probably works because everyone uses UTF-8. With surrogateescape we allow transcoding when needed (e.g., if you wanted to handle redirects from old/weird non-UTF-8 URLs) but keep things reasonably simple otherwise. Ian From armin.ronacher at active-4.com Sun Sep 20 14:31:23 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sun, 20 Sep 2009 14:31:23 +0200 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> Message-ID: <4AB6209B.4010800@active-4.com> Hi, Graham Dumpleton schrieb: > Regardless of the details of changes being made to the PEP and the > creation of any new ones, do we need to first agree on the overall > direction we are going to take. Ie., the grand plan at a high level. Indeed. The 0333 changes are mostly uncontroversial and can be discovered separately. So far the discussions on this mailinglist in the last days only covered what would be a new WSGI version which is in the XXXX file. > What I am getting at here is that the likes of PJE has indicated a > preference for skipping any WSGI 1.1 altogether and going straight to > WSGI 2.0. If there isn't going to be support all round for even coming > out with WSGI 1.1, then don't want to see time wasted trying to come > up with a new PEP only for what is needed to change. The time wasted on XXXX is not that much, it's just your #3 written down to text with the unicode return values. 
> So, I am starting to get nervous that we could go to a great deal of
> work to try and resolve the various issues for a specific definition,
> only to find that people don't even agree that such a version is
> warranted and we get a deadlock.
WSGI 1.1 as currently specified in XXXX would be pretty uncontroversial on Python 2.x because of the str/unicode coercion that Python implicitly applies and that this is basically the only change.

> 1. Clarifications and corrections to existing WSGI for Python 2.X
Is already in 0333 in the repository.

> 2. Come up with a version of WSGI for Python 3.X. The whole bytes
> versus unicode discussion.
That is in XXXX, just that this new version of WSGI also works in Python 2.x and is unicode based.

> 3. Drop the start_response() function and ability to use its write()
> function returned as result. What people have been calling WSGI 2.0.
That would be too many changes at the same time. We can specify WSGI 2.0 at the same time based on XXXX and just change the return value to ``(app_iter, status, headers)`` and drop the `start_response`. But that really breaks applications and workflows and I don't think everybody would switch over to that right away.

> The first question is, should Python 2.X forever be bytes everywhere,
> or if we start introducing unicode [...]
Latest version of XXXX specifies it as unicode for 2.x and 3.x except where native strings still make sense.

> In my definitions I introduced 'native' string along with 'bytes' and
> 'unicode' string in an attempt to try and be able to use one set of
> language which would describe WSGI and be interpretable in the context
> of both Python 2.X and Python 3.X.
XXXX is basically that.
> The second question is, do we want to try and come up with something > for Python 3.X, ie., (2) above, while still preserving the current > start_response() callback, or do we instead want to jump direct to > WSGI (Python 3.X) 2.0, ie., combine (2) and (3) above, and say that > there is no WSGI 1.X for Python 3.X at all? XXXX does not drop start_response. That would break too much code (all middlewares and it's not straightforward to write middlewares for both start_response and without then). > For example, one option for a roadmap would keep bytes everywhere in > Python 2.X and jump direct to WSGI 2.0 in Python 3.X. IMO WSGI 1.0 should just fix the small problem it has, and WSGI 1.1 goes to unicode in both versions. > WSGI (Python 2.X) 1.1 - Clarify existing WSGI by adding (1) above. > WSGI (Python 2.X) 2.0 - Drop start_response() from WSGI (Python 2.X) > 1.1. Keep bytes everywhere. > WSGI (Python 3.X) 2.0 - Adapt WSGI (Python 2.X) 2.0 to Python 3.X. Use > definition #4 (or more likely a variation on it). For that I would rather go like this: WSGI 1.0 stays the same as PEP 0333 currently is WSGI 1.1 becomes what Ian and I added to PEP 0333 WSGI 2.0 becomes a modified version of PEP XXXX WSGI 3.0 like XXX but drops start_response > One reason for still keeping bytes everywhere in Python 2.X is that is > because how it is and if unicode introduced then possibly would just > be ignored by people anyway. If WSGI 2.0 based on the list above introduces unicode to both Python 2.x and Python 3.x not much would change for the user. Frameworks are already using unicode everywhere already, if the decoding step happens in the webserver they just would have to make their own decoding a NOOP if they detect version (2, 0). > Second reason is whereby Ian is promoting PEP 0383 as way of resolving > transcoding issues If we want to be WSGI still Python 2.x compliant for all version (which I hope we do), that is out of the question. 
Also latin1 is fine because it's actually what HTTP speaks and does not drop any information. For URIs we do what browsers do already which also does not lose any information. Don't see what 0383 gives us we can't have with what you and Robert are already doing. > So, perhaps we can step back for a minute and ask those couple of > major questions. To state them again, they were: > > 1. Do we keep bytes everywhere forever in Python 2.X, or try to > introduce unicode there at all to at least mirror what changes might > be made to make WSGI workable in Python 3.X? -1, Specifications should work the same in 2.x and 3.x > 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to > WSGI 2.0 for Python 3.X? Yes. > I would like to see all the major players, ie., Robert, Armin, PJE and > Ian, plus if possible, major developers on packages like Pylons, TG, > Django, Zope/Repoze etc, at least comment on these two questions. You got my answers. :) Regards, Armin From armin.ronacher at active-4.com Sun Sep 20 14:36:12 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sun, 20 Sep 2009 14:36:12 +0200 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <4AB6209B.4010800@active-4.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <4AB6209B.4010800@active-4.com> Message-ID: <4AB621BC.1090305@active-4.com> Hi, Armin Ronacher schrieb: > WSGI 1.1 as currently specified in XXXX would be pretty uncontroversial > on Python 2.x because of the str/unicode coercion that Python implicitly > applies and that this is basically the only change. Based on the table, XXXX is 2.0 now. > That would be too many changes at the same time. We can specify WSGI > 2.0 at the same time based on XXXX Would be 3.0 then. > IMO WSGI 1.0 should just fix the small problem it has, and WSGI 1.1 goes > to unicode in both versions. Based on the table. 1.0 is 1.1 and 1.1 is 2.0. 
I hope that unconfuses my mail, but I'm pretty sure it did not :) Regards, Armin From graham.dumpleton at gmail.com Sun Sep 20 14:37:28 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Sun, 20 Sep 2009 22:37:28 +1000 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <4AB6209B.4010800@active-4.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <4AB6209B.4010800@active-4.com> Message-ID: <88e286470909200537w70e752b1g7003feede34105cc@mail.gmail.com> 2009/9/20 Armin Ronacher : > For that I would rather go like this: > > WSGI 1.0 stays the same as PEP 0333 currently is > WSGI 1.1 becomes what Ian and I added to PEP 0333 > WSGI 2.0 becomes a modified version of PEP XXXX > WSGI 3.0 like XXX but drops start_response > > > ... > >> 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to >> WSGI 2.0 for Python 3.X? > Yes. Except that when I meant WSGI 2.0, I meant without start_response. You instead bump dropping that out to WSGI 3.0, which is fine. Okay, you just followed up with those clarifications anyway. :-) Graham From robillard.etienne at gmail.com Sun Sep 20 14:48:06 2009 From: robillard.etienne at gmail.com (Etienne Robillard) Date: Sun, 20 Sep 2009 08:48:06 -0400 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <4AB6209B.4010800@active-4.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <4AB6209B.4010800@active-4.com> Message-ID: <4AB62486.6070905@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Armin Ronacher wrote: > Hi, > > Graham Dumpleton schrieb: >> Regardless of the details of changes being made to the PEP and the >> creation of any new ones, do we need to first agree on the overall >> direction we are going to take. Ie., the grand plan at a high level. > Indeed. The 0333 changes are mostly uncontroversial and can be > discovered separately.
So far the discussions on this mailinglist in > the last days only covered what would be a new WSGI version which is in > the XXXX file. > >> What I am getting at here is that the likes of PJE has indicated a >> preference for skipping any WSGI 1.1 altogether and going straight to >> WSGI 2.0. If there isn't going to be support all round for even coming >> out with WSGI 1.1, then don't want to see time wasted trying to come >> up with a new PEP only for what is needed to change. > The time wasted on XXXX is not that much, it's just your #3 written down > to text with the unicode return values. > >> So, I am starting to get nervous that we could go to a great deal of >> work to try and resolve the various issues for a specific definition, >> only to find that people don't even agree that such a version is >> warranted and we get a deadlock. > WSGI 1.1 as currently specified in XXXX would be pretty uncontroversial > on Python 2.x because of the str/unicode coercion that Python implicitly > applies and that this is basically the only change. > >> 1. Clarifications and corrections to existing WSGI for Python 2.X > Is already in 0333 in the repository. > >> 2. Come up with a version of WSGI for Python 3.X. The whole bytes >> versus unicode discussion. > That is in XXXX, just that this new version of WSGI also works in Python > 2.x and is unicode based. > >> 3. Drop the start_response() function and ability to use its write() >> function returned as result. What people have been calling WSGI 2.0. > That would be too many changes at the same time. We can specify WSGI > 2.0 at the same time based on XXXX and just change the return value to > ``(app_iter, status, headers)`` and drop the `start_response`. But that > really breaks applications and workflows and I don't think everybody > would swtich over to that right away. > >> The first question is, should Python 2.X forever be bytes everywhere, >> or if we start introducing unicode [...] 
> Latest version of XXXX specifies ist as unicode for 2.x and 3.x except > where native strings still make sense. > >> In my definitions I introduced 'native' string along with 'bytes' and >> 'unicode' string in an attempt to try and be able to use one set of >> language which would describe WSGI and be interpretable in the context >> of both Python 2.X and Python 3.X. > XXXX is basically that. > >> The second question is, do we want to try and come up with something >> for Python 3.X, ie., (2) above, while still preserving the current >> start_response() callback, or do we instead want to jump direct to >> WSGI (Python 3.X) 2.0, ie., combine (2) and (3) above, and say that >> there is no WSGI 1.X for Python 3.X at all? > XXXX does not drop start_response. That would break too much code (all > middlewares and it's not straightforward to write middlewares for both > start_response and without then). > >> For example, one option for a roadmap would keep bytes everywhere in >> Python 2.X and jump direct to WSGI 2.0 in Python 3.X. > IMO WSGI 1.0 should just fix the small problem it has, and WSGI 1.1 goes > to unicode in both versions. > >> WSGI (Python 2.X) 1.1 - Clarify existing WSGI by adding (1) above. >> WSGI (Python 2.X) 2.0 - Drop start_response() from WSGI (Python 2.X) >> 1.1. Keep bytes everywhere. >> WSGI (Python 3.X) 2.0 - Adapt WSGI (Python 2.X) 2.0 to Python 3.X. Use >> definition #4 (or more likely a variation on it). > For that I would rather go like this: > > WSGI 1.0 stays the same as PEP 0333 currently is > WSGI 1.1 becomes what Ian and I added to PEP 0333 > WSGI 2.0 becomes a modified version of PEP XXXX > WSGI 3.0 like XXX but drops start_response Good plan but I'm afraid now only a bunch of elite people on this list is going to remember all the details on theses "upcoming" specifications. Why the rush to specify WSGI 3.0 and not focus mainly on the next one ahead ? 
With regards, Etienne -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkq2JIYACgkQ/VP9MZjcTlePlwCfQKKjLp0ZUyObFJvYbUHIARdY sqwAoJ99JgdKaIVjw5SZfXveFS+tSj7/ =VHY9 -----END PGP SIGNATURE----- From armin.ronacher at active-4.com Sun Sep 20 15:06:14 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sun, 20 Sep 2009 15:06:14 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes Message-ID: <4AB628C6.1000208@active-4.com> Hello everybody, Thanks to Graham Dumpleton and Robert Brewer there is currently some serious progress on WSGI. I have proposed a roadmap with some PEP changes that now needs some input. Summary: WSGI 1.0 stays the same as PEP 0333 currently is; WSGI 1.1 becomes what Ian and I added to PEP 0333; WSGI 2.0 becomes a unicode powered version of WSGI 1.1; WSGI 3.0 becomes WSGI 2.0 just without start_response. WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python 3 because of changes in the standard library that no longer work with a byte-only approach. The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/ Neither the wording nor the changes in there are anywhere near final. Graham wrote down two questions he wants every major framework developer to answer. These should guide the way to new WSGI standards: 1. Do we keep bytes everywhere forever in Python 2.X, or try to introduce unicode there at all to at least mirror what changes might be made to make WSGI workable in Python 3.X? 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to WSGI 2.0 for Python 3.X? I added a new question I think should be asked too: 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to WSGI 3.0 and drop start_response? The following things became pretty clear when playing around with various specifications on Python 3: - Python 3 no longer implicitly converts between unicode and byte strings.
This covers comparisons, the regular expression engine, all string functions and many modules in the stdlib. - The Python 3 stdlib radically moved to unicode for non unicode things as well (the http servers, http clients, url handling etc.) - A byte only version of WSGI appears unrealistic on Python 3 because it would require server and middleware implementors to reimplement parts of the standard library to work on bytes again. - unicode support can be added for WSGI on both Python 2.x and Python 3.x without removing functionality. Browsers are already doing a similar encoding trick as proposed by Graham Dumpleton to handle URLs. - Python 2.x already accepts unicode strings for many things such as URL handling thanks to the fact that unicode and byte strings are surprisingly interchangeable. - cgi.FieldStorage and some other parts is now totally broken on Python 3 and should no longer be used in 3.0 and 3.1 because it reads the response body into memory. This currently affects WebOb, Pylons and TurboGears. I sent this mail to every major framework / WSGI implementor so that we get input even if you're missing the discussion on web-sig. Regards, Armin From graham.dumpleton at gmail.com Sun Sep 20 13:37:05 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Sun, 20 Sep 2009 21:37:05 +1000 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <4AB51F65.2010206@active-4.com> References: <4AB51F65.2010206@active-4.com> Message-ID: <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> 2009/9/20 Armin Ronacher : > Hi, > > I know I pretty much SPAM the list here now which is why I added all the > changes of WSGI 1.0 and what could become WSGI 1.1 into a repo on > bitbucket as two PEPS: > > http://bitbucket.org/ianb/wsgi-peps/src/ > > > pep-0333.txt > > This is basically just a new revision for PEP 333 changing the following > things: > > - removing Jython and Python 2.2 compatibility. 
Jython is close enough > to modern Python versions now that this does not make any difference. > > - fixing wsgi.input by adding a proper readline(). The current version > still requires the user to care about not reading past the content > length, but if all server implementors agree that could be changed so > that the stream provides an end of line marker. > > - mentioning that WSGI 1.0 is not supported by Python 3. > > - made WSGI 1.0 depend on bytes. > > - fixed example code > > - servers may no longer add a date or server header if that header is > already present. (This MUST may become a SHOULD for the server > header as it's probably hard to control for things like mod_wsgi) > > - weakened the rules for buffering and streaming. Everybody does it, > so it should be allowed. > > - added middleware warning for `wsgi.file_wrapper` > > > pep-XXXX.txt > > This specifies WSGI 1.1 based on #3/#4 in Graham Dumpleton's blog post. > The differences to his proposal: > > - the application iterator must be byte based. I would really require > that, so that people explicitly encode their stuff as utf-8 instead > of yielding latin1. If we want to allow unicode return values I > strongly encourage using utf-8 for the return value because we already > require UTF-8 URLs. > > - clarified wsgi.uri_encoding, that algorithm should not be the default > but the only one to make it easier for applications to reencode URIs. > > - Stick to `start_response` and `exc_info` but add deprecation warnings > for `exc_info` and `write()`. This should make it easier to port > applications over. Breaking too many APIs at the same time is > probably not the best idea. > > > If we really want to get rid of `start_response` at the same time, I > would suggest using ``(appiter, status, headers)`` instead of > ``(status, headers, appiter)``.
The former is the current common > signature of response objects which would make it possible to convert > from a WSGI application response to a response object by doing something > like this: > > response = Response(*wsgi_app(request.environ)) > > The XXXX PEP is currently missing any copyright information and headers > and should only be considered as a draft. Regardless of the details of changes being made to the PEP and the creation of any new ones, we need to first agree on the overall direction we are going to take. Ie., the grand plan at a high level. What I am getting at here is that the likes of PJE have indicated a preference for skipping any WSGI 1.1 altogether and going straight to WSGI 2.0. If there isn't going to be support all round for even coming out with WSGI 1.1, then I don't want to see time wasted trying to come up with a new PEP only for what is needed to change. I actually suggested going straight to WSGI 2.0 back at the start of last year and got chastised for making the suggestion. The criticisms back then were because I was saying that since people were going to have to make changes for Python 3.0 anyway, why not enforce an API change at the same time. This didn't go down too well with those who wanted to promote 2to3 as the way of migrating to Python 3.0, even though I was already pointing out that WSGI as it was probably wasn't going to work on Python 3.0. Our own ongoing discussions have proved out that point and that some change will be required to make it usable. I do acknowledge though that I wanted to skip WSGI 1.X altogether for Python 3.0, whereas PJE is trying to install his preferred definition #3, and one he has always promoted from day one, as WSGI 1.0 for Python 3.X, even though it doesn't comply with the WSGI PEP and by rights shouldn't be called WSGI 1.0.
So, I am starting to get nervous that we could go to a great deal of work to try and resolve the various issues for a specific definition, only to find that people don't even agree that such a version is warranted and we get a deadlock. Looking at the bigger picture, there are three overall goals that I can see that we would want to address. 1. Clarifications and corrections to existing WSGI for Python 2.X to allow readline() with size hint, mandatory end of stream sentinel for wsgi.input, support for chunked request content and rules on amount of data that should be returned by WSGI applications and how much data wsg.file_wrapper should send back from a file when Content-Length is defined. These were the points (11) to (16) that I tacked onto my definition #4, in my blog post. They are applicable though to any update to WSGI for any version of Python. 2. Come up with a version of WSGI for Python 3.X. The whole bytes versus unicode discussion. 3. Drop the start_response() function and ability to use its write() function returned as result. What people have been calling WSGI 2.0. To go along with that, there are a couple major questions I think needs to be answered and this will dictate to a degree what any roadmap will be. The first question is, should Python 2.X forever be bytes everywhere, or if we start introducing unicode into parts of the definition for Python 3.X, should those versions of the WSGI specification map those unicode parts back in to the Python 2.X of an equivalent version of the specification? In my definitions I introduced 'native' string along with 'bytes' and 'unicode' string in an attempt to try and be able to use one set of language which would describe WSGI and be interpretable in the context of both Python 2.X and Python 3.X. For definition #4, this mean defining SCRIPT_NAME, PATH_INFO and QUERY_STRING as 'unicode' string. This meant that for Python 2.X, they would as such also be unicode string. 
The other option was to define them as 'native' string, which means the whole 'wsgi.uri_encoding' flag was only relevant to Python 3.X, as in Python 2.X the native string is 'bytes' and so the whole encoding issue would still be up to the WSGI application as it is now for bytes everywhere WSGI in Python 2.X. In effect, if they were 'native' strings and 'wsgi.uri_encoding' went way, we just have existing WSGI 1.0. The only actual difference was that I was adding on top of definition #4 the clarifications as per (1) above. The second question is, do we want to try and come up with something for Python 3.X, ie., (2) above, while still preserving the current start_response() callback, or do we instead want to jump direct to WSGI (Python 3.X) 2.0, ie., combine (2) and (3) above, and say that there is no WSGI 1.X for Python 3.X at all? For example, one option for a roadmap would keep bytes everywhere in Python 2.X and jump direct to WSGI 2.0 in Python 3.X. WSGI (Python 2.X) 1.1 - Clarify existing WSGI by adding (1) above. WSGI (Python 2.X) 2.0 - Drop start_response() from WSGI (Python 2.X) 1.1. Keep bytes everywhere. WSGI (Python 3.X) 2.0 - Adapt WSGI (Python 2.X) 2.0 to Python 3.X. Use definition #4 (or more likely a variation on it). One reason for still keeping bytes everywhere in Python 2.X is that is because how it is and if unicode introduced then possibly would just be ignored by people anyway. Second reason is whereby Ian is promoting PEP 0383 as way of resolving transcoding issues for Python 3.X. The library functions for PEP 0383 are only in Python 3.1 which straight away says we possibly have to abandon any concept of supporting Python 3.0, but also means not really practical to also push back and start using unicode in Python 2.X either. This is because one of the things that makes writing WSGI adapters easy is that no dependence on a third party package is required. By having to use PEP 0383, you are effectively then bound to Python 3.1+. 
It would just be a PITA if WSGI adapters had to provide their own implementation of those library functions to support older Python versions or if WSGI adapters had to depend on a third party package not part of Python itself. The second option for a roadmap, if want to start introducing unicode to Python 2.X, and effectively maintain one WSGI specification that works for Python 2.X and Python 3.X, would mirror somewhat what I originally blogged about. WSGI (Python 3.X) 1.0 - Use definition #3, even though it doesn't agree with WSGI 1.0 specification and so should not really be labelled as such. Only really doing this because of fact that wsgiref and some other implementations were using this already. Don't bother to add clarifications in (1) above as can't guarantee existing implementations are implemented that way. WSGI (Python 2.X/3.X) 1.1 - Use definition #4 (or more likely a variation on it). Add clarifications in (1) above. WSGI (Python 2.X/3.X) 2.0 - Drop start_response() from WSGI (Python 2.X/3.X) 1.1. That is two options and there would be others as well. For example, replace in second option WSGI (Python 3.X) 1.0 with the bytes only version of WSGI, ie., use definition #1. So, perhaps we can step back for a minute and ask those couple of major questions. To state them again, they were: 1. Do we keep bytes everywhere forever in Python 2.X, or try to introduce unicode there at all to at least mirror what changes might be made to make WSGI workable in Python 3.X? 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to WSGI 2.0 for Python 3.X? I would like to see all the major players, ie., Robert, Armin, PJE and Ian, plus if possible, major developers on packages like Pylons, TG, Django, Zope/Repoze etc, at least comment on these two questions. Settling on the overall plan before we go any further would be a good start and avoid have to change course later. 
Graham From pje at telecommunity.com Sun Sep 20 16:43:52 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sun, 20 Sep 2009 10:43:52 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB628C6.1000208@active-4.com> References: <4AB628C6.1000208@active-4.com> Message-ID: <20090920144350.839F13A403D@sparrow.telecommunity.com> At 03:06 PM 9/20/2009 +0200, Armin Ronacher wrote: >Hello everybody, > >Thanks to Graham Dumpleton and Robert Brewer there is some serious >progress on WSGI currently. I proposed a roadmap with some PEP changes >now that need some input. > >Summary: > > WSGI 1.0 stays the same as PEP 0333 currently is > WSGI 1.1 becomes what Ian and I added to PEP 0333 > WSGI 2.0 becomes a unicode powered version of WSGI 1.1 > WSGI 3.0 becomes WSGI 2.0 just without start_response Since there's already a well-established notion of WSGI 2.0 being the new calling convention, I would suggest (to avoid confusion) renaming your "2.0" to "1.2" or "1.5" or something instead. > WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python > 3 because of changes in the standard library that no longer work with > a byte-only approach. This is unfortunate, but it should probably be considered a bellwether for Python 3 porting in general, alas. The Python 3 stdlib *should* work with bytes, and the fact that it does not should be treated as a bug in the stdlib rather than something to be worked around in WSGI. >Graham wrote down two questions he wants every major framework developer >to be answered. These should guide the way to new WSGI standards: > >1. Do we keep bytes everywhere forever in Python 2.X, or try to > introduce unicode there at all to at least mirror what changes might > be made to make WSGI workable in Python 3.X? Technically, we are not using bytes but "native" strings, i.e. type 'str'. What benefit would introducing unicode produce? >2. 
Do we skip WSGI 1.X completely for Python 3.X and go straight to > WSGI 2.0 for Python 3.X? This discussion has been going on for so long that I've already forgotten what the problem was with just using the original 1.0 spec for 3.X, i.e., using native strings for everything, using latin-1 encoding. The only things I can recall off the top of my head are that the input stream would still be bytes, and that the environment might've used a different encoding. I don't know if such an approach should actually be *recommended*, but having a migration path for WSGI 1.0-> Python 3.X sounds like a good idea, if it can be done strictly as errata/clarification of the existing spec. Otherwise, might as well forget the whole thing and go straight to the latest and greatest (i.e. what has previously been called 2.0 and you're calling 3.0.) >I added a new question I think should be asked too: > >3. Do we skip WSGI 2.0 as specified in the PEP and go straight to > WSGI 3.0 and drop start_response? I suggest skipping straight to the latest and greatest with no in-betweens at all, other than errata/clarifications on 1.0. Having lots of variations of a "standard" is a bug, not a feature! >The following things became pretty clear when playing around with >various specifications on Python 3: > >- Python 3 no longer implicitly converts between unicode and byte > strings. This covers comparisons, the regular expression engine, > all string functions and many modules in the stdlib. >- The Python 3 stdlib radically moved to unicode for non unicode things > as well (the http servers, http clients, url handling etc.) > >- A byte only version of WSGI appears unrealistic on Python 3 because > it would require server and middleware implementors to reimplement > parts of the standard library to work on bytes again. IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage? 
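For readers following along, the first quoted point above (Python 3 no longer implicitly converts between unicode and byte strings) can be seen with a few lines of Python 3. This is only an illustrative sketch; the variable names are made up for the example and nothing here comes from either PEP draft.

```python
import re

path_bytes = b"/index"
path_text = "/index"

# Comparison never coerces: a bytes object and a str are simply unequal.
same = (path_bytes == path_text)

# Concatenation raises TypeError instead of coercing one side.
try:
    path_bytes + path_text
    concat_ok = True
except TypeError:
    concat_ok = False

# The regular expression engine also refuses to mix the two types.
try:
    re.match("/in", path_bytes)
    regex_ok = True
except TypeError:
    regex_ok = False

print(same, concat_ok, regex_ok)
```

On Python 2 the same operations would have gone through the implicit ASCII coercion, which is why code that quietly mixed the two types there needs explicit encode/decode steps when ported.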
From pje at telecommunity.com Sun Sep 20 16:51:02 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sun, 20 Sep 2009 10:51:02 -0400 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <4AB62486.6070905@gmail.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <4AB6209B.4010800@active-4.com> <4AB62486.6070905@gmail.com> Message-ID: <20090920145103.2B0B53A407A@sparrow.telecommunity.com> At 08:48 AM 9/20/2009 -0400, Etienne Robillard wrote: >Good plan but I'm afraid now only a bunch of elite people on this list >is going to remember all the details on theses "upcoming" >specifications. Why the rush to specify WSGI 3.0 and not focus >mainly on the next one ahead ? Because having more versions of the spec is a bug, not a feature. How many versions will a server or framework be reasonably expected to support? Also, note that changing the calling convention ensures that you can't accidentally run an old application with a server that will be sending it different information than it expects. This ensures that we can have clean boundary conditions between spec versions, and can hopefully just have a few, well-tested and battle-hardened converters. From armin.ronacher at active-4.com Sun Sep 20 16:50:36 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Sun, 20 Sep 2009 16:50:36 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090920144350.839F13A403D@sparrow.telecommunity.com> References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> Message-ID: <4AB6413C.5030001@active-4.com> Hi, P.J. Eby schrieb: > This discussion has been going on for so long that I've already > forgotten what the problem was with just using the original 1.0 spec > for 3.X, i.e., using native strings for everything, using latin-1 > encoding. 
The only things I can recall off the top of my head are > that the input stream would still be bytes, and that the environment > might've used a different encoding. Django, Pylons, SQLAlchemy, Mako, Jinja2, Genshi, Werkzeug, WebOb and many more technologies are based on unicode, even in Python 2.x. They are currently doing decoding of byte data internally. If we stick to native strings for WSGI 2.0 / 1.5 / whatever, we suddenly have different code paths for Python 3 and Python 2, because in Python 3 we suddenly already have unicode data. You're assuming a situation where the application in Python 2.x was byte based, but in the majority of cases that is not the situation. > IMO, this strongly suggests that it's the stdlib or Python 3 that's > broken here. How much of the stdlib are we talking about needing to > reimplement, aside from cgi.FieldStorage? I'm already creating a patch for urllib which currently requires unicode. I'm not sure about what to do with cgi.FieldStorage, in general I would not recommend using the cgi module for WSGI applications at all! If we would go with bytes for the WSGI 1.0 spec on Python 3 a WSGI server also has to decode that data from the Server again. Also (something I haven't yet filed as a bug because I guess there will be more changes involved) the HTTP server in Python 3.1 does not support non-ASCII headers. Regards, Armin From renesd at gmail.com Sun Sep 20 17:21:34 2009 From: renesd at gmail.com (René Dudfield) Date: Sun, 20 Sep 2009 16:21:34 +0100 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <20090920145103.2B0B53A407A@sparrow.telecommunity.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <4AB6209B.4010800@active-4.com> <4AB62486.6070905@gmail.com> <20090920145103.2B0B53A407A@sparrow.telecommunity.com> Message-ID: <64ddb72c0909200821j1906b46at8b8693706cd02315@mail.gmail.com> On Sun, Sep 20, 2009 at 3:51 PM, P.J.
Eby wrote: > At 08:48 AM 9/20/2009 -0400, Etienne Robillard wrote: >> >> Good plan but I'm afraid now only a bunch of elite people on this list >> is going to remember all the details on these "upcoming" >> specifications. Why the rush to specify WSGI 3.0 and not focus >> mainly on the next one ahead ? > > Because having more versions of the spec is a bug, not a feature. How many > versions will a server or framework be reasonably expected to support? > > Also, note that changing the calling convention ensures that you can't > accidentally run an old application with a server that will be sending it > different information than it expects. This ensures that we can have clean > boundary conditions between spec versions, and can hopefully just have a > few, well-tested and battle-hardened converters. > +1 for one new version with different behaviour. From pje at telecommunity.com Sun Sep 20 17:22:13 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sun, 20 Sep 2009 11:22:13 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB6413C.5030001@active-4.com> References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> <4AB6413C.5030001@active-4.com> Message-ID: <20090920152210.B61CF3A403D@sparrow.telecommunity.com> At 04:50 PM 9/20/2009 +0200, Armin Ronacher wrote: >Django, Pylons, SQLAlchemy, Mako, Jinja2, Genshi, Werkzeug, WebOb and >many more technologies are based on unicode, even in Python 2.x. They >are currently doing decoding of byte data internally. > >In Python 2.x if we stick to native strings for WSGI 2.0 / 1.5 whatever >we suddenly have different code paths for Python 3 and Python 2. >Because in Python 3 we suddenly already have unicode data. No, you'd have bytes stored in a latin-1 string, which is not quite the same thing as "already [having] unicode data". You have to .encode('latin1').decode(targetencoding) if you want genuine unicode data.
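A minimal sketch of the round trip just described, assuming Python 3 and a hypothetical server that hands raw request bytes to the application as a latin-1 decoded native string. The helper name `to_unicode` and the choice of UTF-8 as the target encoding are illustrative assumptions, not something taken from either PEP draft.

```python
def to_unicode(native, target_encoding="utf-8"):
    """Turn a latin-1 "bytes in disguise" str into genuine text.

    Hypothetical helper: re-encode as latin-1 to get the original bytes
    back, then decode them with the encoding the client really used.
    """
    return native.encode("latin-1").decode(target_encoding)


# Raw bytes as they might arrive on the wire: "/café" encoded as UTF-8.
raw_path = b"/caf\xc3\xa9"

# Assumption: the server decodes as latin-1, which maps every byte value
# 0-255 to exactly one code point, so no information is lost.
native_path = raw_path.decode("latin-1")

# The result is a str, but not yet meaningful text.
assert native_path != "/café"

# Re-encoding recovers the wire bytes exactly.
assert native_path.encode("latin-1") == raw_path

text_path = to_unicode(native_path)
print(text_path)
```

The point of the sketch is only that the latin-1 step is lossless in both directions, so the application stays free to pick the real encoding later.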
If you're saying that people's code would have to change when they go to Python 3 (i.e., adding the extra .encode()), I think that's already a given for *any* non-trivial code, not just WSGI. > > IMO, this strongly suggests that it's the stdlib or Python 3 that's > > broken here. How much of the stdlib are we talking about needing to > > reimplement, aside from cgi.FieldStorage? >I'm already creating a patch for urllib which currently requires >unicode. I'm not sure about what to do with cgi.FieldStorage, in >general I would not recommend using the cgi module for WSGI applications >at all! But people do, in fact, use it for WSGI on 2.x, so if having "different code paths" is a problem, certainly dropping the cgi module is at least as big of a problem, if not considerably more so. I think one of the reasons that the current (and ongoing) PEP discussions have been foundering is that there isn't a clear delineation of goals at the high level, and rather just a bunch of tradeoff discussions, absent any criteria by which to make the tradeoffs. To me, I'd rather see people port to a new WSGI spec (with a new calling convention) on Python 2, and only *then* transition to Python 3. If we do that well, then the intermediate pain disappears -- as does the pain and complexity of trying to make a bastardized in-between specification. ;-) Truth be told, we can probably do that new spec *faster* if we don't have to worry too much about backward compatibility, and just design it for the way things are now, instead of worrying about the past. Even if we have to do some odd things inside a 2-to-1 converter, there should ideally only have to be a handful of such converters ever written. 
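As a rough illustration of such a converter, the sketch below adapts an application written against a start_response-less calling convention so that a WSGI 1.0 server can run it. The new-style signature (a callable that takes environ and returns a status, headers, body tuple) is only an assumption for the example, as are the names adapt_to_wsgi1 and hello_app; the actual convention was still under discussion.

```python
def adapt_to_wsgi1(new_style_app):
    """Wrap a hypothetical start_response-less app as a WSGI 1.0 callable."""
    def wsgi1_app(environ, start_response):
        # Assumed new-style convention: the app returns everything at once.
        status, headers, body = new_style_app(environ)
        start_response(status, headers)
        return body
    return wsgi1_app


def hello_app(environ):
    # Hypothetical new-style application: no start_response argument.
    return "200 OK", [("Content-Type", "text/plain")], [b"Hello, world!\n"]


# The wrapped object can be handed to any WSGI 1.0 server.
application = adapt_to_wsgi1(hello_app)
```

A converter in the other direction (running an old application on a new-style server) is harder, because it has to deal with the write() callable and with start_response being called late; that is not attempted here.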
From g.brandl at gmx.net Sun Sep 20 17:29:21 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sun, 20 Sep 2009 17:29:21 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090920144350.839F13A403D@sparrow.telecommunity.com> References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> Message-ID: P.J. Eby schrieb: >>- Python 3 no longer implicitly converts between unicode and byte >> strings. This covers comparisons, the regular expression engine, >> all string functions and many modules in the stdlib. >>- The Python 3 stdlib radically moved to unicode for non unicode things >> as well (the http servers, http clients, url handling etc.) >> >>- A byte only version of WSGI appears unrealistic on Python 3 because >> it would require server and middleware implementors to reimplement >> parts of the standard library to work on bytes again. > > IMO, this strongly suggests that it's the stdlib or Python 3 that's > broken here. How much of the stdlib are we talking about needing to > reimplement, aside from cgi.FieldStorage? FWIW, it's very much possible that the py3k stdlib is broken there. Many modules were "ported" with the aim "get the test running again", and not too much thought about bytes/unicode issues. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. 
From graham.dumpleton at gmail.com Sun Sep 20 22:52:16 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 21 Sep 2009 06:52:16 +1000 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB6413C.5030001@active-4.com> References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> <4AB6413C.5030001@active-4.com> Message-ID: <88e286470909201352j76bdfec9o22374915859041e7@mail.gmail.com> 2009/9/21 Armin Ronacher : >> IMO, this strongly suggests that it's the stdlib or Python 3 that's >> broken here. How much of the stdlib are we talking about needing to >> reimplement, aside from cgi.FieldStorage? > I'm already creating a patch for urllib which currently requires > unicode. I'm not sure about what to do with cgi.FieldStorage, in > general I would not recommend using the cgi module for WSGI applications > at all! If we would go with bytes for the WSGI 1.0 spec on Python 3 a > WSGI server also has to decode that data from the Server again. > > Also (something I haven't yet filed as a bug because I guess there will > be more changes involved) the HTTP server in Python 3.1 does not support > non-ASCII headers. Read the following first: http://bugs.python.org/issue4953 http://bugs.python.org/issue4661 These are the ones I know about that affect cgi.FieldStorage. Graham From graham.dumpleton at gmail.com Mon Sep 21 01:34:33 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 21 Sep 2009 09:34:33 +1000 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <20090920145103.2B0B53A407A@sparrow.telecommunity.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <4AB6209B.4010800@active-4.com> <4AB62486.6070905@gmail.com> <20090920145103.2B0B53A407A@sparrow.telecommunity.com> Message-ID: <88e286470909201634t75c5b300oac443b32b892c5d@mail.gmail.com> 2009/9/21 P.J.
Eby : > At 08:48 AM 9/20/2009 -0400, Etienne Robillard wrote: >> >> Good plan but I'm afraid now only a bunch of elite people on this list >> is going to remember all the details on these "upcoming" >> specifications. Why the rush to specify WSGI 3.0 and not focus >> mainly on the next one ahead? > > Because having more versions of the spec is a bug, not a feature. How many > versions will a server or framework be reasonably expected to support? I think there are perhaps two aspects to the original question about why even project ahead to WSGI 3.0 (no start_response) at this point. The first is why we want to even be considering dropping start_response at all at this point if people see the current way of doing things as reasonable. The second is whether any effort in drafting a new specification can be avoided at all by going direct to WSGI 3.0, combining amendments, going unicode and dropping start_response all in one go. I have a few views on this. The first is that although I probably have the most difficult job to implement multiple WSGI versions, given that mod_wsgi is all in C code with no Python code at all, the changes we are talking about at the moment aren't so drastic that I can't relatively easily support multiple WSGI versions. In fact, mod_wsgi already implements what we are talking about in WSGI 1.1. This is because WSGI 1.1 is more about providing guarantees to the WSGI application based on how the majority of WSGI adapters/servers already work. I would expect it will be even simpler for WSGI adapters/servers implemented in pure Python to cope with multiple versions. In respect of defining all the versions now as a well defined roadmap, from the adapter/server side, you need to realise that existing implementations have become quite stable. As such, the interval between updates to them is going to get larger and larger.
For me, I would rather add in the support for WSGI 1.1/2.0/3.0 now, knowing that I likely might not make another major version release of mod_wsgi for a year or more, if at all. If mod_wsgi is stable and I end up not seeing a need to go and implement the further new features I have speculated on, then I don't want to come back in a year just to add WSGI 3.0 support. Further, given that Python 3.X is going to be a trigger point for people to at least make changes to unicode, it would be nice to have WSGI 3.0 out there as a separate additional step they can consider at the same time. Ultimately I feel it will be the general masses and not WSGI adapter/server implementers who will make the decision about how big a jump they will want to make. Enforcing a jump to WSGI 3.0 may not be looked on favourably given that that is a much more significant change. Think of all the problems with migration to Python 3.X as far as people waiting for third party modules to be updated. You will see the same problem with WSGI components if we go direct to WSGI 3.0. Because WSGI 2.0 for Python 2.X is, due to the way bytes/unicode can still be used interchangeably, a much simpler transition, I can see a quick movement to that, or even for the WSGI components to work with all of WSGI 1.0/1.1/2.0 at the same time. For many, if they are smart about it, they may even be able to easily support WSGI 3.0 as well at the same time as 1.0/1.1/2.0. The key here would probably be us defining prototype skeletons for code which exhibits how supporting multiple versions in WSGI components could be done easily. As to the idea that going direct to WSGI 3.0 will save us some work in drafting any specification at least, I don't see that it will, on the basis that at the moment the only difference we are talking about between WSGI 2.0/3.0 is dropping of start_response().
This is because bits of the other stuff already talked about in relation to what we were calling WSGI 2.0 previously, is already a part of what we are now saying is WSGI 1.1/2.0. In other words, the only important thing that has got deferred to WSGI 3.0 is dropping of start_response(). Some of the other comments seem to indicate that perhaps some have much more drastic changes in mind than just that. If you do, then perhaps you want to outline what those other changes may be so can gauge how significant they are. Even if people decide to only go as far as WSGI 2.0 at this point, I would still likely implement WSGI 3.0 as an experimental feature just so people can play with it directly rather than have to fiddle with adapters. It wouldn't be the default version anyway, so its present isn't going to affect the general population. BTW, Python 2.6 can itself be seen as a transitional version in some ways between Python 2.5 and Python 3.0. So, not like the idea of WSGI 2.0 being a transitional version is unique. :-) Graham From fumanchu at aminus.org Mon Sep 21 03:25:00 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Sun, 20 Sep 2009 18:25:00 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB628C6.1000208@active-4.com> References: <4AB628C6.1000208@active-4.com> Message-ID: Armin Ronacher wrote: > Thanks to Graham Dumpleton and Robert Brewer there is some serious > progress on WSGI currently. I proposed a roadmap with some PEP changes > now that need some input. > > Summary: > > WSGI 1.0 stays the same as PEP 0333 currently is > WSGI 1.1 becomes what Ian and I added to PEP 0333 > WSGI 2.0 becomes a unicode powered version of WSGI 1.1 > WSGI 3.0 becomes WSGI 2.0 just without start_response > > WSGI 1.0 and 1.1 are byte based and nearly impossible to use on > Python > 3 because of changes in the standard library that no longer work with > a byte-only approach. 
> > > The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/ > Neither the wording not the changes in there are anywhere near final. > > > Graham wrote down two questions he wants every major framework > developer > to be answered. These should guide the way to new WSGI standards: > > 1. Do we keep bytes everywhere forever in Python 2.X, or try to > introduce unicode there at all to at least mirror what changes might > be made to make WSGI workable in Python 3.X? I'm happy either way, since CherryPy abstracts it all away. Decide already and I'll implement it. > 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to > WSGI 2.0 for Python 3.X? +1 for skipping straight to unicode in Python 3. But call it "1.1" not "2.0". > I added a new question I think should be asked too: > > 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to > WSGI 3.0 and drop start_response? No. We need more time to discuss and try to implement the large architectural changes in that. I need to ship CP 3.2 soon and would like it to have a better Python 3 story than the "bytes-everywhere" (or "unicode pretending to be bytes") of WSGI 1.0. We have working code, which uses unicode in Python 3. Maybe I'll call it "wsgi.version = (1, 'cp32')" and let the spec come later if we can't see the trees for the forest. Robert Brewer fumanchu at aminus.org From fumanchu at aminus.org Mon Sep 21 03:46:42 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Sun, 20 Sep 2009 18:46:42 -0700 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > Looking at the bigger picture, there are three overall goals that I > can see that we would want to address. > > 1. 
Clarifications and corrections to existing WSGI for Python 2.X to > allow readline() with size hint, mandatory end of stream sentinel for > wsgi.input, support for chunked request content and rules on amount of > data that should be returned by WSGI applications and how much data > wsg.file_wrapper should send back from a file when Content-Length is > defined. These were the points (11) to (16) that I tacked onto my > definition #4, in my blog post. They are applicable though to any > update to WSGI for any version of Python. > > 2. Come up with a version of WSGI for Python 3.X. The whole bytes > versus unicode discussion. > > 3. Drop the start_response() function and ability to use its write() > function returned as result. What people have been calling WSGI 2.0. My goals, in priority order, for the next version(s) of WSGI: 1. Full unicode (not just x00-xFF) in Python 3 for the environ keys and most values (not wsgi.input, for example). 2. Points 11-16 as you described. 3. The ability to upgrade a WSGI1.0/CPython2.x application to CPython3 using 2to3, minimizing ancillary changes, even if that means requiring an upgrade to the WSGI version in the process. 4. Minimize the special cases in any new spec. Note this is at the lowest priority. > To go along with that, there are a couple major questions I think > needs to be answered and this will dictate to a degree what any > roadmap will be. > > The first question is, should Python 2.X forever be bytes everywhere, > or if we start introducing unicode into parts of the definition for > Python 3.X, should those versions of the WSGI specification map those > unicode parts back in to the Python 2.X of an equivalent version of > the specification? CherryPy 3.x on Python 2.x will always use bytes everywhere, as we have always done. So I understand completely if Django, Pylons, etc have "always used" unicode and want to keep doing that. 
If y'all decide to make a version of WSGI which requires unicode because you think it's easier or more popular, no problem--CherryPy 3.2+ on Python 2 will just convert back to bytes before handing off that data to CherryPy apps. This is one reason why a new "wsgi.url_encoding" entry would be required if SCRIPT_NAME/PATH_INFO/QUERY_STRING become unicode. > In my definitions I introduced 'native' string along with 'bytes' and > 'unicode' string in an attempt to try and be able to use one set of > language which would describe WSGI and be interpretable in the context > of both Python 2.X and Python 3.X. > > For definition #4, this mean defining SCRIPT_NAME, PATH_INFO and > QUERY_STRING as 'unicode' string. This meant that for Python 2.X, they > would as such also be unicode string. The other option was to define > them as 'native' string, which means the whole 'wsgi.uri_encoding' > flag was only relevant to Python 3.X, as in Python 2.X the native > string is 'bytes' and so the whole encoding issue would still be up to > the WSGI application as it is now for bytes everywhere WSGI in Python > 2.X. In effect, if they were 'native' strings and 'wsgi.uri_encoding' > went way, we just have existing WSGI 1.0. The only actual difference > was that I was adding on top of definition #4 the clarifications as > per (1) above. I'd be happy if WSGI 1.1 said "use native" and the "wsgi.uri_encoding" entry was only required on versions of Python where the native string type is unicode. That's an extra paragraph in the spec, yes, so violates my goal 4 a bit, but IMO should not outweigh my goals 1, 2, and 3. > The second question is, do we want to try and come up with something > for Python 3.X, ie., (2) above, while still preserving the current > start_response() callback, or do we instead want to jump direct to > WSGI (Python 3.X) 2.0, ie., combine (2) and (3) above, and say that > there is no WSGI 1.X for Python 3.X at all? 
I want something in between so I don't have to wait months or years for WSGI 2. I want to ship a version of CherryPy with Python 3 support last week. Robert Brewer fumanchu at aminus.org From fumanchu at aminus.org Mon Sep 21 03:59:38 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Sun, 20 Sep 2009 18:59:38 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090920144350.839F13A403D@sparrow.telecommunity.com> References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> Message-ID: P.J. Eby wrote: > At 03:06 PM 9/20/2009 +0200, Armin Ronacher wrote: > >The following things became pretty clear when playing around with > >various specifications on Python 3: > > > >- Python 3 no longer implicitly converts between unicode and byte > > strings. This covers comparisons, the regular expression engine, > > all string functions and many modules in the stdlib. > >- The Python 3 stdlib radically moved to unicode for non unicode > things > > as well (the http servers, http clients, url handling etc.) > > > >- A byte only version of WSGI appears unrealistic on Python 3 because > > it would require server and middleware implementors to reimplement > > parts of the standard library to work on bytes again. > > IMO, this strongly suggests that it's the stdlib or Python 3 that's > broken here. How much of the stdlib are we talking about needing to > reimplement, aside from cgi.FieldStorage? urllib.unquote, for one. We had to make a version which accepts bytes (and outputs bytes). But it's only 8 lines of code. 
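For readers wondering what such a bytes-in, bytes-out unquote might look like, here is a minimal sketch. It is not the CherryPy implementation; the name unquote_to_bytes_sketch and the choice to leave malformed escapes untouched are assumptions made for illustration.

```python
import binascii


def unquote_to_bytes_sketch(value):
    """Percent-decode a bytes URL component and return bytes."""
    parts = value.split(b"%")
    decoded = [parts[0]]
    for part in parts[1:]:
        hex_pair, rest = part[:2], part[2:]
        try:
            if len(hex_pair) != 2:
                raise ValueError("truncated escape")
            # unhexlify raises binascii.Error (a ValueError) on non-hex input.
            decoded.append(binascii.unhexlify(hex_pair) + rest)
        except ValueError:
            # Leave malformed escapes such as "%zz" or a trailing "%" as-is.
            decoded.append(b"%" + part)
    return b"".join(decoded)
```

Because nothing here decodes to text, the question of which encoding the URL uses is left entirely to the caller.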
Robert Brewer fumanchu at aminus.org From chrism at plope.com Mon Sep 21 06:25:40 2009 From: chrism at plope.com (Chris McDonough) Date: Mon, 21 Sep 2009 00:25:40 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB628C6.1000208@active-4.com> References: <4AB628C6.1000208@active-4.com> Message-ID: <4AB70044.8010204@plope.com> I'll try to digest some of this, currently I'm pretty clueless. Personally, I find it a bit hard to get excited about Python 3 as a web application deployment platform. This is of course a personal judgment (I don't mean to slight Python 3) but at this point, I'll think I'll probably be writing software that targets 2.X exclusively for at least the next five years. Given this point of view, it would be extremely helpful if someone could explain to people with the same outlook why we should want to deal with Unicode strings in any WSGI specification. WSGI is a fairly low-level protocol aimed at folks who need to interface a server to the outside world. The outside world (by its nature) talks bytes. I fear that any implied conversion of environment values and iterable return values to Unicode will actually eventually make things harder than they are now. I realize that it would make middleware implementors lives harder to need to deal in bytes. However, at this point, I also believe that middleware kinda should be hard. We have way too much middleware that shouldn't be middleware these days (some written by myself). Anyway, for us slower (and maybe wrongly fearful) folks, could someone summarize the benefits of having a WSGI specification that requires Unicode. Bonus points for an explanation that does not boil down to "it will be compatible with Python 3". - C Armin Ronacher wrote: > Hello everybody, > > Thanks to Graham Dumpleton and Robert Brewer there is some serious > progress on WSGI currently. I proposed a roadmap with some PEP changes > now that need some input. 
> > Summary: > > WSGI 1.0 stays the same as PEP 0333 currently is > WSGI 1.1 becomes what Ian and I added to PEP 0333 > WSGI 2.0 becomes a unicode powered version of WSGI 1.1 > WSGI 3.0 becomes WSGI 2.0 just without start_response > > WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python > 3 because of changes in the standard library that no longer work with > a byte-only approach. > > > The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/ > Neither the wording not the changes in there are anywhere near final. > > > Graham wrote down two questions he wants every major framework developer > to be answered. These should guide the way to new WSGI standards: > > 1. Do we keep bytes everywhere forever in Python 2.X, or try to > introduce unicode there at all to at least mirror what changes might > be made to make WSGI workable in Python 3.X? > > 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to > WSGI 2.0 for Python 3.X? > > I added a new question I think should be asked too: > > 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to > WSGI 3.0 and drop start_response? > > > The following things became pretty clear when playing around with > various specifications on Python 3: > > - Python 3 no longer implicitly converts between unicode and byte > strings. This covers comparisons, the regular expression engine, > all string functions and many modules in the stdlib. > > - The Python 3 stdlib radically moved to unicode for non unicode things > as well (the http servers, http clients, url handling etc.) > > - A byte only version of WSGI appears unrealistic on Python 3 because > it would require server and middleware implementors to reimplement > parts of the standard library to work on bytes again. > > - unicode support can be added for WSGI on both Python 2.x and Python > 3.x without removing functionality. Browsers are already doing > a similar encoding trick as proposed by Graham Dumpleton to handle > URLs. 
> > - Python 2.x already accepts unicode strings for many things such as > URL handling thanks to the fact that unicode and byte strings are > surprisingly interchangeable. > > - cgi.FieldStorage and some other parts is now totally broken on > Python 3 and should no longer be used in 3.0 and 3.1 because it > reads the response body into memory. This currently affects > WebOb, Pylons and TurboGears. > > > I sent this mail to every major framework / WSGI implementor so that we > get input even if you're missing the discussion on web-sig. > > > Regards, > Armin > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/chrism%40plope.com > From mdipierro at cs.depaul.edu Mon Sep 21 07:16:56 2009 From: mdipierro at cs.depaul.edu (Massimo Di Pierro) Date: Mon, 21 Sep 2009 00:16:56 -0500 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB70044.8010204@plope.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> Message-ID: <9B13C70C-CDE0-44E7-95B3-C937450F192E@cs.depaul.edu> +1 On Sep 20, 2009, at 11:25 PM, Chris McDonough wrote: > I'll try to digest some of this, currently I'm pretty clueless. > > Personally, I find it a bit hard to get excited about Python 3 as a > web > application deployment platform. This is of course a personal > judgment (I > don't mean to slight Python 3) but at this point, I'll think I'll > probably be > writing software that targets 2.X exclusively for at least the next > five years. > > Given this point of view, it would be extremely helpful if someone > could > explain to people with the same outlook why we should want to deal > with Unicode > strings in any WSGI specification. > > WSGI is a fairly low-level protocol aimed at folks who need to > interface a > server to the outside world. The outside world (by its nature) > talks bytes. 
I > fear that any implied conversion of environment values and iterable > return > values to Unicode will actually eventually make things harder than > they are > now. I realize that it would make middleware implementors lives > harder to need > to deal in bytes. However, at this point, I also believe that > middleware kinda > should be hard. We have way too much middleware that shouldn't be > middleware > these days (some written by myself). > > Anyway, for us slower (and maybe wrongly fearful) folks, could someone > summarize the benefits of having a WSGI specification that requires > Unicode. > Bonus points for an explanation that does not boil down to "it will be > compatible with Python 3". > > - C > > > Armin Ronacher wrote: >> Hello everybody, >> >> Thanks to Graham Dumpleton and Robert Brewer there is some serious >> progress on WSGI currently. I proposed a roadmap with some PEP >> changes >> now that need some input. >> >> Summary: >> >> WSGI 1.0 stays the same as PEP 0333 currently is >> WSGI 1.1 becomes what Ian and I added to PEP 0333 >> WSGI 2.0 becomes a unicode powered version of WSGI 1.1 >> WSGI 3.0 becomes WSGI 2.0 just without start_response >> >> WSGI 1.0 and 1.1 are byte based and nearly impossible to use on >> Python >> 3 because of changes in the standard library that no longer work >> with >> a byte-only approach. >> >> >> The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/ >> Neither the wording not the changes in there are anywhere near final. >> >> >> Graham wrote down two questions he wants every major framework >> developer >> to be answered. These should guide the way to new WSGI standards: >> >> 1. Do we keep bytes everywhere forever in Python 2.X, or try to >> introduce unicode there at all to at least mirror what changes >> might >> be made to make WSGI workable in Python 3.X? >> >> 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to >> WSGI 2.0 for Python 3.X? 
>> >> I added a new question I think should be asked too: >> >> 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to >> WSGI 3.0 and drop start_response? >> >> >> The following things became pretty clear when playing around with >> various specifications on Python 3: >> >> - Python 3 no longer implicitly converts between unicode and byte >> strings. This covers comparisons, the regular expression engine, >> all string functions and many modules in the stdlib. >> >> - The Python 3 stdlib radically moved to unicode for non unicode >> things >> as well (the http servers, http clients, url handling etc.) >> >> - A byte only version of WSGI appears unrealistic on Python 3 >> because >> it would require server and middleware implementors to reimplement >> parts of the standard library to work on bytes again. >> >> - unicode support can be added for WSGI on both Python 2.x and >> Python >> 3.x without removing functionality. Browsers are already doing >> a similar encoding trick as proposed by Graham Dumpleton to handle >> URLs. >> >> - Python 2.x already accepts unicode strings for many things such as >> URL handling thanks to the fact that unicode and byte strings are >> surprisingly interchangeable. >> >> - cgi.FieldStorage and some other parts is now totally broken on >> Python 3 and should no longer be used in 3.0 and 3.1 because it >> reads the response body into memory. This currently affects >> WebOb, Pylons and TurboGears. >> >> >> I sent this mail to every major framework / WSGI implementor so >> that we >> get input even if you're missing the discussion on web-sig. 
>> >> >> Regards, >> Armin >> _______________________________________________ >> Web-SIG mailing list >> Web-SIG at python.org >> Web SIG: http://www.python.org/sigs/web-sig >> Unsubscribe: http://mail.python.org/mailman/options/web-sig/chrism%40plope.com >> > > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/mdipierro%40cs.depaul.edu From armin.ronacher at active-4.com Mon Sep 21 07:57:30 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Mon, 21 Sep 2009 07:57:30 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB70044.8010204@plope.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> Message-ID: <4AB715CA.1070404@active-4.com> Hi, Chris McDonough schrieb: > Personally, I find it a bit hard to get excited about Python 3 as a web > application deployment platform. Everybody feels that way currently. But if we don't fix WSGI that will never change. > Given this point of view, it would be extremely helpful if someone could > explain to people with the same outlook why we should want to deal with Unicode > strings in any WSGI specification. I summarized the reasons in my mail. Also have a look at the discussions in this mailinglist that lead to that. Regards, Armin From armin.ronacher at active-4.com Mon Sep 21 08:00:34 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Mon, 21 Sep 2009 08:00:34 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> Message-ID: <4AB71682.2070201@active-4.com> Hi, Robert Brewer schrieb: > urllib.unquote, for one. We had to make a version which accepts bytes > (and outputs bytes). But it's only 8 lines of code. 
Here is a patch for urllib.parse that restores Python 2.x behavior. Because it also changes behavior for Python 3.x I have not yet submitted it for discussion: http://paste.pocoo.org/show/140739/ This adds byte support for all unquoting functions and URL parsing and joining. It also changes the quoting functions to return bytes when passed bytes. The latter is something that most likely does not survive a review on python-dev. Regards, Armin From ubernostrum at gmail.com Mon Sep 21 08:03:48 2009 From: ubernostrum at gmail.com (James Bennett) Date: Mon, 21 Sep 2009 01:03:48 -0500 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB70044.8010204@plope.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> Message-ID: <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> On Sun, Sep 20, 2009 at 11:25 PM, Chris McDonough wrote: > WSGI is a fairly low-level protocol aimed at folks who need to interface a > server to the outside world. The outside world (by its nature) talks bytes. > I fear that any implied conversion of environment values and iterable > return values to Unicode will actually eventually make things harder than > they are now. I realize that it would make middleware implementors' lives > harder to need to deal in bytes. However, at this point, I also believe > that middleware kinda should be hard. We have way too much middleware that > shouldn't be middleware these days (some written by myself). Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an interface to HTTP should deal in bytes as well. The problem, really, is that despite being a very low-level interface, WSGI has a tendency to leak up into much higher-level code, and (IMO) authors of that high-level code really shouldn't have to waste their time dealing with details of the underlying low-level gateway.
You've said you don't want to hear "Python 3" as the reason, but it provides some useful examples: in high-level code you'll commonly want to be doing things like, say, comparing parts of the requested URL path to known strings or patterns. And that high-level code will almost certainly use strings, while WSGI, in theory, will be using bytes. That's just a recipe for disaster; if WSGI mandates bytes, then bytes will have to start "infecting" much higher-level code (since Python 3 -- rightly -- doesn't let you be nearly as promiscuous about mixing bytes and strings). Once I'm at a point where I can use Python 3, I know I'll personally be looking for some library which will normalize everything for me before I interact with it, precisely to avoid this sort of leakage; if WSGI itself would at least *allow* that normalization to happen at the low level (mandating it is another discussion entirely) I'd feel much happier about it going forward. -- "Bureaucrat Conrad, you are technically correct -- the best kind of correct." From armin.ronacher at active-4.com Mon Sep 21 08:28:25 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Mon, 21 Sep 2009 08:28:25 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> Message-ID: <4AB71D09.3040006@active-4.com> Hi, James Bennett schrieb: > Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an > interface to HTTP should deal in bytes as well. If it was just that I would be happy to stay with bytes. But unless the standard library changes in the way it works on Python 3 there is not much but unicode we can use. bytes no longer behave like strings, it's not very comfortable to work with them. 
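The friction described in the last two messages is easy to reproduce. The following Python 3 sketch uses a made-up path and URL pattern (both are assumptions for illustration) to show that bytes and text neither compare equal nor mix in string methods or regular expressions.

```python
import re

# The same hypothetical request path, once as bytes and once as text.
path_info_bytes = b"/blog/2009/"
path_info_text = "/blog/2009/"

# Comparison no longer coerces: bytes and str are simply unequal.
bytes_equal_text = (path_info_bytes == path_info_text)

# String methods refuse to mix the two types.
try:
    path_info_bytes.startswith("/blog")
    method_mixing_allowed = True
except TypeError:
    method_mixing_allowed = False

# A text pattern cannot be applied to a bytes subject either.
try:
    re.match(r"^/blog/(\d+)/$", path_info_bytes)
    regex_mixing_allowed = True
except TypeError:
    regex_mixing_allowed = False

# With text on both sides, ordinary high-level URL matching works as usual.
year = re.match(r"^/blog/(\d+)/$", path_info_text).group(1)
```

High-level code therefore has to pick one type and convert at a well-defined boundary, which is the normalization step being discussed.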
Regards,
Armin

From ubernostrum at gmail.com Mon Sep 21 08:58:12 2009
From: ubernostrum at gmail.com (James Bennett)
Date: Mon, 21 Sep 2009 01:58:12 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB71D09.3040006@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> <4AB71D09.3040006@active-4.com>
Message-ID: <21787a9f0909202358o24e04542q49d611d358b971d4@mail.gmail.com>

On Mon, Sep 21, 2009 at 1:28 AM, Armin Ronacher wrote:
> If it was just that I would be happy to stay with bytes. But unless the
> standard library changes in the way it works on Python 3 there is not
> much but unicode we can use. bytes no longer behave like strings, it's
> not very comfortable to work with them.

Indeed. Hence my comments about WSGI leaking up into other code. Now
that bytes and strings are incompatible, a lot of code which relied on
(arguably) a wart in Python will break.

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

From chrism at plope.com Mon Sep 21 09:10:32 2009
From: chrism at plope.com (Chris McDonough)
Date: Mon, 21 Sep 2009 03:10:32 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com>
Message-ID: <4AB726E8.8090604@plope.com>

OK, after some consideration, I think I'm sold. Answering my own
original question about why unicode seems to make sense as values in the
WSGI environment even without consideration for Python 3 compatibility:
*something* needs to do this translation. Currently I personally rely on
WebOb to do a lot of this translation.
I can't think of a good reason that implementations at the level of
WebOb would each need to do this translation work; pushing the job into
WSGI itself seems to make sense here. This is particularly true for
PATH_INFO and QUERY_STRING; these days it's foolish to assume these
values will be entirely composed of "low order" characters, and thus
being able to access them as bytes natively isn't very useful.

OTOH, I suspect the Python 3 stdlib is still broken if it requires native
strings in various places (and prohibits the use of bytes).

From renesd at gmail.com Mon Sep 21 09:50:35 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 08:50:35 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB71D09.3040006@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> <4AB71D09.3040006@active-4.com>
Message-ID: <64ddb72c0909210050w208bbf03k104d2bbd2d388974@mail.gmail.com>

On Mon, Sep 21, 2009 at 7:28 AM, Armin Ronacher wrote:
> Hi,
>
> James Bennett wrote:
>> Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an
>> interface to HTTP should deal in bytes as well.
> If it was just that I would be happy to stay with bytes. But unless the
> standard library changes in the way it works on Python 3 there is not
> much but unicode we can use. bytes no longer behave like strings, it's
> not very comfortable to work with them.
>

I think http traffic is increasingly more utf-8 these days. Also most
upper level frameworks use unicode natively. So it makes sense to use
utf-8 natively, as an option.
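One way to picture a server decoding PATH_INFO to unicode while still letting an application get the raw bytes back. This is only a sketch: the "wsgi.url_encoding" key name is borrowed from elsewhere in this thread, and the try-UTF-8-then-fall-back-to-ISO-8859-1 policy, along with both helper names, are assumptions made here for illustration, not something any WSGI spec defines.

```python
URL_ENCODING_KEY = "wsgi.url_encoding"  # name taken from the thread; semantics assumed


def decode_path_info(raw_path: bytes, environ: dict) -> None:
    """Store a unicode PATH_INFO and record which encoding produced it."""
    try:
        text, encoding = raw_path.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this never fails
        # and always round-trips back to the original bytes.
        text, encoding = raw_path.decode("iso-8859-1"), "iso-8859-1"
    environ["PATH_INFO"] = text
    environ[URL_ENCODING_KEY] = encoding


def original_bytes(environ: dict) -> bytes:
    """Recover the bytes the client actually sent."""
    return environ["PATH_INFO"].encode(environ[URL_ENCODING_KEY])


if __name__ == "__main__":
    env = {}
    decode_path_info(b"/caf\xc3\xa9", env)  # UTF-8 encoded path
    print(env[URL_ENCODING_KEY], original_bytes(env))
```

Recording the encoding alongside the decoded value is what lets middleware that disagrees with the server's guess re-encode and decode again without losing data.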
From renesd at gmail.com Mon Sep 21 09:54:27 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 08:54:27 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB726E8.8090604@plope.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> <4AB726E8.8090604@plope.com>
Message-ID: <64ddb72c0909210054u14c60e9i18b83f75ba42063e@mail.gmail.com>

On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough wrote:
>
> OTOH, I suspect the Python 3 stdlib is still broken if it requires native
> strings in various places (and prohibits the use of bytes).

yes, python3 stdlib should support 'str'(the old unicode), 'buffer'
and 'bytes' for web using stuff. Buffer is important because it's a
type also used for sockets(along with bytes) and it allows less memory
allocation (because you can reuse buffers).

cheers,

From renesd at gmail.com Mon Sep 21 10:08:43 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 09:08:43 +0100
Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated
In-Reply-To:
References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com>
Message-ID: <64ddb72c0909210108y687ec572p9a55121e552cf53f@mail.gmail.com>

On Mon, Sep 21, 2009 at 2:46 AM, Robert Brewer wrote:
...
> I want something in between so I don't have to wait months or years for
> WSGI 2. I want to ship a version of CherryPy with Python 3 support last
> week.

+1 for wsgi 1.1 *very soon* using the "wsgi.url_encoding" idea Graham
made for unicode.

With the next WSGI afterwards being an 'anything goes' spec, which
addresses all other issues and can come later (including async, using
buffers, and every other idea people can come up with).
cheers,

From g.brandl at gmx.net Mon Sep 21 10:46:26 2009
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 21 Sep 2009 10:46:26 +0200
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909210054u14c60e9i18b83f75ba42063e@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> <4AB726E8.8090604@plope.com> <64ddb72c0909210054u14c60e9i18b83f75ba42063e@mail.gmail.com>
Message-ID:

René Dudfield wrote:
> On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough wrote:
>>
>> OTOH, I suspect the Python 3 stdlib is still broken if it requires native
>> strings in various places (and prohibits the use of bytes).
>
> yes, python3 stdlib should support 'str'(the old unicode), 'buffer'
> and 'bytes' for web using stuff. Buffer is important because it's a
> type also used for sockets(along with bytes) and it allows less memory
> allocation (because you can reuse buffers).

Please don't confuse readers and use the correct name, i.e. 'bytearray'
instead of 'buffer'.

Georg

From renesd at gmail.com Mon Sep 21 11:08:50 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 10:08:50 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> <4AB726E8.8090604@plope.com> <64ddb72c0909210054u14c60e9i18b83f75ba42063e@mail.gmail.com>
Message-ID: <64ddb72c0909210208y53f0a444n9746d4fde82f01b9@mail.gmail.com>

On Mon, Sep 21, 2009 at 9:46 AM, Georg Brandl wrote:
> Please don't confuse readers and use the correct name, i.e. 'bytearray'
> instead of 'buffer'.
>
> Georg
>

Let me try and reduce the confusion...

There are two different python types the py3k socket module uses:
'bytes' and 'buffer'. 'bytes' is kind of like str in python3... but
with reduced functionality (no formatting, less methods etc). buffer
is a Py_buffer from the c api.

buffer, and bytes in socket:
http://docs.python.org/3.1/library/socket.html#socket.socket.recvfrom_into
bytearray: http://docs.python.org/3.1/library/functions.html#bytearray
bytes: http://docs.python.org/3.1/library/functions.html#bytes
buffer: http://docs.python.org/3.1/c-api/buffer.html

This is separate, but related to the point of bytes vs unicode. It is
really (bytes and buffer) vs unicode - since bytes and buffer can be
used with socket. socket never uses a python2 'unicode', or a python3
'str' type.

From graham.dumpleton at gmail.com Mon Sep 21 12:30:01 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Mon, 21 Sep 2009 20:30:01 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909210208y53f0a444n9746d4fde82f01b9@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <21787a9f0909202303v1b789b6fm79f7272cd01d53e4@mail.gmail.com> <4AB726E8.8090604@plope.com> <64ddb72c0909210054u14c60e9i18b83f75ba42063e@mail.gmail.com> <64ddb72c0909210208y53f0a444n9746d4fde82f01b9@mail.gmail.com>
Message-ID: <88e286470909210330t6c518ed1pa427bbb22c5a4b3b@mail.gmail.com>

2009/9/21 René Dudfield:
> This is separate, but related to the point of bytes vs unicode. It is
> really (bytes and buffer) vs unicode - since bytes and buffer can be
> used with socket. socket never uses a python2 'unicode', or a python3
> 'str' type.

A WSGI adapter need not be sitting on top of a socket, it may be based
on some lower level API which provides an abstract interface to the
client connection. For example, in Apache the code handling a request
doesn't deal with the socket. As such, requiring buffer/bytearray
would likely stop you from using any embedded system within a web
server, such as is the case for Apache/mod_wsgi. I would suspect that
requiring buffer/bytearray would also prevent WSGI being used on top
of CGI as well as file objects don't likely deal in those types
either.
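For readers who have not used the buffer reuse being debated, here is a rough sketch built on the real `socket.recv_into` call with a preallocated `bytearray`. The function names (`crc_of_stream`, `demo`), the 1024-byte buffer, and the use of `socket.socketpair()` as a stand-in for a client connection are all assumptions chosen for illustration; the sketch says nothing about whether WSGI should expose such an interface.

```python
import socket
import zlib


def crc_of_stream(sock, buf):
    """Read until EOF, refilling the same buffer; return (byte count, crc32)."""
    view = memoryview(buf)  # lets us slice the buffer without copying it
    total, crc = 0, 0
    while True:
        n = sock.recv_into(view)  # writes into buf in place
        if n == 0:  # peer closed the connection
            break
        crc = zlib.crc32(view[:n], crc)  # process the bytes where they landed
        total += n
    return total, crc


def demo():
    """Send a small payload over a socket pair and checksum it on arrival."""
    payload = bytes(range(256)) * 16  # 4096 bytes, small enough not to block
    left, right = socket.socketpair()
    try:
        left.sendall(payload)
        left.close()  # signals EOF to the reader
        return crc_of_stream(right, bytearray(1024)), payload
    finally:
        right.close()
        left.close()


if __name__ == "__main__":
    (count, crc), sent = demo()
    print(count, crc == zlib.crc32(sent))
```

The point of the pattern is that no new bytes object is allocated per read; whether that matters in practice is exactly what the surrounding messages dispute.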
I would also suggest that pursuing these types is just a case of
premature optimisation. Where is your proof that using them would give
any benefit? The web server layer is never the bottleneck in a web
stack, it is the web application, its routing and rendering systems
and any interaction with a database that are the bottleneck. It would
be a waste of time to overly complicate the WSGI specification for
absolutely no reason. People could get much better performance by
simply paying attention to their own web applications and making them
run better rather than praying that the underlying server is somehow
going to make their application 4 times faster than anything else
around.

Maybe we can call this rush to prematurely optimise, or to jump on the
bandwagon of the latest asynchronous server, "Tornado syndrome". ;-)

Graham

From graham.dumpleton at gmail.com Mon Sep 21 12:42:42 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Mon, 21 Sep 2009 20:42:42 +1000
Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated
In-Reply-To: <64ddb72c0909210108y687ec572p9a55121e552cf53f@mail.gmail.com>
References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <64ddb72c0909210108y687ec572p9a55121e552cf53f@mail.gmail.com>
Message-ID: <88e286470909210342y4ded5455kd89f467b8a2bc797@mail.gmail.com>

2009/9/21 René Dudfield:
> On Mon, Sep 21, 2009 at 2:46 AM, Robert Brewer wrote:
> ...
>> I want something in between so I don't have to wait months or years for
>> WSGI 2. I want to ship a version of CherryPy with Python 3 support last
>> week.
>
> +1 for wsgi 1.1 *very soon* using the "wsgi.url_encoding" idea Graham
> made for unicode.

At this point I would suggest that having 'wsgi.uri_encoding' in WSGI
2.0, as Armin describes, is probably better since the unicode hop is
more than what a minor version change really should entail. Having
definition #3 as WSGI 1.0 for Python 3.X is also probably just a waste
of time and will just confuse.
As it was stated by someone, too many
versions of things isn't good and WSGI 1.0 as per definition #3 for
Python 3.X is one such thing which is unnecessary.

> With the next WSGI afterwards being an 'anything goes' spec, which
> addresses all other issues and can come later (including async, using
> buffers, and every other idea people can come up with).

There are no other issues except for dropping start_response(), and
async doesn't belong in WSGI. If you want async, then come up with a
separate standard. You may well manage some overlap which allows
sharing of some small subset of components, but in the main, a
component which is blocking will not work on async and a component
that uses async features isn't going to work on blocking. Why then
would you make a specification overly complicated by trying to handle
both when there is little if any mutual benefit.

It is also likely that it is going to be hard enough to get people to
switch over, so the last thing you want is drastic change. As Armin
also points out, one doesn't know where web server technology is
going. As such, better off only going as far as WSGI 3.0 as described
and then let things settle down. Once that is all firmly in place and
working well, then can step back and look at where web serving
technology has gone in the mean time.

Graham

From renesd at gmail.com Mon Sep 21 13:15:55 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 12:15:55 +0100
Subject: [Web-SIG] buffer used by socket, should also work with python stdlib Re: Request for Comments on upcoming WSGI Changes
Message-ID: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com>

On Mon, Sep 21, 2009 at 11:30 AM, Graham Dumpleton
<graham.dumpleton at gmail.com> wrote:
> A WSGI adapter need not be sitting on top of a socket, it may be based
> on some lower level API which provides an abstract interface to the
> client connection. For example, in Apache the code handling a request
> doesn't deal with the socket. As such, requiring buffer/bytearray
> would likely stop you from using any embedded system within a web
> server, such as is the case for Apache/mod_wsgi. I would suspect that
> requiring buffer/bytearray would also prevent WSGI being used on top
> of CGI as well as file objects don't likely deal in those types
> either.
>
> I would also suggest that pursuing these types is just a case of
> premature optimisation. Where is your proof that using them would give
> any benefit? The web server layer is never the bottleneck in a web
> stack, it is the web application, its routing and rendering systems
> and any interaction with a database that are the bottleneck. It would
> be a waste of time to overly complicate the WSGI specification for
> absolutely no reason. People could get much better performance by
> simply paying attention to their own web applications and making them
> run better rather than praying that the underlying server is somehow
> going to make their application 4 times faster than anything else
> around.
>
> Maybe we can call this rush to prematurely optimise or jump on the
> bandwagon of the latest asynchronous server Tornado syndrome. ;-)
>
> Graham
>

hi,

Below are the reasons why I think considering buffers for a future
post-wsgi-1.1 spec is useful. I don't think it should be considered for a
wsgi 1.1 - I'm now in agreement with Robert that a wsgi 1.1 should come out
very soon. My specific concern is that Python's stdlib also support 'buffer'
(along with 'bytes' and 'str') - but that is separate from the new wsgi 1.1
spec discussion.

---
I don't think *requiring* the use of buffers is needed... just making it
*possible* to use them.

buffer is one of the types that socket supports, so it makes sense to at
least consider them.

Using buffer would in no way make it impossible to use python in embedded
webservers. You can easily make a Py_buffer from the same memory apache
gives you to create python strings. In fact buffers allow you to support
more embedded systems more easily - since strings are immutable, but not all
embedded systems give you immutable data.
Py_buffer also supports things
like strides, non-contiguous memory, read/write information and other stuff
which make it possible to use more types of memory. It's a lot more useful
to use for python embedded in things than a string type.

This is not just about performance, it's about considering the types used
these days. One of the things that has changed since wsgi 1.0 came out is
that python2.5 and above allow the use of a buffer with sockets. Python3
also changes the types to (str, buffer). mmap is also a very easy way to
share data between multiple python processes - using the Py_buffer allows
you to use mmap too.

Not all applications are the same, and some do require lots of performance.

My use case for requiring buffers is video over the network. When doing
100s of megabytes or gigabytes per second on one machine, copying and
allocating strings is a waste of time. Even allocating and copying 200KB
jpeg images is a big waste of time. It's basic programming optimization
knowledge that allocating and copying memory is slow. So is converting
memory to various different encodings if it's not needed. Proof is by
timing a string allocation + copy + transcode versus just using the buffer
given. By having the server require allocating memory, require copying
memory or require transcoding the memory, that makes my use case a lot
slower than it needs to be.

cheers,

From renesd at gmail.com Mon Sep 21 13:29:31 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 12:29:31 +0100
Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated
In-Reply-To: <88e286470909210342y4ded5455kd89f467b8a2bc797@mail.gmail.com>
References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <64ddb72c0909210108y687ec572p9a55121e552cf53f@mail.gmail.com> <88e286470909210342y4ded5455kd89f467b8a2bc797@mail.gmail.com>
Message-ID: <64ddb72c0909210429n6b95ad5br180a6065cb454e69@mail.gmail.com>

On Mon, Sep 21, 2009 at 11:42 AM, Graham Dumpleton
<graham.dumpleton at gmail.com> wrote:
> 2009/9/21 René Dudfield:
> > On Mon, Sep 21, 2009 at 2:46 AM, Robert Brewer wrote:
> > ...
> >> I want something in between so I don't have to wait months or years for
> >> WSGI 2. I want to ship a version of CherryPy with Python 3 support last
> >> week.
> >
> > +1 for wsgi 1.1 *very soon* using the "wsgi.url_encoding" idea Graham
> > made for unicode.
>
> At this point I would suggest that having 'wsgi.uri_encoding' in WSGI
> 2.0, as Armin describes, is probably better since the unicode hop is
> more than what a minor version change really should entail. Having
> definition #3 as WSGI 1.0 for Python 3.X is also probably just a waste
> of time and will just confuse. As it was stated by someone, too many
> versions of things isn't good and WSGI 1.0 as per definition #3 for
> Python 3.X is one such thing which is unnecessary.
>

Hi,

What are you suggesting? Do you have a preference yet?

> > With the next WSGI afterwards being an 'anything goes' spec, which
> > addresses all other issues and can come later (including async, using
> > buffers, and every other idea people can come up with).
>
> There are no other issues except for dropping start_response(), and
> async doesn't belong in WSGI. If you want async, then come up with a
> separate standard.
> You may well manage some overlap which allows
> sharing of some small subset of components, but in the main, a
> component which is blocking will not work on async and a component
> that uses async features isn't going to work on blocking. Why then
> would you make a specification overly complicated by trying to handle
> both when there is little if any mutual benefit.
>
> It is also likely that it is going to be hard enough to get people to
> switch over, so the last thing you want is drastic change. As Armin
> also points out, one doesn't know where web server technology is
> going. As such, better off only going as far as WSGI 3.0 as described
> and then let things settle down. Once that is all firmly in place and
> working well, than can step back and look at where web serving
> technology has gone in the mean time.
>
> Graham
>

As has been shown, async frameworks *can* support wsgi applications with
things like greenlets. See the Eventlet library.

I think a future spec could include solutions for lots of issues,
including:

- considering async
- considering buffer support
- considering proxying support
- considering lazily transcoding, allowing handling before reading from
  socket.
- considering requests as first class objects rather than as function
  calls.

From graham.dumpleton at gmail.com Mon Sep 21 13:40:31 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Mon, 21 Sep 2009 21:40:31 +1000
Subject: [Web-SIG] buffer used by socket, should also work with python stdlib Re: Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com>
References: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com>
Message-ID: <88e286470909210440q3f9720f3ke70898b0fa3f50aa@mail.gmail.com>

2009/9/21 René Dudfield:
> My use case is video over the network for requiring buffers. When doing
> 100s of megabytes or gigabytes per second on one machine: copying, and
> allocating strings is a waste of time. Even allocating, and copying 200KB
> jpeg images is a big waste of time. It's basic programing optimization
> knowledge that allocating, and copying memory is slow. So is converting
> memory to various different encodings if it's not needed. Proof is by
> timing a string allocation + copy + transcode verses just using the buffer
> given. By having the server require allocating memory, require copying
> memory or require transcoding the memory - that's makes my use case a lot
> slower than it needs to be.

No, proof would be someone taking CherryPy WSGI server and change it
to use buffer and demonstrate that it works and that wouldn't cause an
issue for a high level WSGI application.
A low level benchmark of the performance of a single type versus another in a mock up test case isn't going to prove anything as that doesn't necessarily translate into anything usable. Sorry, if I am setting the bar quite high on this one, but it is quite nebulous that it would at all be useful and so an actual working example would be much more convincing. Graham From graham.dumpleton at gmail.com Mon Sep 21 14:02:17 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Mon, 21 Sep 2009 22:02:17 +1000 Subject: [Web-SIG] PEP 0333 and PEP XXXX Updated In-Reply-To: <64ddb72c0909210429n6b95ad5br180a6065cb454e69@mail.gmail.com> References: <4AB51F65.2010206@active-4.com> <88e286470909200437s795c3055u7edb9f8d2609f416@mail.gmail.com> <64ddb72c0909210108y687ec572p9a55121e552cf53f@mail.gmail.com> <88e286470909210342y4ded5455kd89f467b8a2bc797@mail.gmail.com> <64ddb72c0909210429n6b95ad5br180a6065cb454e69@mail.gmail.com> Message-ID: <88e286470909210502w2f62bd91x4f5d9b776cb31cca@mail.gmail.com> 2009/9/21 Ren? Dudfield : > > > On Mon, Sep 21, 2009 at 11:42 AM, Graham Dumpleton > wrote: >> >> 2009/9/21 Ren? Dudfield : >> > On Mon, Sep 21, 2009 at 2:46 AM, Robert Brewer >> > wrote: >> > ... >> >> I want something in between so I don't have to wait months or years for >> >> WSGI 2. I want to ship a version of CherryPy with Python 3 support last >> >> week. >> > >> > +1 for wsgi 1.1 *very soon* using the "wsgi.url_encoding" idea Graham >> > made for unicode. >> >> At this point I would suggest that having 'wsgi.uri_encoding' in WSGI >> 2.0, as Armin describes, is probably better since the unicode hop is >> more than what a minor version change really should entail. Having >> definition #3 as WSGI 1.0 for Python 3.X is also probably just a waste >> of time and will just confuse. As it was stated by someone, too many >> versions of things isn't good and WSGI 1.0 as per definition #3 for >> Python 3.X is one such thing which is unnecessary. 
>> > Hi, > > What are you suggesting?? Do you have a preference yet? Not all conversation about this are occurring on the WEB-SIG list. There are various side discussions happening more fully exploring the various suggestions and understanding them. It is being done off the list as the past has shown that if every detail is discussed on the list it goes on forever and then just collapses. We are really close this time and not going to let it fail again. If people for some reason think that I am going to come up with the final plan, then you'll just need to wait until I can find time again to blog about how recent discussions have factored into my view of the world. >> > With the next WSGI afterwards being an 'anything goes' spec, which >> > addresses all other issues and can come later (including async, using >> > buffers, and every other idea people can come up with). >> >> There are no other issues except for dropping start_response(), and >> async doesn't belong in WSGI. If you want async, then come up with a >> separate standard. You may well manage some overlap which allows >> sharing of some small subset of components, but in the main, a >> component which is blocking will not work on async and a component >> that uses async features isn't going to work on blocking. Why then >> would you make a specification overly complicated by trying to handle >> both when there is little if any mutual benefit. >> > > > >> It is also likely that it is going to be hard enough to get people to >> switch over, so the last thing you want is drastic change. As Armin >> also points out, one doesn't know where web server technology is >> going. As such, better off only going as far as WSGI 3.0 as described >> and then let things settle down. Once that is all firmly in place and >> working well, than can step back and look at where web serving >> technology has gone in the mean time. 
>> >> Graham > > > As has been shown, async frameworks *can* support wsgi applications with > things like greenlets.? See the Eventlet library. If greenlets do what I am led to believe, then there shouldn't be a need then to even have async mentioned in the WSGI specification at all, as it avoids the whole need for an async API at WSGI interface level. The underlying web server can use whatever internal interface it wants and WSGI interface using greenlets could be built on top of that. There doesn't need to be a standardised interface with that internal interface as it would be an issue for just that particular web server. > I think a future spec could include solutions for lots of issues including. > - considering async As I said, if this means some sort of separate API support, it should be a distinct specification from WSGI, and not a part of it. > - considering buffer support Not proven to be of any benefit at this stage. At least show that buffers can be used as a drop in replacement in existing WSGI applications and you are part way there. Show performance gains in a modified versions of an existing WSGI server for typical WSGI applications, then even better. So, demonstrate its worthwhile and it could be incorporated, but likely that only niche pure Python WSGI servers would use them as for generic WSGI servers, especially those building on non Python web server or infrastructure, likely not worth the trouble. > - considering proxying support Proxying generally requires much lower interaction with underlying request processing and transfer encoding mechanisms. This will not be possible with many hosting solutions. Again, may only be practical with pure Python web servers. > - considering lazily transcoding, allowing handling before reading from > socket. No idea what you are talking about. You would have to explain better. Do note that the WSGI environment dictionary is a Python dictionary and can't be replaced with a custom dictionary class type. 
Thus cannot directly be made to incorporate advanced functionality and doing so would make implementation of WSGI middleware likely much more fiddly. > - considering requests as first class objects rather than as function calls. Outside of scope for WSGI. WSGI is meant to be a low level interface between web server and Python web application. Just because people effectively abused it by using it through all levels of an application, doesn't mean that features intended to make its use in core of web applications simpler should be forced into the low level interface with the web server. In other words, come up with a specification for request objects and other stuff if you want, but it doesn't belong in WSGI, but would be a higher level layer that builds on it. Graham From renesd at gmail.com Mon Sep 21 14:25:55 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Mon, 21 Sep 2009 13:25:55 +0100 Subject: [Web-SIG] buffer used by socket, should also work with python stdlib Re: Request for Comments on upcoming WSGI Changes In-Reply-To: <88e286470909210440q3f9720f3ke70898b0fa3f50aa@mail.gmail.com> References: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com> <88e286470909210440q3f9720f3ke70898b0fa3f50aa@mail.gmail.com> Message-ID: <64ddb72c0909210525s58dc2705x339317a825470f@mail.gmail.com> On Mon, Sep 21, 2009 at 12:40 PM, Graham Dumpleton wrote: > > 2009/9/21 Ren? Dudfield : > > > > > > On Mon, Sep 21, 2009 at 11:30 AM, Graham Dumpleton > > wrote: > >> > >> 2009/9/21 Ren? Dudfield : > >> > On Mon, Sep 21, 2009 at 9:46 AM, Georg Brandl wrote: > >> >> Ren? Dudfield schrieb: > >> >>> On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough > >> >>> wrote: > >> >>>> > >> >>>> OTOH, I suspect the Python 3 stdlib is still broken if it requires > >> >>>> native > >> >>>> strings in various places (and prohibits the use of bytes). 
> >> >>> > >> >>> yes, python3 stdlib should support 'str'(the old unicode), 'buffer' > >> >>> and 'bytes' for web using stuff. ?Buffer is important because it's a > >> >>> type also used for sockets(along with bytes) and it allows less memory > >> >>> allocation (because you can reuse buffers). > >> >> > >> >> Please don't confuse readers and use the correct name, i.e. 'bytearray' > >> >> instead of 'buffer'. > >> >> > >> >> Georg > >> >> > >> > > >> > Let me try and reduce the confusion... > >> > > >> > There are two different python types the py3k socket module uses: > >> > 'bytes' and 'buffer'. ?'bytes' is kind of like str in python3... but > >> > with reduced functionality (no formatting, less methods etc). ?buffer > >> > is a Py_buffer from the c api. > >> > > >> > buffer, and bytes in socket: > >> > > >> > http://docs.python.org/3.1/library/socket.html#socket.socket.recvfrom_into > >> > bytearray: http://docs.python.org/3.1/library/functions.html#bytearray > >> > bytes: http://docs.python.org/3.1/library/functions.html#bytes > >> > buffer: http://docs.python.org/3.1/c-api/buffer.html > >> > > >> > This is separate, but related to the point of bytes vs unicode. ?It is > >> > really (bytes and buffer) vs unicode - since bytes and buffer can be > >> > used with socket. ?socket never uses a python2 'unicode', or a python3 > >> > 'str' type. > >> > >> A WSGI adapter need not be sitting on top of a socket, it may be based > >> on some lower level API which provides an abstract interface to the > >> client connection. For example, in Apache the code handling a request > >> doesn't deal with the socket. As such, requiring buffer/bytearray > >> would likely stop you from using any embedded system within a web > >> server, such as is the case for Apache/mod_wsgi. I would suspect that > >> requiring buffer/bytearray would also prevent WSGI being used on top > >> of CGI as well as file objects don't likely deal in those types > >> either. 
> >> > >> I would also suggest that pursuing these types is just a case of > >> premature optimisation. Where is your proof that using them would give > >> any benefit? The web server layer is never the bottleneck in a web > >> stack, it is the web application, its routing and rendering systems > >> and any interaction with a database that are the bottleneck. It would > >> be a waste of time to overly complicate the WSGI specification for > >> absolutely no reason. People could get much better performance by > >> simply paying attention to their own web applications and making them > >> run better rather than praying that the underlying server is somehow > >> going to make their application 4 times faster than anything else > >> around. > >> > >> Maybe we can call this rush to prematurely optimise or jump on the > >> bandwagon of the latest asynchronous server Tornado syndrome. ;-) > >> > >> Graham > > > > > > hi, > > > > > > Below are the reasons why I think considering buffers for a future > > post-wsgi-1.1 spec is useful.? I don't think it should be considered for a > > wsgi 1.1 - I'm now in agreement with Robert that a wsgi 1.1 should come out > > very soon.? My specific concern is that pythons stdlib also support 'buffer' > > (along with 'bytes' and 'str') - but that is separate from the new wsgi 1.1 > > spec discussion. > > > > > > > > --- > > I don't think *requiring* the use of buffers is needed... just making it > > *possible* to use them. > > > > buffer is one of the types that socket supports, so it makes sense to at > > least consider them. > > > > Using buffer would in no way make it impossible to use python in embedded > > webservers.? You can easily make a Py_buffer from the same memory apache > > gives you to create python strings.? In fact buffers allow you to support > > more embedded systems more easily - since strings are immutable, but not all > > embedded systems give you immutable data.? 
Py_buffer also supports things > > like strides, non-contiguous memory, read/write information and other stuff > > which make it possible to use more types of memory.? It's a lot more useful > > to use for python embedded in things than a string type. > > > > This is not just about performance, it's about considering the types used > > these days.? One of the things that has changed since wsgi 1.0 came out is > > that python2.5 and above allow the use of a buffer with sockets.? Python3 > > also changes the types to (str, buffer).? mmap is also a very easy way to > > share data between multiple python processes - using the Py_buffer allows > > you to use mmap too. > > > > > > Not all applications are the same, and some do require lots of performance. > > > > My use case is video over the network for requiring buffers.? When doing > > 100s of megabytes or gigabytes per second on one machine: copying, and > > allocating strings is a waste of time.? Even allocating, and copying 200KB > > jpeg images is a big waste of time.? It's basic programing optimization > > knowledge that allocating, and copying memory is slow.? So is converting > > memory to various different encodings if it's not needed.? Proof is by > > timing a string allocation + copy + transcode verses just using the buffer > > given.? By having the server require allocating memory, require copying > > memory or require transcoding the memory - that's makes my use case a lot > > slower than it needs to be. > > No, proof would be someone taking CherryPy WSGI server and change it > to use buffer and demonstrate that it works and that wouldn't cause an > issue for a high level WSGI application. > > A low level benchmark of the performance of a single type versus > another in a mock up test case isn't going to prove anything as that > doesn't necessarily translate into anything usable. 
>
> Sorry, if I am setting the bar quite high on this one, but it is quite
> nebulous that it would at all be useful and so an actual working
> example would be much more convincing.
>
> Graham

hi,

As I said, performance isn't the only reason to consider it (other reasons already listed). You seem to have decided it's only an optimization issue for some reason.

An actual working example showing allocating 4.9MB of memory being slower than not allocating 4.9MB of memory? The only difference would be this in pseudo code (without error checking etc):

    def recv(socket, nbytes, dest = None):
        if dest is None:
            # not passing in a buffer, we need to allocate the memory.
            buf = malloc(nbytes);
            dest = make_string_from_buffer(buf)
        else:
            # writing directly into the buffer supplied. No malloc needed.
            buf = dest
        # do the socket recv
        sock_recv_guts(buf, nbytes)
        return dest

Reusing the buffer lets you avoid the cost of malloc every time you read from the buffer. You can store buffers in memory pools (http://en.wikipedia.org/wiki/Memory_pool) to avoid mallocing/freeing all the time.

There's reasons why the socket interface was changed to allow passing in buffers to use. It wasn't just added to python for no reason.

That's all the arguing and explaining I'll do on this - I'm not going to rewrite cherrypy for you as proof.

From armin.ronacher at active-4.com Mon Sep 21 14:27:54 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Mon, 21 Sep 2009 14:27:54 +0200
Subject: [Web-SIG] buffer used by socket, should also work with python stdlib Re: Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909210525s58dc2705x339317a825470f@mail.gmail.com>
References: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com> <88e286470909210440q3f9720f3ke70898b0fa3f50aa@mail.gmail.com> <64ddb72c0909210525s58dc2705x339317a825470f@mail.gmail.com>
Message-ID: <4AB7714A.2060502@active-4.com>

Hi,

René
Dudfield wrote:
> That's all the arguing and explaining I'll do on this - I'm not going
> to rewrite cherrypy for you as proof.
If it just puts a burden on implementors on the client and server side and there is no proof for it to be faster for real world applications we can probably just ignore that then.

Regards,
Armin

From renesd at gmail.com Mon Sep 21 14:58:14 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 13:58:14 +0100
Subject: [Web-SIG] buffer used by socket, should also work with python stdlib Re: Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB7714A.2060502@active-4.com>
References: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com> <88e286470909210440q3f9720f3ke70898b0fa3f50aa@mail.gmail.com> <64ddb72c0909210525s58dc2705x339317a825470f@mail.gmail.com> <4AB7714A.2060502@active-4.com>
Message-ID: <64ddb72c0909210558r6b9f65b9wef2d360e205644b2@mail.gmail.com>

On Mon, Sep 21, 2009 at 1:27 PM, Armin Ronacher wrote:
> Hi,
>
> René Dudfield wrote:
>> That's all the arguing and explaining I'll do on this - I'm not going
>> to rewrite cherrypy for you as proof.
> If it just puts a burden on implementors on the client and server side
> and there is no proof for it to be faster for real world applications we
> can probably just ignore that then.
>
>
> Regards,
> Armin
>

hi,

yes I think ignoring it for now is a good idea. However, it could be a good addition to a future spec. Currently wsgi forces anything built on top to not be able to use them.

It's zero extra work for implementors who don't want to specify a buffer. Implementors and clients can just not pass in or use a destination buffer.

    # non caring use:
    buf = recv(socket, nbytes)

    # buffer caring use:
    buffer = pool.get_buffer()
    buf = recv(socket, nbytes, buffer)

So I don't see it as a burden to use for people who don't care about it.

To explain the mmap use case more clearly...
you could pass in a memory mapped buffer to allow the process to write to disk directly... or as shared memory so other processes can mmap the data and process it. Rather than sending your data over a pipe(as in fastcgi), you can just access it directly. As another piece of evidence that it is faster to use buffers, rather than allocate all the time, nginx uses memory pools. So does apache... and lighttpd... From and-py at doxdesk.com Mon Sep 21 14:49:50 2009 From: and-py at doxdesk.com (And Clover) Date: Mon, 21 Sep 2009 14:49:50 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB628C6.1000208@active-4.com> References: <4AB628C6.1000208@active-4.com> Message-ID: <4AB7766E.2060703@doxdesk.com> > A middleware might re-decode the values if the `wsgi.uri_encoding` is > `iso-8859-1` and only then. Seems like a mistake. If the middleware knows iso-8859-7 is in use, it would need to transcode the charset regardless of whether the initially-submitted bytes were a valid UTF-8 sequence or not. Otherwise the application would break when fed with eg. Greek words that happened to encode to valid UTF-8 bytes. > The application MUST use this value to decode the ``'QUERY_STRING'`` > as well. This will break all use of non-UTF-8 encodings in QUERY_STRING, where the path part of the URL does not contain non-UTF-8 sequences. That includes the very common case where the path part contains only ASCII. http://greek.example.com/myscript.cgi?x=%C2 will fail, as the given UTF-8 sniffer only looks at the path part to determine what encoding to use for both of the path part and the query string. I don't think WSGI should mandate any particular decoding of the QUERY_STRING. To be honest, I'm still uncomfortable with any use of Unicode strings in WSGI. But if we're going to do it, I'd go for consistency. 
Treating the decoding of the URL specially is a nasty hack that is only there because the CGI spec stupidly requires %-decoding to be done on PATH_INFO and SCRIPT_NAME. So why not go with (the long-ago suggested) optional variables like 'wsgi.real_path_info' that, if present, are the original strings before %-decoding? Now it doesn't greatly matter what string types and encodings we pick, because everything will be ASCII anyway. It also solves the %2F problem. If those variables are not present (typically for CGI environments that cannot provide them), the application/framework *may* try recover non-ASCII characters from PATH_INFO/QUERY_STRING, with undefined results. This is the broken-but-sometimes-rescuable status quo for CGI: by the time Python reads non-ASCII characters out of the environment they may already have been mangled by up to two conversion processes. -- And Clover mailto:and at doxdesk.com http://www.doxdesk.com/ From armin.ronacher at active-4.com Mon Sep 21 15:21:32 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Mon, 21 Sep 2009 15:21:32 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB7766E.2060703@doxdesk.com> References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> Message-ID: <4AB77DDC.7030300@active-4.com> Hi, And Clover schrieb: > Seems like a mistake. If the middleware knows iso-8859-7 is in use, it > would need to transcode the charset regardless of whether the > initially-submitted bytes were a valid UTF-8 sequence or not. Otherwise > the application would break when fed with eg. Greek words that happened > to encode to valid UTF-8 bytes. The middleware can never know. WSGI will demand UTF-8 URLs and only provide iso-XXX support for backwards compatibility. > will fail, as the given UTF-8 sniffer only looks at the path part to > determine what encoding to use for both of the path part and the query > string. 
I don't think WSGI should mandate any particular decoding of the > QUERY_STRING. That is indeed a limitation in the specification. That's something we have to think about. Good catch. Regards, Armin From kirke.bent at verizon.net Mon Sep 21 15:50:10 2009 From: kirke.bent at verizon.net (Kirke Bent) Date: Mon, 21 Sep 2009 09:50:10 -0400 Subject: [Web-SIG] WSGI Comments Message-ID: As a newbie and civilian, I probably shouldn't have a vote. However, I will express a preference for a 3.x version as forward-looking as possible, which seems to mean Unicode, doubtless among other things. Kirke Kirke Bent 973-635-0301 kirke.bent at verizon.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From fumanchu at aminus.org Mon Sep 21 17:18:12 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 21 Sep 2009 08:18:12 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB7766E.2060703@doxdesk.com> References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> Message-ID: And Clover wrote: > > A middleware might re-decode the values if the `wsgi.uri_encoding` > > is `iso-8859-1` and only then. > > Seems like a mistake. If the middleware knows iso-8859-7 is in use, it > would need to transcode the charset regardless of whether the > initially-submitted bytes were a valid UTF-8 sequence or not. Otherwise > the application would break when fed with eg. Greek words that happened > to encode to valid UTF-8 bytes. If the entire site expects iso-8859-7 Request-URL's then the deployer should tell the WSGI server to decode using iso-8859-7 instead of utf-8. If only part of the site expects iso-8859-7 then...yeah, it needs to transcode. So what? > > The application MUST use this value to decode the ``'QUERY_STRING'`` > > as well. > > This will break all use of non-UTF-8 encodings in QUERY_STRING, where > the path part of the URL does not contain non-UTF-8 sequences. 
That > includes the very common case where the path part contains only ASCII. > > http://greek.example.com/myscript.cgi?x=%C2 > > will fail, as the given UTF-8 sniffer only looks at the path part to > determine what encoding to use for both of the path part and the query > string. No, it won't fail. WSGI servers do not perform %-decoding of the QUERY_STRING. In the example given, a WSGI 1.1 server will set the Python 3 environ values: {'SCRIPT_NAME': '', 'PATH_INFO': 'myscript.cgi', 'QUERY_STRING': 'x=%C2'} Robert Brewer fumanchu at aminus.org From pje at telecommunity.com Mon Sep 21 17:19:48 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 21 Sep 2009 11:19:48 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB70044.8010204@plope.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> Message-ID: <20090921151951.2BD973A403D@sparrow.telecommunity.com> At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote: >Anyway, for us slower (and maybe wrongly fearful) folks, could >someone summarize the benefits of having a WSGI specification that >requires Unicode. Bonus points for an explanation that does not boil >down to "it will be compatible with Python 3". +1. I'd really rather not have the spec dictated by the need to work around problems in the stdlib or language definition. Better to fix them ASAP. From pje at telecommunity.com Mon Sep 21 17:23:44 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 21 Sep 2009 11:23:44 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB715CA.1070404@active-4.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <4AB715CA.1070404@active-4.com> Message-ID: <20090921152343.7CC3D3A4119@sparrow.telecommunity.com> At 07:57 AM 9/21/2009 +0200, Armin Ronacher wrote: >Hi, > >Chris McDonough schrieb: > > Personally, I find it a bit hard to get excited about Python 3 as a web > > application deployment platform. 
>Everybody feels that way currently. But if we don't fix WSGI that will >never change. This is only compounding the errors introduced by the "make the tests pass" philosophy of "porting" the stdlib. We should not make them worse. At the moment (AFAIK) nobody has gone through the web bits of the stdlib and asked, "Should this work on strings, bytes, or both, and if both, how should that API be expressed?" From pje at telecommunity.com Mon Sep 21 17:26:21 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 21 Sep 2009 11:26:21 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> Message-ID: <20090921152619.F1C803A4156@sparrow.telecommunity.com> At 08:18 AM 9/21/2009 -0700, Robert Brewer wrote: >If the entire site expects iso-8859-7 Request-URL's then the deployer >should tell the WSGI server to decode using iso-8859-7 instead of utf-8. Can we please please not add any more deployment options to WSGI? Options are complexity multipliers. (Also note that the use of middleware and component-based applications means that there is no such thing as an "entire site" that could make such a decision, while preserving the composability of WSGI apps.) From ubernostrum at gmail.com Mon Sep 21 17:27:28 2009 From: ubernostrum at gmail.com (James Bennett) Date: Mon, 21 Sep 2009 10:27:28 -0500 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090921151951.2BD973A403D@sparrow.telecommunity.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> Message-ID: <21787a9f0909210827x292d76a2q754213a1b0d7b366@mail.gmail.com> On Mon, Sep 21, 2009 at 10:19 AM, P.J. Eby wrote: > +1. ?I'd really rather not have the spec dictated by the need to work around > problems in the stdlib or language definition. ?Better to fix them ASAP. 
This is a *Python* web server gateway interface, yes? Fixing stdlib bugs is fine, but asking for the language to change just to make gateway interfaces a bit easier to write seems a bit much; I'd hope we can take Python the language as granted, and work from there.

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

From renesd at gmail.com Mon Sep 21 17:30:46 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 16:30:46 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921151951.2BD973A403D@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com>
Message-ID: <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>

On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby wrote:
> At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote:
>>
>> Anyway, for us slower (and maybe wrongly fearful) folks, could someone
>> summarize the benefits of having a WSGI specification that requires Unicode.
>> Bonus points for an explanation that does not boil down to "it will be
>> compatible with Python 3".
>
> +1. I'd really rather not have the spec dictated by the need to work around
> problems in the stdlib or language definition. Better to fix them ASAP.
>

hi,

here is a summary:

Apart from python3 compatibility (which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc).

Why not move to unicode?
From renesd at gmail.com Mon Sep 21 17:33:51 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Mon, 21 Sep 2009 16:33:51 +0100 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <21787a9f0909210827x292d76a2q754213a1b0d7b366@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <21787a9f0909210827x292d76a2q754213a1b0d7b366@mail.gmail.com> Message-ID: <64ddb72c0909210833l19f9bdb9ifa1092f6933d3bed@mail.gmail.com> On Mon, Sep 21, 2009 at 4:27 PM, James Bennett wrote: > On Mon, Sep 21, 2009 at 10:19 AM, P.J. Eby wrote: >> +1. ?I'd really rather not have the spec dictated by the need to work around >> problems in the stdlib or language definition. ?Better to fix them ASAP. > > This is a *Python* web server gateway interface, yes? Fixing stdlib > bugs is fine, but asking for the language to change just to make > gateway interfaces a bit easier to write seems a bit much; I'd hope we > can take Python the language as granted, and work from there. > > Hi, I mostly agree... However, python3.x changes are still up for grabs... so if there's a good enough reason, now is the time to ask for changes. I don't see them changing the way unicode, strings and bytes work too much though. cheers, From pje at telecommunity.com Mon Sep 21 17:42:35 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 21 Sep 2009 11:42:35 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.co m> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> Message-ID: <20090921154234.85B393A4156@sparrow.telecommunity.com> At 04:30 PM 9/21/2009 +0100, Ren? Dudfield wrote: >On Mon, Sep 21, 2009 at 4:19 PM, P.J. 
Eby wrote: > > At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote: > >> > >> Anyway, for us slower (and maybe wrongly fearful) folks, could someone > >> summarize the benefits of having a WSGI specification that > requires Unicode. > >> Bonus points for an explanation that does not boil down to "it will be > >> compatible with Python 3". > > > > +1. I'd really rather not have the spec dictated by the need to > work around > > problems in the stdlib or language definition. Better to fix them ASAP. > > > >hi, > >here is a summary: > Apart from python3 compatibility(which should be good enough >reason), utf-8 is what's used in http a lot these days. Most things >layered on top of wsgi are using utf-8 (django etc), and lots of web >clients are using utf-8 (firefox etc). Since WSGI is based on HTTP, please cite RFCs, not applications. Thanks. From pje at telecommunity.com Mon Sep 21 17:46:21 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 21 Sep 2009 11:46:21 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <21787a9f0909210827x292d76a2q754213a1b0d7b366@mail.gmail.co m> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <21787a9f0909210827x292d76a2q754213a1b0d7b366@mail.gmail.com> Message-ID: <20090921154620.799513A4156@sparrow.telecommunity.com> At 10:27 AM 9/21/2009 -0500, James Bennett wrote: >On Mon, Sep 21, 2009 at 10:19 AM, P.J. Eby wrote: > > +1. I'd really rather not have the spec dictated by the need to > work around > > problems in the stdlib or language definition. Better to fix them ASAP. > >This is a *Python* web server gateway interface, yes? Fixing stdlib >bugs is fine, but asking for the language to change just to make >gateway interfaces a bit easier to write seems a bit much; I'd hope we >can take Python the language as granted, and work from there. I'm not arguing that WSGI should dictate what Python 3 does. 
But if we're having so much trouble doing something so simple in a way that works on both Python 2 and Python 3, doesn't that suggest that anybody doing *anything* non-trivial is going to have similar problems?

This discussion has been making me wonder what other unicode/bytes problems I'm going to have on Python 3, and raising the ugly spectre of duplicated, type-specific APIs a la Java... only without the overloading that lets you give them the same method names. :-(

From renesd at gmail.com Mon Sep 21 18:00:06 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 17:00:06 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921154234.85B393A4156@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <20090921154234.85B393A4156@sparrow.telecommunity.com>
Message-ID: <64ddb72c0909210900k7076816ctee9a729050d2822d@mail.gmail.com>

On Mon, Sep 21, 2009 at 4:42 PM, P.J. Eby wrote:
> At 04:30 PM 9/21/2009 +0100, René Dudfield wrote:
>>
>> On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby wrote:
>> > At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote:
>> >>
>> >> Anyway, for us slower (and maybe wrongly fearful) folks, could someone
>> >> summarize the benefits of having a WSGI specification that requires
>> >> Unicode.
>> >> Bonus points for an explanation that does not boil down to "it will be
>> >> compatible with Python 3".
>> >
>> > +1. I'd really rather not have the spec dictated by the need to work
>> > around
>> > problems in the stdlib or language definition. Better to fix them ASAP.
>> >
>>
>> hi,
>>
>> here is a summary:
>>    Apart from python3 compatibility(which should be good enough
>> reason), utf-8 is what's used in http a lot these days.
>> Most things layered on top of wsgi are using utf-8 (django etc), and lots of web
>> clients are using utf-8 (firefox etc).
>
> Since WSGI is based on HTTP, please cite RFCs, not applications. Thanks.
>

Hi,

That seems a strange thing to say. HTTP use is based on not only RFCs but real applications. Web Server Gateway Interface is not just about HTTP obviously, and talks about python and web server issues... it hardly restricts itself to HTTP.

See IRIs:
http://www.w3.org/International/O-URL-and-ident.html

Which links to a number of things including rfc2718, which specifies utf-8 for URIs:
http://www.ietf.org/rfc/rfc2718.txt

Character encoding section:
"""Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 (RFC 2279) [3] and then subsequently using the %HH encoding for unsafe octets is recommended."""

Which seems sensible. Having fallback to the raw bytes available also seems sensible. For the reasons discussed in previous posts.

cheers,

From brian at briansmith.org Mon Sep 21 17:56:08 2009
From: brian at briansmith.org (Brian Smith)
Date: Mon, 21 Sep 2009 10:56:08 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921154234.85B393A4156@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <20090921154234.85B393A4156@sparrow.telecommunity.com>
Message-ID: <000001ca3ad4$0c23a1e0$246ae5a0$@org>

P.J. Eby wrote:
> Since WSGI is based on HTTP, please cite RFCs, not applications.
> Thanks.

RFC 3987 (the IRI specification) is the closest thing we have to an interoperable specification for internationalized URLs. It uses Unicode (UTF-8) exclusively.
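
The "translate to UTF-8, then %HH-escape" rule quoted from RFC 2718 above can be sketched with the Python 3 standard library. This is only an illustration of the encoding step, not part of any WSGI proposal; the function names `iri_path_to_uri_path` and `uri_path_to_bytes` are made up for the example.

```python
from urllib.parse import quote, unquote_to_bytes

def iri_path_to_uri_path(path):
    """Encode a Unicode path as UTF-8, then %HH-escape the unsafe octets."""
    # Hypothetical helper name; quote() does the UTF-8 + percent-encoding.
    return quote(path, safe="/", encoding="utf-8")

def uri_path_to_bytes(path):
    """Undo the %HH escaping only, yielding the raw octets a server receives."""
    return unquote_to_bytes(path)

escaped = iri_path_to_uri_path("/café/menu")
raw = uri_path_to_bytes(escaped)

# A path escaped by a legacy latin-1 client unescapes to octets that are
# not valid UTF-8, which is the case the thread keeps coming back to.
legacy_raw = uri_path_to_bytes("/caf%E9/menu")
```

The last line is why a fallback to raw bytes keeps being requested: nothing in the escaped form says which charset produced the octets.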
My own opinion is that WSGI for Python 3 (what people have been calling "WSGI 2" or "WSGI 3" in these threads) only needs to be defined for URLs that meet the requirements of RFC 3987. If you need non-UTF8 URLs then don't use WSGI on Python 3.

RE WSGI 1.0 vs 1.1 vs 2.0 vs. 3.0: That would make WSGI very confusing. Fix the inconsistent/underspecified parts of WSGI 1.0 in an updated version of PEP 333. That would be the only WSGI specification for Python 2.x. Whatever "WSGI 3.0" would become should be the only WSGI specification for Python 3. Punt on the idea of running WSGI 1.0 applications on Python 3 or WSGI 3.0 applications on Python 2.x.

Regards,
Brian

From ianb at colorstudy.com Mon Sep 21 18:09:24 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 21 Sep 2009 11:09:24 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB628C6.1000208@active-4.com>
References: <4AB628C6.1000208@active-4.com>
Message-ID:

On Sun, Sep 20, 2009 at 8:06 AM, Armin Ronacher wrote:
> Thanks to Graham Dumpleton and Robert Brewer there is some serious
> progress on WSGI currently. I proposed a roadmap with some PEP changes
> now that need some input.
>
> Summary:
>
>  WSGI 1.0       stays the same as PEP 0333 currently is
>  WSGI 1.1       becomes what Ian and I added to PEP 0333
>  WSGI 2.0       becomes a unicode powered version of WSGI 1.1
>  WSGI 3.0       becomes WSGI 2.0 just without start_response
>
>  WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python
>  3 because of changes in the standard library that no longer work with
>  a byte-only approach.

1.1 I think of as an errata on 1.0, so... simple enough.

I was skeptical about a unicode version of WSGI, but I think I'm okay with it now. For people who use UTF-8-only it should be fairly simple and easy; for people who want to deal with other encodings, backward compatible URLs, or other weirdness I think surrogateescape can resolve the small handful of problems.
Maybe an option to use latin1 (at the server level) would do the same for Python 2, as a deployment option for people who are dealing with these tricky issues. Which is kind of lame, but it means everything is still *possible*, and the use cases are somewhat obscure. Especially because QUERY_STRING and wsgi.input remain bytes. (Well, I guess the other case would be someone reading a cookie set by an application they do not control, and set in a crazy way... but anyway, there's a handful of use cases where things get tricky, but we can kind of punt, or try to implement the necessary transcoding routines before the spec is final.) I'm very much opposed to a second "raw" version of the request, as I do not like redundancy. With respect to 3.0/start_response, I'd rather we just do both at once, so there's not so many versions of WSGI to worry about. Also it doesn't feel like a very difficult change to make. The only other major issue is wsgi.input, which is a quite awkward interface to the request body. But I think resolving that is harder than start_response, in particular because there's no clear solution. Maybe at least switching to a file interface would be better. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker From fumanchu at aminus.org Mon Sep 21 18:38:53 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 21 Sep 2009 09:38:53 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090921152343.7CC3D3A4119@sparrow.telecommunity.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com><4AB715CA.1070404@active-4.com> <20090921152343.7CC3D3A4119@sparrow.telecommunity.com> Message-ID: P.J. Eby wrote: > At 07:57 AM 9/21/2009 +0200, Armin Ronacher wrote: > >Chris McDonough schrieb: > > > Personally, I find it a bit hard to get excited about Python 3 as a > > > web application deployment platform. > > Everybody feels that way currently. 
But if we don't fix WSGI that > > will never change. > > This is only compounding the errors introduced by the "make the tests > pass" philosophy of "porting" the stdlib. We should not make them > worse. > > At the moment (AFAIK) nobody has gone through the web bits of the > stdlib and asked, "Should this work on strings, bytes, or both, and > if both, how should that API be expressed?" Perhaps not, but I wrote unquote_bytes at PyCon 2009, after discussing urllib in the python-dev room and being told no bytes-compatible version was desired in the stdlib. So *some* thought has gone into it. Robert Brewer fumanchu at aminus.org From fumanchu at aminus.org Mon Sep 21 19:05:49 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 21 Sep 2009 10:05:49 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB77DDC.7030300@active-4.com> References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> Message-ID: Armin Ronacher wrote: > WSGI will demand UTF-8 URLs and only > provide iso-XXX support for backwards compatibility. WSGI cannot demand that; a recommendation for utf-8 in a few draft specifications is at least a decade removed from ubiquitous implementation. We can default to utf-8 at best. I discussed this at length in http://mail.python.org/pipermail/web-sig/2009-August/003948.html Robert Brewer fumanchu at aminus.org From g.brandl at gmx.net Mon Sep 21 19:37:30 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 21 Sep 2009 19:37:30 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <64ddb72c0909210833l19f9bdb9ifa1092f6933d3bed@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <21787a9f0909210827x292d76a2q754213a1b0d7b366@mail.gmail.com> <64ddb72c0909210833l19f9bdb9ifa1092f6933d3bed@mail.gmail.com> Message-ID: Ren? 
Dudfield schrieb: > On Mon, Sep 21, 2009 at 4:27 PM, James Bennett wrote: >> On Mon, Sep 21, 2009 at 10:19 AM, P.J. Eby wrote: >>> +1. I'd really rather not have the spec dictated by the need to work around >>> problems in the stdlib or language definition. Better to fix them ASAP. >> >> This is a *Python* web server gateway interface, yes? Fixing stdlib >> bugs is fine, but asking for the language to change just to make >> gateway interfaces a bit easier to write seems a bit much; I'd hope we >> can take Python the language as granted, and work from there. >> >> > > Hi, > > I mostly agree... However, python3.x changes are still up for > grabs... so if there's a good enough reason, now is the time to ask > for changes. That stage has already passed. It was true before 3.0, and even before 3.1. Now that 3.1 is out and labeled stable, the same backward-compatibility conventions as for 2.x are in effect; you will find much opposition on python-dev for such changes. That does not mean that no change is possible in 3.x, but that was also never the case for 2.x. Compatible changes are ok, and so are changes that undergo the proper deprecation process. Georg From renesd at gmail.com Mon Sep 21 19:57:03 2009 From: renesd at gmail.com (=?ISO-8859-1?Q?Ren=E9_Dudfield?=) Date: Mon, 21 Sep 2009 18:57:03 +0100 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> Message-ID: <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> On Mon, Sep 21, 2009 at 6:05 PM, Robert Brewer wrote: > Armin Ronacher wrote: >> WSGI will demand UTF-8 URLs and only >> provide iso-XXX support for backwards compatibility. > > WSGI cannot demand that; a recommendation for utf-8 in a few draft > specifications is at least a decade removed from ubiquitous > implementation. We can default to utf-8 at best. 
I discussed this at
> length in
> http://mail.python.org/pipermail/web-sig/2009-August/003948.html
>
>

Hi,

that post does have good arguments why "a single encoding is not acceptable". utf-8 seems the most common at this point to be the default... but we do need a way to specify encoding.

Is that what you're saying, Robert? Do you have a suggestion for specifying encodings?

I think surrogateescape will handle the issues with allowing bytes to be stored in utf-8.
http://www.python.org/dev/peps/pep-0383/

However, I think that is only implemented in python 3.1?... but maybe there is some way to have it work on other pythons too?

How about...

Being able to request which encoding you want has the benefit of only having to store one representation before 'baking' the result into the environ. So if someone only ever wants utf-8 they can get it... however if they choose to 'bake' the environ then they can request something else. This is similar to a per-server setting, but I think it should work with middleware too? Multiple versions of things would be available, and once baked, middleware that wants to modify things will need to change each version.

These 'baking' methods could live in wsgi to simplify modifying the environ's multiple versions of things. It would just have some get/set functions to put correct handling of encodings in one place. Of course middleware is still free to change things as it wants.

cheers,

From henry at precheur.org Mon Sep 21 19:58:40 2009
From: henry at precheur.org (Henry Precheur)
Date: Mon, 21 Sep 2009 10:58:40 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com>
Message-ID: <20090921175840.GA9880@banane.novuscom.net>

On Mon, Sep 21, 2009 at 11:09:24AM -0500, Ian Bicking wrote:
> I think surrogateescape can resolve the small handful of problems.

+1 surrogateescape would be a great alternative to the "try utf-8 then latin-1" approach.
It would simplify the gateway and the application. No need to check some 'encoding' variable and transcode later. We just encode everything to UTF-8, no special case.

surrogateescape isn't implemented (yet?) for Python 2. That's not an issue if the 'new' WSGI sticks to native strings.

--
Henry Precheur

From and-py at doxdesk.com Mon Sep 21 20:15:28 2009
From: and-py at doxdesk.com (And Clover)
Date: Mon, 21 Sep 2009 20:15:28 +0200
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB77DDC.7030300@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com>
Message-ID: <4AB7C2C0.1080304@doxdesk.com>

Armin Ronacher wrote:

> The middleware can never know.

It's much more likely to know than the server, though!

> WSGI will demand UTF-8 URLs and only
> provide iso-XXX support for backwards compatibility.

It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs break as soon as they coincidentally happen to be UTF-8 byte sequences.

I'm as much an advocate of "UTF-8 for everything everywhere!" as anyone else, but unfortunately today there are still dark places where you need non-UTF-8 URLs.

Incidentally, if wsgi.uri_encoding is going to be the way to signal that the server has decoded bytes to characters using a known encoding, it should be stressed that this should only be set when that encoding is certain. That is, wsgi.uri_encoding should be omitted (or None?) in cases where another party has already decoded (and maybe mangled) the bytes using an unknown encoding. In particular, CGI.

(In the case of Windows CGI the server will have decoded URI bytes into Unicode characters, using a charset which it is impossible to find out. In Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF-8 sequence, otherwise it's the system codepage. This problem affects the non-CGI implementation isapi_wsgi, too.
Then the variables are read as environment variables, which for Python 2 means another encode/decode step on Windows using the system codepage, mangling non-codepage characters. Python 3 has the opposite problem reading byte envvars using UTF-8, which won't be how Apache put them there.)

If wsgi.uri_encoding is obligatory then in reality it will often be wrong, leaving us in the same pathetic predicament as with WSGI 1.0, where non-ASCII URIs don't work reliably at all.

--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From fumanchu at aminus.org Mon Sep 21 20:23:47 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 21 Sep 2009 11:23:47 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com>
Message-ID:

René Dudfield wrote:
> On Mon, Sep 21, 2009 at 6:05 PM, Robert Brewer
> wrote:
> > Armin Ronacher wrote:
> >> WSGI will demand UTF-8 URLs and only
> >> provide iso-XXX support for backwards compatibility.
> >
> > WSGI cannot demand that; a recommendation for utf-8 in a few draft
> > specifications is at least a decade removed from ubiquitous
> > implementation. We can default to utf-8 at best. I discussed this at
> > length in
> > http://mail.python.org/pipermail/web-sig/2009-August/003948.html
> >
>
> that post does have good arguments why "a single encoding is not
> acceptable". utf-8 seems the most common at this point to be the
> default... but we do need a way to specify encoding.
>
> Is that what you're saying Robert? Do you have a suggestion for
> specifying encodings?
CherryPy 3.2 does this (pseudocode):

    try:
        decode_uri(userdefault or 'utf-8')
    except UnicodeDecodeError:
        decode_uri('iso-8859-1')

> I think surrogateescape will handle the issues with allowing bytes to
> be stored in utf-8.
> http://www.python.org/dev/peps/pep-0383/
>
> However, I think that is only implemented in python 3.1?... but maybe
> there is someway to have it work on other pythons too?

As Henry Precheur says, "that's not an issue if the 'new' WSGI sticks to native strings." Which I'd be happy with.

> How about...
>
> Being able to request which encoding you want has the benefit of only
> having to store one representation before 'baking' the result into the
> environ. So if someone only ever wants utf-8 they can get it...
> however if they choose to 'bake' the environ then they can request
> something else. This is similar to a per server setting, but I think
> should work with middleware too?

As noted above, it *is* a per-server setting in CherryPy 3.2. And any middleware can certainly be configured as its authors see fit; I don't see a need for a generic mechanism to specify what encodings middleware should try. However, we still need a generic mechanism declaring which encoding was successfully used; this is 'wsgi.uri_encoding'.

> As multiple things should be
> available, and if baked middleware (if it wants to modify things, will
> need to change each version of things).
>
> These 'baking' methods could live in wsgi to simplify modifying the
> environs multiple versions of things. It would just have some get/set
> functions to put correct handling of encodings in one place. Of
> course middleware is still free to change things as it wants.

I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request-URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once.
Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably. Robert Brewer fumanchu at aminus.org From armin.ronacher at active-4.com Mon Sep 21 21:14:13 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Mon, 21 Sep 2009 21:14:13 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090921175840.GA9880@banane.novuscom.net> References: <4AB628C6.1000208@active-4.com> <20090921175840.GA9880@banane.novuscom.net> Message-ID: <4AB7D085.7090503@active-4.com> Hi, Henry Precheur schrieb: > surrogateescape isn't implemented (yet?) for Python 2. That's not an > issue if the 'new' WSGI sticks to native strings. So the same standard should have different behavior on different Python versions? That would make framework code a lot more complicated. Regards, Armin From pje at telecommunity.com Mon Sep 21 21:31:28 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 21 Sep 2009 15:31:28 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> Message-ID: <20090921193128.18C623A407A@sparrow.telecommunity.com> At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote: >I still don't see why the environ should have multiple versions of >anything. It's not as if the HTTP request gives us multiple >Request-URI's. There's a single processing step that has to happen >somewhere: decoding the bytes of the Request-URI to unicode. For the >vast majority of apps, it should only happen once. Twice is >acceptable to me for some apps. 
As I pointed out in the linked >email, doing that as soon as possible (i.e. in the WSGI origin >server) allows URI's to be compared as character strings more >easily. If you deploy a piece of middleware that transcodes (based >on more information than servers want to deal with), it had better >be nearly first in the stack so routing works reliably. The problem with this whole approach is that it's not composable. You can't stick in an application under a router that uses a different method for grokking its subtree of the URI space, unless it knows what's been done to the URI and can un-do it. Maybe I'm missing something here, but the only way I see to preserve composability here is to use latin-1 or bytes. The fundamental problem is that, like it or not, HTTP headers are actually byte strings. The *only* reason we ever supported unicode in WSGI was to handle platforms where there's no such thing as a non-unicode string, and there we made it explicit that it's just a way of manipulating *bytes*, not unicode. ISTM that very few (if any) of the proposals floating around for modifying WSGI are taking this concept into account. Most of them sound to me like people saying, "yeah, but this particular hack will work for *my* apps... so everybody else must be doing something stupid." But WSGI was built on the principle of *equally inconveniencing everyone*, specifically to avoid an impossible attempt at consensus between incompatible ways of doing things. (E.g., nine million request/response APIs.) So, if the only problem we're going to cause by using bytes everywhere is to make everyone need to change their routing code on Python 3, I vote +1000. 
;-)

From renesd at gmail.com Mon Sep 21 21:49:01 2009
From: renesd at gmail.com (René Dudfield)
Date: Mon, 21 Sep 2009 20:49:01 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921193128.18C623A407A@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com>
Message-ID: <64ddb72c0909211249k41ef5a2ax57d1d5b21d9fdba9@mail.gmail.com>

On Mon, Sep 21, 2009 at 8:31 PM, P.J. Eby wrote:
> At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
>>
>> I still don't see why the environ should have multiple versions of
>> anything. It's not as if the HTTP request gives us multiple Request-URI's.
>> There's a single processing step that has to happen somewhere: decoding the
>> bytes of the Request-URI to unicode. For the vast majority of apps, it
>> should only happen once. Twice is acceptable to me for some apps. As I
>> pointed out in the linked email, doing that as soon as possible (i.e. in the
>> WSGI origin server) allows URI's to be compared as character strings more
>> easily. If you deploy a piece of middleware that transcodes (based on more
>> information than servers want to deal with), it had better be nearly first
>> in the stack so routing works reliably.
>
> The problem with this whole approach is that it's not composable. You can't
> stick in an application under a router that uses a different method for
> grokking its subtree of the URI space, unless it knows what's been done to
> the URI and can un-do it.
>

It seems latin-1 has the same problem. If middleware makes an arbitrary change, how can later things know what it's done?
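
For readers following the latin-1 point in the quoted text: a minimal Python 3 sketch of what "latin-1 as a way of manipulating bytes" means. It only shows that the round trip is lossless; it does not answer the question of how later components learn about changes made by middleware. The helper names `to_native` and `reinterpret` are hypothetical.

```python
def to_native(raw):
    """Carry arbitrary bytes in a str: byte N becomes code point N."""
    return raw.decode("latin-1")

def reinterpret(native, encoding):
    """Recover the original bytes, then decode them with the real charset."""
    return native.encode("latin-1").decode(encoding)

raw_path = "/café".encode("utf-8")      # what arrives on the wire
native = to_native(raw_path)            # looks like mojibake, loses nothing
recovered = native.encode("latin-1")    # identical to raw_path
as_text = reinterpret(native, "utf-8")

# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
round_trip = to_native(all_bytes).encode("latin-1")
```

The cost of this convention is that `native` is not readable text until some component applies `reinterpret` with the right charset.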
From fumanchu at aminus.org Mon Sep 21 22:15:09 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 21 Sep 2009 13:15:09 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921193128.18C623A407A@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com>
Message-ID:

P.J. Eby wrote:
> At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
> >I still don't see why the environ should have multiple versions of
> >anything. It's not as if the HTTP request gives us multiple
> >Request-URI's. There's a single processing step that has to happen
> >somewhere: decoding the bytes of the Request-URI to unicode. For the
> >vast majority of apps, it should only happen once. Twice is
> >acceptable to me for some apps. As I pointed out in the linked
> >email, doing that as soon as possible (i.e. in the WSGI origin
> >server) allows URI's to be compared as character strings more
> >easily. If you deploy a piece of middleware that transcodes (based
> >on more information than servers want to deal with), it had better
> >be nearly first in the stack so routing works reliably.
>
> The problem with this whole approach is that it's not
> composable. You can't stick in an application under a router that
> uses a different method for grokking its subtree of the URI space,
> unless it knows what's been done to the URI and can un-do it.

I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are unicode, the only answer to "what's been done to the URI?" can be "wsgi.uri_encoding", which allows someone to un-do it. What more do you want?

1. bytes arrive. server decodes with utf8, sets 'wsgi.uri_encoding' to 'utf-8'.
2. middleware says "oops, that's wrong". encodes back to bytes using 'utf-8', and re-decodes with koi-8, changing wsgi.uri_encoding to 'koi-8'
3.
further middlewares and app use the unicode value, and don't really care what encoding was used.

> Maybe I'm missing something here, but the only way I see to preserve
> composability here is to use latin-1 or bytes.
>
> The fundamental problem is that, like it or not, HTTP headers are
> actually byte strings. The *only* reason we ever supported unicode
> in WSGI was to handle platforms where there's no such thing as a
> non-unicode string, and there we made it explicit that it's just a
> way of manipulating *bytes*, not unicode.
>
> ISTM that very few (if any) of the proposals floating around for
> modifying WSGI are taking this concept into account. Most of them
> sound to me like people saying, "yeah, but this particular hack will
> work for *my* apps... so everybody else must be doing something
> stupid."
>
> But WSGI was built on the principle of *equally inconveniencing
> everyone*, specifically to avoid an impossible attempt at consensus
> between incompatible ways of doing things. (E.g., nine million
> request/response APIs.)
>
> So, if the only problem we're going to cause by using bytes
> everywhere is to make everyone need to change their routing code on
> Python 3, I vote +1000. ;-)

That's not the only problem. Using native strings wherever possible makes web programming in Python easier, regardless of version. In Python 3, that happens to be unicode, for good reasons. For HTTP, there's a more specific reason: URI's should be compared for equivalence character by character, not byte by byte. See http://tools.ietf.org/html/rfc3986#section-6.2.1. That includes routing middleware.

Robert Brewer
fumanchu at aminus.org

From pje at telecommunity.com Mon Sep 21 23:24:13 2009
From: pje at telecommunity.com (P.J.
Eby) Date: Mon, 21 Sep 2009 17:24:13 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> Message-ID: <20090921212414.454C43A407A@sparrow.telecommunity.com> At 01:15 PM 9/21/2009 -0700, Robert Brewer wrote: >I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are >unicode, the only answer to "what's been done to the URI?" can be >"wsgi.uri_encoding", which allows someone to un-do it. What more do you want? To be sure that there's no possible way for all the broken middleware out there to mess this up. Let me put it this way: out of all the times I've seen people post example WSGI 1 middleware code, I don't remember *any* where the middleware was actually complying with the spec correctly... and that includes examples I wrote myself. So I'm not real impressed with any solution that requires middleware to get it right. That having been said, I'm beginning to think that PEP 383 (surrogateescape) is actually the way to go, now that I've looked over the PEP, docs, and Ian's posts here about it. First, it's compatible with CGI (os.environ) right off the bat, as well as being the standard way to handle this sort of issue in Python 3. Second, it's redundancy-free: you don't need a separate environ key to know what's going on. Third, it's unconditional: if you want bytes or a non-UTF-8 encoding you perform the same steps every time. Up until now, I've not paid much attention because so many people kept saying you can't get surrogateescape on Python 2. However, that's only an issue for code that *needs the original byte string*, as the old codec error handler API is sufficient for doing decoding. (Meaning you could register a handler for it on older Pythons.) 
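
The remark about registering a handler through the codec error handler API can be illustrated as follows. This is a sketch written for Python 3 so it can be checked against the built-in `surrogateescape`; the handler name `surrogateescape_sketch` and the function name are made up for the example, and a Python 2 version would need `unichr` and `ord` in place of `chr` and integer iteration.

```python
import codecs

def surrogateescape_decode_only(exc):
    """Decode-side imitation of PEP 383: map each undecodable byte
    0x80..0xFF to the lone surrogate U+DC80..U+DCFF."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # encoding is deliberately not handled in this sketch
    bad = exc.object[exc.start:exc.end]
    if any(b < 0x80 for b in bad):
        raise exc  # PEP 383 never escapes ASCII bytes
    return "".join(chr(0xDC00 + b) for b in bad), exc.end

# Hypothetical handler name, registered with the real stdlib API.
codecs.register_error("surrogateescape_sketch", surrogateescape_decode_only)

raw = b"/caf\xe9"  # latin-1 octets, not valid UTF-8
text = raw.decode("utf-8", "surrogateescape_sketch")

# Getting the original bytes back is the part that needs the real handler.
restored = text.encode("utf-8", "surrogateescape")
```

The sketch covers decoding only, which matches the caveat in the text: code that needs the original byte string back is where an older Python would fall short.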
I think this approach would let us have our cake and eat it too, for the most part. WSGI on Python 2.x uses byte strings for these, and then 3.x works transparently. It's a bit of a stretch to call it a "clarification" of WSGI 1.0, but since for all intents and purposes WSGI doesn't really *run* on Python 3, it might be the way to go. To be clear, I'm talking about simply allowing (on Python 3 and in WSGI versions>1.0) for all environ values to be utf-8-decoded, surrogate-escaped unicode values, in the "native string" case. (This would further imply that a CGI gateway would have to check whether the system encoding is UTF-8, and if not, transcode accordingly.) From henry at precheur.org Mon Sep 21 22:52:18 2009 From: henry at precheur.org (Henry Precheur) Date: Mon, 21 Sep 2009 13:52:18 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB7D085.7090503@active-4.com> References: <4AB628C6.1000208@active-4.com> <20090921175840.GA9880@banane.novuscom.net> <4AB7D085.7090503@active-4.com> Message-ID: <20090921205218.GA222@banane.novuscom.net> On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote: > So the same standard should have different behavior on different Python > versions? That would make framework code a lot more complicated. I don't understand why it would be 'a lot more' complicated. 
(The following code snippets are Python 3 only, and assume we're using 'native strings' everywhere)

In the gateway, environ would be populated this way:

    environ['some_key'] = some_value.decode('utf8', 'surrogateescape')

Compare that to the utf-8-then-latin-1 alternative:

    try:
        environ['some_key'] = some_value.decode('utf-8')
        environ['some_key.encoding'] = 'utf-8'
    except UnicodeError:
        environ['some_key'] = some_value.decode('latin-1')
        environ['some_key.encoding'] = 'latin-1'

What you would have in the application to get the original value:

    environ['some_key'].encode('utf8', 'surrogateescape')

With utf8-then-latin1:

    environ['some_key'].encode(environ['some_key.encoding'])

The 'surrogateescape' way is clearly simpler.

The 'equivalent' Python 2 code is even simpler:

    environ['some_key'] = some_value

And:

    environ['some_key']

--
Henry Precheur

From fumanchu at aminus.org Tue Sep 22 00:26:35 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 21 Sep 2009 15:26:35 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921205218.GA222@banane.novuscom.net>
References: <4AB628C6.1000208@active-4.com><20090921175840.GA9880@banane.novuscom.net><4AB7D085.7090503@active-4.com> <20090921205218.GA222@banane.novuscom.net>
Message-ID:

Henry Precheur wrote:
> On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote:
> > So the same standard should have different behavior on different
> > Python versions? That would make framework code a lot more complicated.
>
> I don't understand why it would be 'a lot more' complicated.
>
> (The following code snippets are Python 3 only, and assume we're using
> 'native strings' everywhere.)
>
> In the gateway, environ would be populated this way:
>
>     environ['some_key'] = some_value.decode('utf8', 'surrogateescape')
>
> Compare that to the utf-8-then-latin-1 alternative:
>
>     try:
>         environ['some_key'] = some_value.decode('utf-8')
>         environ['some_key.encoding'] = 'utf-8'
>     except UnicodeError:
>         environ['some_key'] = some_value.decode('latin-1')
>         environ['some_key.encoding'] = 'latin-1'
>
> What you would have in the application to get the original value:
>
>     environ['some_key'].encode('utf8', 'surrogateescape')
>
> With utf8-then-latin1:
>
>     environ['some_key'].encode(environ['some_key.encoding'])
>
> The 'surrogateescape' way is clearly simpler.

It looks simpler until you have a site that is not primarily utf-8. In
that case, you multiply your (1 line * number of middlewares in the WSGI
stack * each request). With wsgi.uri_encoding you get either (1 line * 1
middleware designed to transcode * each request), or even 0 if your
whole site uses just one charset.

Robert Brewer
fumanchu at aminus.org

From graham.dumpleton at gmail.com Tue Sep 22 00:45:31 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 08:45:31 +1000
Subject: [Web-SIG] buffer used by socket, should also work with python stdlib Re: Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909210558r6b9f65b9wef2d360e205644b2@mail.gmail.com>
References: <64ddb72c0909210415y1991807bwf977b6226f687fe9@mail.gmail.com> <88e286470909210440q3f9720f3ke70898b0fa3f50aa@mail.gmail.com> <64ddb72c0909210525s58dc2705x339317a825470f@mail.gmail.com> <4AB7714A.2060502@active-4.com> <64ddb72c0909210558r6b9f65b9wef2d360e205644b2@mail.gmail.com>
Message-ID: <88e286470909211545t2e1b0d54xc7cf422431d6b6a3@mail.gmail.com>

2009/9/21 René Dudfield:
> On Mon, Sep 21, 2009 at 1:27 PM, Armin Ronacher wrote:
>> Hi,
>>
>> René Dudfield wrote:
>>> That's all the arguing and explaining I'll do on this - I'm not going
>>> to rewrite cherrypy for you as proof.
>> If it just puts a burden on implementors on the client and server side
>> and there is no proof for it to be faster for real world applications we
>> can probably just ignore that then.
>>
>>
>> Regards,
>> Armin
>>
>
> hi,
>
> yes I think ignoring it for now is a good idea.
>
>
> However, it could be a good addition to a future spec.
>
> Currently wsgi forces anything built on top to be able to not use them.
>
> It's zero extra work for implementors who don't want to specify a
> buffer. Implementors and clients can just not pass in or use a
> destination buffer.
>
> # non caring use:
> buf = recv(socket, nbytes)
>
> # buffer caring use:
> buffer = pool.get_buffer()
> buf = recv(socket, nbytes, buffer)
>
> So I don't see it as a burden to use for people who don't care about it.
>
>
> To explain the mmap use case more clearly... you could pass in a
> memory mapped buffer to allow the process to write to disk directly...
> or as shared memory so other processes can mmap the data and process
> it. Rather than sending your data over a pipe (as in fastcgi), you can
> just access it directly.
>
> As another piece of evidence that it is faster to use buffers, rather
> than allocate all the time, nginx uses memory pools. So does
> apache... and lighttpd...

WSGI is specifically intended as a Python specific API definition only.
It isn't and will never be expanded to also encompass a wire protocol,
or provide direct support for a foreign wire protocol, for communication
across a socket connection or to enable optimisations across such a
connection specific to some existing wire protocol. The whole point of
WSGI is that it is the lowest common denominator and really really
simple.
That said, wsgi.file_wrapper already provides a rather large hole for at
least some optimisations in returning of response data back via the
client connection, albeit that not many WSGI server implementations
provide such optimisations.

The only constraint on wsgi.file_wrapper is that the object supplied to
it be file like to the extent of providing a read() method. This though
is a fallback purely for the case where the specific WSGI server cannot
implement optimisations based on the actual type of the file like object
supplied to it. In that case the wsgi.file_wrapper instance will act
just like a normal iterable and so has to be able to read data in chunks
from the file like object.

In Apache/mod_wsgi, if the argument to wsgi.file_wrapper is a file like
object which provides a fileno() and tell() method, then on UNIX systems
it will already optimise the return of the file contents by using
sendfile() or memory mapping techniques.

People have even used a small wrapper class around an instance of a
Python mmap object to allow fileno() and tell() to be visible together
to satisfy that requirement and so have been able to implement optimised
return of mmap'd data via Apache/mod_wsgi.

In other words, Apache/mod_wsgi already provides mechanisms which avoid
any in process memory copies when returning open files and/or memory
mapped files.

A WSGI server could already, if it wanted, provide a feature whereby it
allowed wsgi.file_wrapper to accept a special object which wrapped your
'buffer' data, treated that specially, and used the mechanisms you
describe to send that buffer using optimised means directly out onto a
socket connection with no additional copies involved. The only
requirement is that the special object supply read()/close() methods as
appropriate so that it will work for WSGI servers that don't implement
your optimisation.

No changes are required to the WSGI specification for this part to be
done now.
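
As a rough illustration of the read()/close() contract described above,
here is a minimal sketch. It is not mod_wsgi code: BufferFile and app
are made-up names, and wsgiref's FileWrapper merely stands in for
whatever wsgi.file_wrapper a real server would put in the environ. A
server that recognised the object type could send the data by optimised
means; any other server just reads it in chunks.

```python
import io
from wsgiref.util import FileWrapper  # stdlib stand-in for a server's wsgi.file_wrapper


class BufferFile:
    """Hypothetical file-like view of an in-memory buffer."""

    def __init__(self, data):
        # BytesIO copies the data; a real optimisation would avoid that copy.
        self._stream = io.BytesIO(data)

    def read(self, size=-1):
        # The one method wsgi.file_wrapper requires of its argument.
        return self._stream.read(size)

    def close(self):
        self._stream.close()


def app(environ, start_response):
    body = BufferFile(b'hello world\n')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    wrapper = environ.get('wsgi.file_wrapper')
    if wrapper is not None:
        # The server decides how cleverly to send the object.
        return wrapper(body, 8192)
    # No wrapper offered: fall back to plain chunked iteration.
    return iter(lambda: body.read(8192), b'')
```

Calling app({'wsgi.file_wrapper': FileWrapper}, start_response) and
app({}, start_response) should yield the same bytes, which is the point
of the fallback.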
Thus, all you need to do is convince the author of an existing pure
Python WSGI server to provide the feature, or take one of the WSGI
servers yourself and implement it.

Graham

From henry at precheur.org Tue Sep 22 01:09:52 2009
From: henry at precheur.org (Henry Precheur)
Date: Mon, 21 Sep 2009 16:09:52 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <20090921205218.GA222@banane.novuscom.net>
Message-ID: <20090921230952.GA13477@banane.novuscom.net>

On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
> It looks simpler until you have a site that is not primarily utf-8. In
> that case, you multiply your (1 line * number of middlewares in the WSGI
> stack * each request).
> With wsgi.uri_encoding you get either (1 line * 1
> middleware designed to transcode * each request), or even 0 if your
> whole site uses just one charset.

I am not sure I understand your point.

The 0 lines hold true if the whole site is using latin-1 or utf-8 and
you write your applications/middlewares only for this site. But if it's
using any other encoding you still have to transcode.

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode('utf8', 'surrogateescape').\
            decode(SITE_ENCODING)
        ...

With wsgi.uri_encoding you would still have to do the following:

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode(environ['some_key.encoding']).\
            decode(SITE_ENCODING)
        ...

Of course you can directly use `environ['some_key']` if you know you'll
get the 'right' encoding all the time. But when the encoding changes,
you'll have to fix all your middlewares.

Am I missing something?
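
For what it's worth, the surrogateescape round trip can be checked with
a few lines of Python 3. This is only a sketch: SITE_ENCODING and the
three helper names are made up for illustration, and the sample path is
just a few Greek letters encoded as iso-8859-7, which is not valid
UTF-8.

```python
# Sketch only: SITE_ENCODING and the helper names are illustrative assumptions.
SITE_ENCODING = 'iso-8859-7'


def raw_to_native(raw):
    """Gateway side: bytes -> str, undecodable bytes become lone surrogates."""
    return raw.decode('utf-8', 'surrogateescape')


def native_to_raw(value):
    """Application side: recover exactly the bytes the gateway received."""
    return value.encode('utf-8', 'surrogateescape')


def native_to_site(value, encoding=SITE_ENCODING):
    """Transcode a native string to text in the site's real encoding."""
    return native_to_raw(value).decode(encoding)


# '/' followed by alpha, beta, gamma, as a browser might send it in iso-8859-7.
raw_path = '/\u03b1\u03b2\u03b3'.encode('iso-8859-7')
native_path = raw_to_native(raw_path)

# The round trip is lossless even though the bytes were not UTF-8.
assert native_to_raw(native_path) == raw_path
assert native_to_site(native_path) == '/\u03b1\u03b2\u03b3'
```

Valid UTF-8 input passes through raw_to_native() unchanged, so the same
two helpers cover both cases.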
--
Henry Prêcheur

From graham.dumpleton at gmail.com Tue Sep 22 01:16:02 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 09:16:02 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921230952.GA13477@banane.novuscom.net>
References: <20090921205218.GA222@banane.novuscom.net> <20090921230952.GA13477@banane.novuscom.net>
Message-ID: <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>

2009/9/22 Henry Precheur:
> On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
>> It looks simpler until you have a site that is not primarily utf-8. In
>> that case, you multiply your (1 line * number of middlewares in the WSGI
>> stack * each request).
>> With wsgi.uri_encoding you get either (1 line * 1
>> middleware designed to transcode * each request), or even 0 if your
>> whole site uses just one charset.
>
> I am not sure I understand your point.
>
> The 0 lines hold true if the whole site is using latin-1 or utf-8 and
> you write your applications/middlewares only for this site. But if it's
> using any other encoding you still have to transcode.
>
>     def middleware(environ, start_response):
>         value = environ['some_key'].\
>             encode('utf8', 'surrogateescape').\
>             decode(SITE_ENCODING)
>         ...
>
> With wsgi.uri_encoding you would still have to do the following:
>
>     def middleware(environ, start_response):
>         value = environ['some_key'].\
>             encode(environ['some_key.encoding']).\
>             decode(SITE_ENCODING)
>         ...
>
> Of course you can directly use `environ['some_key']` if you know you'll
> get the 'right' encoding all the time. But when the encoding changes,
> you'll have to fix all your middlewares.
>
>
> Am I missing something?

For one, we aren't talking about arbitrary keys needing this treatment.

We are only talking about SCRIPT_NAME and PATH_INFO.
Everything else from CGI will be passed as ISO-8859-1 and it is up to
WSGI components/applications to explicitly worry about those if they
need to deal with them in special ways. E.g., REQUEST_URI, QUERY_STRING,
HTTP_COOKIE, HTTP_REFERER.

Thus, your use of 'some_key' all the time is a bit confusing when just
trying to scan the emails quickly.

Graham

From pje at telecommunity.com Tue Sep 22 03:01:07 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Mon, 21 Sep 2009 21:01:07 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
References: <20090921205218.GA222@banane.novuscom.net> <20090921230952.GA13477@banane.novuscom.net> <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
Message-ID: <20090922010107.CE67E3A407A@sparrow.telecommunity.com>

At 09:16 AM 9/22/2009 +1000, Graham Dumpleton wrote:
>For one, we aren't talking about arbitrary keys needing this treatment.
>
>We are only talking about SCRIPT_NAME and PATH_INFO.
>
>Everything else from CGI will be passed as ISO-8859-1 and up to WSGI
>components/applications to explicitly worry about those if need to
>deal with them in special ways. Eg., REQUEST_URI, QUERY_STRING,
>HTTP_COOKIE, HTTP_REFERER.

I'm not really thrilled with the idea of encoding different values
differently, because it means that many more things for an implementer
to remember to do correctly, but for which they receive no guidance or
error messages if they get it wrong at first.

One big benefit of surrogateescape is that it maintains a certain
symmetry between Python 2 and 3 wrt os.environ and CGI. That is, you
can in principle just throw a few extra keys into a copy of os.environ
and have a valid wsgi environment. (At least, in places where the
system encoding is utf8, anyway.)

If you don't make it uniform across all CGI keys, then you have to
write a more complex adapter, and at every level you need to remember
what sort of key you're touching.
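
The "copy of os.environ plus a few extra keys" idea can be sketched
roughly as follows. This is an assumption-laden illustration, not a
complete gateway: cgi_environ is a made-up name, the wsgi.* keys and
values are the ones listed in the WSGI spec for a CGI-style gateway,
and on POSIX Python 3 already decodes os.environ with the filesystem
encoding and the surrogateescape handler.

```python
import os
import sys


def cgi_environ(base=None):
    """Hypothetical helper: build a WSGI environ from CGI-style variables."""
    # Copy, so the process environment itself is never modified.
    environ = dict(os.environ if base is None else base)
    https = environ.get('HTTPS', 'off') in ('on', '1')
    environ.update({
        'wsgi.version': (1, 0),
        'wsgi.url_scheme': 'https' if https else 'http',
        # Request bodies are bytes; use the binary layer of stdin if present.
        'wsgi.input': getattr(sys.stdin, 'buffer', sys.stdin),
        'wsgi.errors': sys.stderr,
        'wsgi.multithread': False,
        'wsgi.multiprocess': True,
        'wsgi.run_once': True,
    })
    return environ
```

With one uniform decoding rule every CGI key takes the same path through
the dict copy. With per-key rules the helper would need a table saying
which keys get which treatment.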
From pje at telecommunity.com Tue Sep 22 03:03:06 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Mon, 21 Sep 2009 21:03:06 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <20090921175840.GA9880@banane.novuscom.net> <4AB7D085.7090503@active-4.com> <20090921205218.GA222@banane.novuscom.net>
Message-ID: <20090922010305.9C0D63A407A@sparrow.telecommunity.com>

At 03:26 PM 9/21/2009 -0700, Robert Brewer wrote:
>It looks simpler until you have a site that is not primarily utf-8. In
>that case, you multiply your (1 line * number of middlewares in the WSGI
>stack * each request). With wsgi.uri_encoding you get either (1 line * 1
>middleware designed to transcode * each request), or even 0 if your
>whole site uses just one charset.

Unless I'm misunderstanding something, you end up adding an extra "if"
statement *everywhere*, to check whether wsgi.uri_encoding is what you
want it to be or not.

(Btw, this whole notion of talking about WSGI "sites" also doesn't make
sense, since WSGI doesn't have "sites", it has recursively-composable
application objects. Sure, if you're using a monolithic framework, you
can think of applications as unified entities, but that's not true of
WSGI as a whole.)

From mnot at mnot.net Tue Sep 22 03:21:03 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 11:21:03 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB6413C.5030001@active-4.com>
References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> <4AB6413C.5030001@active-4.com>
Message-ID: <3A2DCF6D-B8A2-4855-B809-DCD10180A015@mnot.net>

HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but
HTTPbis currently takes the stance that they're ASCII, as in practice
Latin-1 isn't used and may introduce interop problems.
> Historically, HTTP has allowed field-content with text in the
> ISO-8859-1 [ISO-8859-1] character encoding (allowing other character
> sets through use of [RFC2047] encoding). In practice, most HTTP header
> field-values use only a subset of the US-ASCII charset [USASCII].
> Newly defined header fields SHOULD constrain their field-values to
> US-ASCII characters. Recipients SHOULD treat other (obs-text) octets
> in field-content as opaque data.

What does it mean to "support non-ASCII headers"? As per above, the only
sane thing to do is treat them as opaque data, because you can't be
certain of their encoding unless you have knowledge of the header.

On 21/09/2009, at 12:50 AM, Armin Ronacher wrote:
> Also (something I haven't yet filed as a bug because I guess there will
> be more changes involved) the HTTP server in Python 3.1 does not support
> non-ASCII headers.

--
Mark Nottingham http://www.mnot.net/

From mnot at mnot.net Tue Sep 22 03:28:00 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 11:28:00 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921193128.18C623A407A@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com>
Message-ID: <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net>

+1. There is no one answer for these issues (e.g., URI->IRI conversion
can lose information), so low-level infrastructure like WSGI shouldn't
be making choices for people.

On 22/09/2009, at 5:31 AM, P.J. Eby wrote:
> At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
>> I still don't see why the environ should have multiple versions of
>> anything. It's not as if the HTTP request gives us multiple
>> Request-URI's.
>> There's a single processing step that has to happen
>> somewhere: decoding the bytes of the Request-URI to unicode. For
>> the vast majority of apps, it should only happen once. Twice is
>> acceptable to me for some apps. As I pointed out in the linked
>> email, doing that as soon as possible (i.e. in the WSGI origin
>> server) allows URI's to be compared as character strings more
>> easily. If you deploy a piece of middleware that transcodes (based
>> on more information than servers want to deal with), it had better
>> be nearly first in the stack so routing works reliably.
>
> The problem with this whole approach is that it's not composable.
> You can't stick in an application under a router that uses a
> different method for grokking its subtree of the URI space, unless
> it knows what's been done to the URI and can un-do it.
>
> Maybe I'm missing something here, but the only way I see to preserve
> composability here is to use latin-1 or bytes.
>
> The fundamental problem is that, like it or not, HTTP headers are
> actually byte strings. The *only* reason we ever supported unicode
> in WSGI was to handle platforms where there's no such thing as a
> non-unicode string, and there we made it explicit that it's just a way
> of manipulating *bytes*, not unicode.
>
> ISTM that very few (if any) of the proposals floating around for
> modifying WSGI are taking this concept into account. Most of them
> sound to me like people saying, "yeah, but this particular hack will
> work for *my* apps... so everybody else must be doing something
> stupid."
>
> But WSGI was built on the principle of *equally inconveniencing
> everyone*, specifically to avoid an impossible attempt at consensus
> between incompatible ways of doing things. (E.g., nine million
> request/response APIs.)
>
> So, if the only problem we're going to cause by using bytes
> everywhere is to make everyone need to change their routing code on
> Python 3, I vote +1000.
> ;-)

--
Mark Nottingham http://www.mnot.net/

From mnot at mnot.net Tue Sep 22 03:25:00 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 11:25:00 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
Message-ID:

Most things is not the Web. How will you handle serving images through
WSGI? Compressed content? PDFs?

On 22/09/2009, at 1:30 AM, René Dudfield wrote:
> here is a summary:
> Apart from python3 compatibility (which should be good enough
> reason), utf-8 is what's used in http a lot these days. Most things
> layered on top of wsgi are using utf-8 (django etc), and lots of web
> clients are using utf-8 (firefox etc).
>
> Why not move to unicode?

--
Mark Nottingham http://www.mnot.net/

From mdipierro at cs.depaul.edu Tue Sep 22 03:38:05 2009
From: mdipierro at cs.depaul.edu (Massimo Di Pierro)
Date: Mon, 21 Sep 2009 20:38:05 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net>
Message-ID: <5570CCEC-E6EA-43BD-B168-9675CF530764@cs.depaul.edu>

+1

On Sep 21, 2009, at 8:28 PM, Mark Nottingham wrote:
> +1. There is no one answer for these issues (e.g., URI->IRI conversion
> can lose information), so low-level infrastructure like WSGI shouldn't
> be making choices for people.

From graham.dumpleton at gmail.com Tue Sep 22 04:07:27 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 12:07:27 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
Message-ID: <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>

2009/9/22 Mark Nottingham:
> Most things is not the Web. How will you handle serving images through WSGI?
> Compressed content? PDFs?

You are perhaps misunderstanding something. A WSGI application still
should return bytes.

The whole concept of any sort of fallback to allow unicode data to be
returned for response content was purely so the canonical hello world
application as per Python 2.X could still be used on Python 3.X.
So, we aren't saying that the only thing WSGI applications can return
is unicode strings for response content.

Have you read my original blog post that triggered all this discussion
this time around?

Graham

From mnot at mnot.net Tue Sep 22 04:21:40 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 12:21:40 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
Message-ID:

Reference?

On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:
> You are perhaps misunderstanding something. A WSGI application still
> should return bytes.
>
> The whole concept of any sort of fallback to allow unicode data to be
> returned for response content was purely so the canonical hello world
> application as per Python 2.X could still be used on Python 3.X.
>
> So, we aren't saying that the only thing WSGI applications can return
> is unicode strings for response content.
>
> Have you read my original blog post that triggered all this discussion
> this time around?
>
> Graham

--
Mark Nottingham http://www.mnot.net/

From fumanchu at aminus.org Tue Sep 22 04:21:54 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 21 Sep 2009 19:21:54 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net>
Message-ID:

I've never proposed that WSGI make choices for people. I'm simply saying
that a configurable server, with a sane, perfectly-reversible default,
is the simplest thing that could possibly work.

Robert Brewer
fumanchu at aminus.org

> -----Original Message-----
> From: Mark Nottingham [mailto:mnot at mnot.net]
> Sent: Monday, September 21, 2009 6:28 PM
> To: P.J. Eby
> Cc: Robert Brewer; René Dudfield; Web SIG
> Subject: Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
>
> +1. There is no one answer for these issues (e.g., URI->IRI conversion
> can lose information), so low-level infrastructure like WSGI shouldn't
> be making choices for people.

From graham.dumpleton at gmail.com Tue Sep 22 04:26:07 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 12:26:07 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
Message-ID: <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>

2009/9/22 Mark Nottingham:
> Reference?

See:

http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

Anyone else jumping in on this conversation with their own opinions and
who has not read it, should perhaps at least read that.

Also read some of the earlier posts in the numerous discussions this
spawned at:

http://groups.google.com/group/python-web-sig?lnk=

as the current thinking isn't exactly what I blogged about and has
shifted a bit as the discussion has progressed.
Graham

From graham.dumpleton at gmail.com Tue Sep 22 04:35:52 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 12:35:52 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB7C2C0.1080304@doxdesk.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <4AB7C2C0.1080304@doxdesk.com>
Message-ID: <88e286470909211935t3c6a910cx697e5b089c162c22@mail.gmail.com>

Armin is fast asleep now, so my shift. :-)

He did point me to this specific email for closer attention, indicating
issues with QUERY_STRING and wsgi.uri_encoding due to something
mentioned here. I didn't quite get what he was talking about, but then I
believe he has some wrong statements in his PEP-XXX about QUERY_STRING.

I'll make a few of my own comments about this email, and then maybe
those who are still awake can help in understanding issues raised here.

2009/9/22 And Clover:
> Armin Ronacher wrote:
>
>> The middleware can never know.
>
> It's much more likely to know than the server though!
>
>> WSGI will demand UTF-8 URLs and only
>> provide iso-XXX support for backwards compatibility.
>
> It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs
> break as soon as they coincidentally happen to be UTF-8 byte sequences. I'm
> as much an advocate of "UTF-8 for everything everywhere!" as anyone else,
> but unfortunately today there are still dark places where you need non-UTF-8
> URLs.

The URLs don't break. As mentioned elsewhere, but perhaps not overly
clearly, if it is known that an application or some subset of URLs will
always be receiving requests as non UTF-8, then it should employ code in
those cases to always transcode to the required encoding.
Thus something like:

  import codecs

  iso_8859_7 = codecs.lookup('iso-8859-7')

  def redecode(string, encoding):
      return string.encode(encoding).decode('iso-8859-7')

  if codecs.lookup(environ['wsgi.uri_encoding']) != iso_8859_7:
      environ['PATH_INFO'] = redecode(environ['PATH_INFO'],
          environ['wsgi.uri_encoding'])
      environ['SCRIPT_NAME'] = redecode(environ['SCRIPT_NAME'],
          environ['wsgi.uri_encoding'])
      environ['wsgi.uri_encoding'] = 'iso-8859-7'

This could be a part of the actual application if needing to be
selective based on URLs, or as a WSGI middleware that can adjust it
and which wraps the WSGI application.

The other fallback is that a specific WSGI server could elect to
provide an option to not use 'UTF-8' as the first choice for decoding
and instead use a user supplied value via the WSGI servers
configuration. Robert already showed as pseudo code what the WSGI
server would do:

  try:
      decode_uri(userdefault or 'utf-8')
  except UnicodeDecodeError:
      decode_uri('iso-8859-1')

For a pure Python WSGI server, which effectively only supports
mounting at root of site, then this may apply to whole site. In
Apache/mod_wsgi however, where using Location directive in Apache one
can easily apply configuration to a sub set of URLs, one could be more
selective. It gets more complicated when one talks about composition
of disparate WSGI components as part of an application stack.

Now, although having the configuration be done outside of the WSGI
application and in the web server will not appeal to some, it still
may be a useful fallback for where people don't want to have to fiddle
with using WSGI middleware wrappers around their whole application or
around individual components to do it.

Anyway, there are multiple options here.

> Incidentally, if wsgi.uri_encoding is going to be the way to signal that the
> server has decoded bytes to characters using a known encoding, it should be
> stressed that this should only be set when that encoding is certain.
> > That is, wsgi.uri_encoding should be omitted (or None?) in cases where > another party has already decoded (and maybe mangled) the bytes using an > unknown encoding. In particular, CGI. Yes, it is known that CGI and Python 3.X will be a problem. There has been a number of discussions which raised the CGI issues in the past. This time around we were possibly ignoring it for time being so that CGI script compatibility wasn't going to exclusively override us trying to make something that would work sanely for more up to date hosting methods. So, yes, having wsgi.uri_encoding be set to None for where not able to be determined what encoding is would be sensible. It may be the case that in such situations the only thing people can portably rely on is being able to use ASCII. If they know for sure what is used, they could set wsgi.uri_encoding themselves in a WSGI middleware wrapper around their application, or CGI/WSGI adapter could provide an option to allow user to set it so WSGI adapter uses user value but otherwise leaves the variables as they were. > (In the case of Windows CGI the server will have decoded URI bytes into > Unicode characters, using a charset which it is impossible to find out. In > Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF > sequence, otherwise it's the system codepage. This problem affects the > non-CGI implementation isapi_wsgi, too. Then the variables are read as > environment variables, which for Python 2 means another encode/decode step > on Windows using the system codepage, mangling non-codepage characters. > Python 3 has the opposite problem reading byte envvars using UTF-8, which > won't be how Apache put them there.) > > If wsgi.encoding is obligatory then in reality it will often be wrong, > leaving us in the same pathetic predicament as with WSGI 1.0, where > non-ASCII URIs don't work reliably at all. 
I'll have to research more about this, or at least the claims about
Apache, as I'm not entirely sure that is correct.

Whether surrogateescape gives a better solution I have no idea at this
point, as I haven't had a chance to delve into it enough to understand
it and no one has posted a good summary of it with actual descriptive
examples of how it would work for Python 2.X/3.X. The comments about
it have all assumed to a degree that you understand what it is in the
first place, which is slightly annoying. Can someone perhaps give such
a clear description with examples, or give a reference to the record
in the Google Groups archive where, in the long email chain, the
dummies guide for use of surrogateescape in WSGI was posted.

Now, Armin was for some reason concerned about QUERY_STRING and
wsgi.uri_encoding after reading your email. I'm still not sure why.

In my original blog post I talked about QUERY_STRING being dealt with
along with SCRIPT_NAME and PATH_INFO as far as determining what
wsgi.uri_encoding would be. Armin pointed out that QUERY_STRING by
rights should only contain ASCII and so doesn't need to come into that
and could be converted straight to unicode as ASCII or possibly
ISO-8859-1 depending I think on which RFC you believe.

Even so, in PEP-XXX it says:

"""
For the keys ``SCRIPT_NAME``, ``PATH_INFO`` (and ``REQUEST_URI`` if
available but that variable will most likely only contain ASCII
characters because it is quoted) the server has to use the following
algorithm for decoding:

- it decodes all values as `utf-8`.
- if that fails, it decodes all values as `iso-8859-1`.

The encoding the server used to decode the value is then stored in
``'wsgi.uri_encoding'``. The application MUST use this value to
decode the ``'QUERY_STRING'`` as well.
"""

I.e., no mention of QUERY_STRING in first part, but then says that
QUERY_STRING must be decoded with that as well.
To say that doesn't seem right and in some respects QUERY_STRING can
stand distinct from SCRIPT_NAME and PATH_INFO much as no special
treatment is being given to HTTP_COOKIE and HTTP_REFERER.

If REQUEST_URI is supposed to be ASCII as well, then shouldn't it be
distinct as well.

Thus, wsgi.uri_encoding would only apply to SCRIPT_NAME and PATH_INFO.
Although, when it comes down to just these two, also perhaps read my
concerns about different encodings being applied in each as per my
original blog post.

The problem which arises is that unquoting of URLs in Python 3.X
stdlib can only be done on unicode strings. If though a string
contains non UTF-8 encoded characters it can fail.

  >>> urllib.parse.parse_qsl('a=b%e0')
  [('a', 'b\ufffd')]

Or at least it shoves in characters indicating not a UTF-8 character.
So, stdlib effectively forces UTF-8.

This seems to be a deficiency in Python 3.X stdlib and was something I
believed we already knew about. I think Robert said he already had
some code to do this that would work.

Until Armin wakes up and explains what he saw about QUERY_STRING that
would break wsgi.uri_encoding, maybe someone can clarify how
QUERY_STRING is going to be handled if stdlib doesn't work.

Graham

From fumanchu at aminus.org Tue Sep 22 04:40:54 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 21 Sep 2009 19:40:54 -0700
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090921230952.GA13477@banane.novuscom.net>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
Message-ID:

Henry Precheur wrote:
> On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
> > It looks simpler until you have a site that is not primarily utf-8.
> > In that case, you multiply your (1 line * number of middlewares in the
> > WSGI
> > stack * each request).
> > With wsgi.uri_encoding you get either (1 line * 1 > > middleware designed to transcode * each request), or even 0 if your > > whole site uses just one charset. > > I am not sure I understand your point. > > The 0 lines hold true if the whole site is using latin-1 or utf-8 and > you write your applications/middlewares only for this site. But if it's > using any other encoding you still have to transcode. > > def middleware(start_response, environ): > value = environ['some_key'].\ > encode('utf8', 'surrogateescape').\ > decode(SITE_ENCODING) > ... Yes; you have to transcode to the "correct" encoding. Once. Then every other WSGI application interface "below" that one doesn't have to care. > With wsgi.uri_encoding you would still have to do the following: > > def middleware(start_response, environ): > value = environ['some_key'].\ > encode(environ['some_key.encoding']).\ > decode(SITE_ENCODING) > ... > > Of course you can directly use `environ['some_key']` if you know you'll > get the 'right' encoding all the time. But when the encoding changes, > you'll have to fix all your middlewares. The decoding doesn't change spontaneously. You either get the correct one or you get an incorrect one. If it's incorrect, you fix it, one time, via a WSGI component which you've configured to determine the "correct" decoding. Then every other WSGI component "below" that one can go back to trusting the decoding was correct. In fact, if you do that transcoding right away, no other WSGI components need to be rewritten to take advantage of unicode. You just have to deploy a single transcoder, that's 6 lines of code max. I know PJE will chime in here and say you can't deploy a website that works differently if you happen to forget to turn on a given piece of middleware, but I also know the rest of you will drown him out from personal experience because you've *never* done that. 
;) With utf8+surrogateescape, you don't transcode once, you transcode in every WSGI component in your stack that needs to "correct" the decoding. You have to do it more than once because, each time you encode/re-decode, you use the result and then throw it away. Any subsequent WSGI components have to encode/re-decode--you cannot store the redecoded URI in SCRIPT_NAME/PATH_INFO, because the utf8+surrogateescape scheme says...well, it's always utf8-decoded. In addition, *every* component that needs to compare URI's then has to be configured with the same logic, however convoluted, to perform the "correct" decoding again. It's not just routing middleware: caches need to reliably compare decoded URI's; so do sessions; so does auth (especially!); so do static files. And Heaven forfend you actually decode differently in two different components! Robert Brewer fumanchu at aminus.org From mdipierro at cs.depaul.edu Tue Sep 22 05:50:44 2009 From: mdipierro at cs.depaul.edu (Massimo Di Pierro) Date: Mon, 21 Sep 2009 22:50:44 -0500 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com> <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com> Message-ID: Thanks Graham. I had missed it. Massimo On Sep 21, 2009, at 9:26 PM, Graham Dumpleton wrote: > 2009/9/22 Mark Nottingham : >> Reference? > > See: > > http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html > > Anyone else jumping in on this conversation with their own opinions > and who has not read it, should perhaps at least read that. 
Also read > some of the earlier posts in the numerous discussions this spawned at: > > http://groups.google.com/group/python-web-sig?lnk= > > as the current thinking isn't exactly what I blogged about and has > shifted a bit as the discussion has progressed. > > Graham > >> On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote: >> >>> 2009/9/22 Mark Nottingham : >>>> >>>> Most things is not the Web. How will you handle serving images >>>> through >>>> WSGI? >>>> Compressed content? PDFs? >>> >>> You are perhaps misunderstanding something. A WSGI application still >>> should return bytes. >>> >>> The whole concept of any sort of fallback to allow unicode data to >>> be >>> returned for response content was purely so the canonical hello >>> world >>> application as per Python 2.X could still be used on Python 3.X. >>> >>> So, we aren't saying that the only thing WSGI applications can >>> return >>> is unicode strings for response content. >>> >>> Have you read my original blog post that triggered all this >>> discussion >>> this time around? >>> >>> Graham >>> >>>> On 22/09/2009, at 1:30 AM, Ren? Dudfield wrote: >>>> >>>>> here is a summary: >>>>> Apart from python3 compatibility(which should be good enough >>>>> reason), utf-8 is what's used in http a lot these days. Most >>>>> things >>>>> layered on top of wsgi are using utf-8 (django etc), and lots of >>>>> web >>>>> clients are using utf-8 (firefox etc). >>>>> >>>>> Why not move to unicode? 
>>>> >>>> >>>> -- >>>> Mark Nottingham http://www.mnot.net/ >>>> >>>> _______________________________________________ >>>> Web-SIG mailing list >>>> Web-SIG at python.org >>>> Web SIG: http://www.python.org/sigs/web-sig >>>> Unsubscribe: >>>> >>>> http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com >>>> >> >> >> -- >> Mark Nottingham http://www.mnot.net/ >> >> > _______________________________________________ > Web-SIG mailing list > Web-SIG at python.org > Web SIG: http://www.python.org/sigs/web-sig > Unsubscribe: http://mail.python.org/mailman/options/web-sig/mdipierro%40cti.depaul.edu From henry at precheur.org Tue Sep 22 06:09:36 2009 From: henry at precheur.org (Henry Precheur) Date: Mon, 21 Sep 2009 21:09:36 -0700 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <20090921205218.GA222@banane.novuscom.net> <20090921230952.GA13477@banane.novuscom.net> Message-ID: <20090922040936.GA30054@banane.novuscom.net> On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote: > The decoding doesn't change spontaneously. > You either get the correct one or you get an incorrect one. If it's > incorrect, you fix it, one time, via a WSGI component which you've > configured to determine the "correct" decoding. Then every other WSGI > component "below" that one can go back to trusting the decoding was > correct. In fact, if you do that transcoding right away, no other WSGI > components need to be rewritten to take advantage of unicode. You just > have to deploy a single transcoder, that's 6 lines of code max. And you can do that with utf8+surrogateescape too. Except that you don't have to determine what encoding the gateway sent you, it's always utf8+surrogateescape. > With utf8+surrogateescape, you don't transcode once, you transcode in > every WSGI component in your stack that needs to "correct" the > decoding. 
You have to do it more than once because, each time you
> encode/re-decode, you use the result and then throw it away. Any
> subsequent WSGI components have to encode/re-decode--you cannot store
> the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
> utf8+surrogateescape scheme says...well, it's always utf8-decoded.

You don't get something REALLY important with surrogateescape: You can
ALWAYS get the original bytes back.

    >>> b = b'fran\xe7cois'
    >>> s = b.decode('utf8', 'surrogateescape')
    >>> s
    'fran\udce7cois'
    >>> s.encode('utf8', 'surrogateescape')
    b'fran\xe7cois'

See? I got my latin-1 character '\xe7' back! Because '\udce7' is not a
normal UTF-8 character, this character uses some 'free space' in the
unicode supplementary characters.

The only thing you have to do is to pass 'surrogateescape' each time you
call encode/decode.

> In addition, *every* component that needs to compare URI's then has to
> be configured with the same logic, however convoluted, to perform the
> "correct" decoding again. It's not just routing middleware: caches
> need to reliably compare decoded URI's; so do sessions; so does auth
> (especially!); so do static files. And Heaven forfend you actually
> decode differently in two different components!

I don't understand why I would need to throw away the decoded string.

This works perfectly well as far as I know:

    environ['PATH_INFO'] = environ['PATH_INFO'].\
          encode('utf8', 'surrogateescape').\
          decode(SITE_ENCODING)

utf8+surrogateescape provides the same possibilities as
wsgi.uri_encoding. You can transcode without losing information when you
know what the correct encoding is. But utf8+surrogateescape is simpler
because there's no need to pass around the name of the encoding in an
additional variable.
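
A minimal sketch of the transcode-once step described above, assuming
Python 3.1+ (where the surrogateescape handler from PEP 383 is
available) and a site whose URLs are really ISO-8859-1. SITE_ENCODING
and transcode_path() are names made up for illustration; they are not
part of WSGI or of any proposal in this thread. An undecodable byte
such as 0xE7 comes through as the lone surrogate code point U+DCE7,
which is what lets the original bytes be recovered:

```python
# Sketch only: SITE_ENCODING and transcode_path() are hypothetical names.
SITE_ENCODING = 'iso-8859-1'


def transcode_path(value, site_encoding=SITE_ENCODING):
    """Recover the original bytes from a utf8+surrogateescape string,
    then decode them with the encoding the site actually uses."""
    raw = value.encode('utf-8', 'surrogateescape')
    return raw.decode(site_encoding)


# Assumed gateway behaviour: the raw latin-1 request bytes are decoded
# as utf8+surrogateescape, so the 0xE7 byte becomes a lone surrogate.
raw_path = b'/fran\xe7ois'
environ = {'PATH_INFO': raw_path.decode('utf-8', 'surrogateescape')}
assert environ['PATH_INFO'] == '/fran\udce7ois'

# Transcode once and store the result back in the environ.
environ['PATH_INFO'] = transcode_path(environ['PATH_INFO'])
assert environ['PATH_INFO'] == '/fran\xe7ois'
```

Whether components further down the stack may rely on the stored
value, or must repeat the step themselves, is exactly the point under
dispute here; the sketch only shows that the round trip loses nothing.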
--
  Henry Précheur

From graham.dumpleton at gmail.com Tue Sep 22 06:30:06 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 14:30:06 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090922040936.GA30054@banane.novuscom.net>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <20090922040936.GA30054@banane.novuscom.net>
Message-ID: <88e286470909212130h14952357sc2ab7dfc5e5a4499@mail.gmail.com>

2009/9/22 Henry Precheur :
> On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote:
>> The decoding doesn't change spontaneously.
>> You either get the correct one or you get an incorrect one. If it's
>> incorrect, you fix it, one time, via a WSGI component which you've
>> configured to determine the "correct" decoding. Then every other WSGI
>> component "below" that one can go back to trusting the decoding was
>> correct. In fact, if you do that transcoding right away, no other WSGI
>> components need to be rewritten to take advantage of unicode. You just
>> have to deploy a single transcoder, that's 6 lines of code max.
>
> And you can do that with utf8+surrogateescape too. Except that you don't
> have to determine what encoding the gateway sent you, it's always
> utf8+surrogateescape.
>
>> With utf8+surrogateescape, you don't transcode once, you transcode in
>> every WSGI component in your stack that needs to "correct" the
>> decoding. You have to do it more than once because, each time you
>> encode/re-decode, you use the result and then throw it away. Any
>> subsequent WSGI components have to encode/re-decode--you cannot store
>> the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
>> utf8+surrogateescape scheme says...well, it's always utf8-decoded.
>
> You don't get something REALLY important with surrogateescape: You can
> ALWAYS get the original bytes back.
>
>    >>> b = b'fran\xe7cois'
>    >>> s = b.decode('utf8', 'surrogateescape')
>    >>> s
>    'fran\udce7cois'
>    >>> s.encode('utf8', 'surrogateescape')
>    b'fran\xe7cois'

Hooray, an example finally which shows what the data looks like.

If one reads:

  http://www.python.org/dev/peps/pep-0383/

there is no actual example in it which shows what is actually in the
unicode string. So unless you go play with the code it is hard to
understand what is actually happening.

Yeah, yeah, I may be slow to get things but I don't have the time to
go playing with every suggestion. ;-)

Note, still not saying whether surrogateescape is good or not, but
this is helping me to understand.

Someone did say something about being able to half make it work on
Python 2.X. Can someone properly provide example code for Python 2.X.

If we want uniformity in how interface works on Python 2.X and 3.X,
then we have to be able to use same method without tricks. This is why
wsgi.uri_encoding at the moment seems better, as not reliant on a
feature only in Python 3.1+.

Graham

> See? I got my latin-1 character '\xe7' back! Because '\udce7' is not a
> normal UTF-8 character, this character uses some 'free space' in the
> unicode supplementary characters.
>
> The only thing you have to do is to pass 'surrogateescape' each time you
> call encode/decode.
>
>> In addition, *every* component that needs to compare URI's then has to
>> be configured with the same logic, however convoluted, to perform the
>> "correct" decoding again. It's not just routing middleware: caches
>> need to reliably compare decoded URI's; so do sessions; so does auth
>> (especially!); so do static files. And Heaven forfend you actually
>> decode differently in two different components!
>
> I don't understand why I would need to throw away the decoded string.
>
> This works perfectly well as far as I know:
>
>    environ['PATH_INFO'] = environ['PATH_INFO'].\
>          encode('utf8', 'surrogateescape').\
>          decode(SITE_ENCODING)
>
> utf8+surrogateescape provides the same possibilities as
> wsgi.uri_encoding.
You can transcode without losing information when you
> know what the correct encoding is. But utf8+surrogateescape is simpler
> because there's no need to pass around the name of the encoding in an
> additional variable.
>
> --
>  Henry Précheur
> _______________________________________________
> Web-SIG mailing list
> Web-SIG at python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com
>

From pje at telecommunity.com Tue Sep 22 06:33:05 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Tue, 22 Sep 2009 00:33:05 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
Message-ID: <20090922043305.9913B3A407A@sparrow.telecommunity.com>

At 07:40 PM 9/21/2009 -0700, Robert Brewer wrote:
>Yes; you have to transcode to the "correct" encoding. Once. Then every
>other WSGI application interface "below" that one doesn't have to care.

You can only do that if you *break encapsulation*, which as I said
earlier is voiding the entire point of having a modular interface.

Having a configurable encoding just means that *every* WSGI
application *must* verify the encoding in order to be safe.

I'm all in favor of making everyone suffer equally, but all else
being equal, I'd prefer them to suffer idempotently rather than
conditionally. ;-)

From pje at telecommunity.com Tue Sep 22 06:33:10 2009
From: pje at telecommunity.com (P.J.
Eby) Date: Tue, 22 Sep 2009 00:33:10 -0400 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> Message-ID: <20090922043314.4F4D53A415F@sparrow.telecommunity.com> At 07:21 PM 9/21/2009 -0700, Robert Brewer wrote: >I've never proposed that WSGI make choices for people. I'm simply >saying that a configurable server, with a sane, perfectly-reversible >default, is the simplest thing that could possibly work. Actually, latin-1 bytes encoding is the *simplest* thing that could possibly work, since it works already in e.g. Jython, and is actually in the spec already... and any framework that wants unicode URIs already has to decode them, so the code is already written. From graham.dumpleton at gmail.com Tue Sep 22 06:49:36 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Tue, 22 Sep 2009 14:49:36 +1000 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <20090922043314.4F4D53A415F@sparrow.telecommunity.com> References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> Message-ID: <88e286470909212149kcfaef27s5085acf7c55b7ee2@mail.gmail.com> 2009/9/22 P.J. Eby : > At 07:21 PM 9/21/2009 -0700, Robert Brewer wrote: >> >> I've never proposed that WSGI make choices for people. I'm simply saying >> that a configurable server, with a sane, perfectly-reversible default, is >> the simplest thing that could possibly work. 
>
> Actually, latin-1 bytes encoding is the *simplest* thing that could possibly
> work, since it works already in e.g. Jython, and is actually in the spec
> already...

Except to the extent I originally pointed out, that comparing Jython
to Python 3.0 isn't necessarily appropriate because Python 3.0 ended
up with its own bytes type. :-)

Ignoring that, I still see some validity in it given that the
complaints originally made by people that they wanted bytes everywhere
were made before they realised how much of a pain bytes were. For
example, Armin has turned right around now and accepts that bytes
isn't going to work.

Armin's starting point though was the proposal of trying to be smart
about encoding to try and satisfy the bytes everywhere camp's
concerns. Thus he was looking more at the issues around
wsgi.uri_encoding and how to make that work, rather than perhaps
whether it was strictly needed and whether latin-1 would work fine.

> and any framework that wants unicode URIs already has to decode
> them, so the code is already written.

Graham

From ianb at colorstudy.com Tue Sep 22 07:09:17 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 22 Sep 2009 00:09:17 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
Message-ID:

On Mon, Sep 21, 2009 at 6:16 PM, Graham Dumpleton <
graham.dumpleton at gmail.com> wrote:

> > Of course you can directly use `environ['some_key']` if you know you'll
> > get the 'right' encoding all the time. But when the encoding changes,
> > you'll have to fix all your middlewares.
> >
> >
> > I am missing something?
>
> For one, we aren't talking about arbitrary keys needing this treatment.
>
> We are only talking about SCRIPT_NAME and PATH_INFO.
> OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO, and introduce two equivalent variables that hold the NOT url-decoded values. So if you request /fran%e7cois then environ['PATH_INFO_RAW'] is '/fran%e7cois'. This will be quite disruptive, as these are variables that are frequently accessed directly (libraries that expose them as attributes can just turn them into properties that do URL decoding, using UTF8). But it's an easy fix at least. I would actually want to specify that if we added this key, we should disallow the old keys -- terrible confusion could ensue from both in the environ. This also fixes the problem with not being able to distinguish %2F from /, which isn't a big problem but is annoying, and is hiding meaningful information. (I believe the relevant spec does distinguish between these two values -- i.e., ideally decoding should happen on path segments, each segment separated by a real /.) If we do that, then the only really tricky thing left is HTTP_COOKIE, and since the Cookie header is a mess then HTTP_COOKIE will be a mess and we just have to figure out a hacky way to deal with that. Maybe surrogateescape, but probably just Latin1 would be fine (and easy to do in Python 2). -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From graham.dumpleton at gmail.com Tue Sep 22 07:21:13 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 15:21:13 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
Message-ID: <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>

2009/9/22 Ian Bicking :
> On Mon, Sep 21, 2009 at 6:16 PM, Graham Dumpleton
> wrote:
>>
>> > Of course you can directly use `environ['some_key']` if you know you'll
>> > get the 'right' encoding all the time. But when the encoding changes,
>> > you'll have to fix all your middlewares.
>> >
>> >
>> > I am missing something?
>>
>> For one, we aren't talking about arbitrary keys needing this treatment.
>>
>> We are only talking about SCRIPT_NAME and PATH_INFO.
>
> OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO, and
> introduce two equivalent variables that hold the NOT url-decoded values. So
> if you request /fran%e7cois then environ['PATH_INFO_RAW'] is '/fran%e7cois'.
> This will be quite disruptive, as these are variables that are frequently
> accessed directly (libraries that expose them as attributes can just turn
> them into properties that do URL decoding, using UTF8). But it's an easy
> fix at least. I would actually want to specify that if we added this key,
> we should disallow the old keys -- terrible confusion could ensue from both
> in the environ. This also fixes the problem with not being able to
> distinguish %2F from /, which isn't a big problem but is annoying, and is
> hiding meaningful information. (I believe the relevant spec does
> distinguish between these two values -- i.e., ideally decoding should happen
> on path segments, each segment separated by a real /.)
> If we do that, then the only really tricky thing left is HTTP_COOKIE, and
> since the Cookie header is a mess then HTTP_COOKIE will be a mess and we
> just have to figure out a hacky way to deal with that. Maybe
> surrogateescape, but probably just Latin1 would be fine (and easy to do in
> Python 2).

That may be fine for pure Python web servers where you control the
split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place
but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as
that is done by the web server. Also, as pointed out in my blog,
because of rewrites in web server, it may be difficult to try and map
SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and
reclaim original characters. There is also the problem that often
FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and
manual overrides needed to tweak them.

Graham

From ianb at colorstudy.com Tue Sep 22 07:31:20 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 22 Sep 2009 00:31:20 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
 <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>
Message-ID:

On Tue, Sep 22, 2009 at 12:21 AM, Graham Dumpleton <
graham.dumpleton at gmail.com> wrote:

>
> That may be fine for pure Python web servers where you control the
> split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place
> but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as
> that is done by the web server. Also, as pointed out in my blog,
> because of rewrites in web server, it may be difficult to try and map
> SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and
> reclaim original characters.
There is also the problem that often > FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and > manual overrides needed to tweak them. When things get messed up I recommend people use a middleware (paste.deploy.config.PrefixMiddleware, though I don't really care what they use) to fix up the request to be correct. Pulling it from REQUEST_URI would be fine. Also, at worst, you can do environ['SCRIPT_NAME_RAW'] = urllib.quote(environ.pop('SCRIPT_NAME')). It sucks, but if that's all the information you have, then that's all the information you have. Or try to get the information from REQUEST_URI the hard way, once at the gateway level. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker -------------- next part -------------- An HTML attachment was scrubbed... URL: From graham.dumpleton at gmail.com Tue Sep 22 07:38:21 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Tue, 22 Sep 2009 15:38:21 +1000 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <20090921205218.GA222@banane.novuscom.net> <20090921230952.GA13477@banane.novuscom.net> <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com> <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com> Message-ID: <88e286470909212238w2d2a731v2272fdb9a3d03c9c@mail.gmail.com> 2009/9/22 Ian Bicking : > On Tue, Sep 22, 2009 at 12:21 AM, Graham Dumpleton > wrote: >> >> That may be fine for pure Python web servers where you control the >> split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place >> but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as >> that is done by the web server. Also, as pointed out in my blog, >> because of rewrites in web server, it may be difficult to try and map >> SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and >> reclaim original characters. 
>> There is also the problem that often
>> FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and
>> manual overrides needed to tweak them.
>
> When things get messed up I recommend people use a middleware
> (paste.deploy.config.PrefixMiddleware, though I don't really care what they
> use) to fix up the request to be correct. Pulling it from REQUEST_URI would
> be fine.
> Also, at worst, you can do environ['SCRIPT_NAME_RAW'] =
> urllib.quote(environ.pop('SCRIPT_NAME')). It sucks, but if that's all the
> information you have, then that's all the information you have. Or try to
> get the information from REQUEST_URI the hard way, once at the gateway
> level.

It is probably doable to just reverse it using the underlying raw bytes.

At least in mod_wsgi the SCRIPT_NAME/PATH_INFO split is always correct,
unless people really screw it up by using WSGIScriptAliasMatch or
AliasMatch wrongly.

If doing something like you suggest, I would prefer them as 'wsgi.'
prefixed variables, not put in the all-upper-case namespace where they
could be confused with CGI variables etc.

Graham

From ianb at colorstudy.com  Tue Sep 22 07:47:01 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 22 Sep 2009 00:47:01 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212238w2d2a731v2272fdb9a3d03c9c@mail.gmail.com>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
 <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>
 <88e286470909212238w2d2a731v2272fdb9a3d03c9c@mail.gmail.com>
Message-ID:

On Tue, Sep 22, 2009 at 12:38 AM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:

> If doing something like you suggest, would prefer them as 'wsgi.'
> prefixed variables and not put in all upper case namespace to be
> confused with CGI variables etc.
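
For reference, a minimal sketch of the two suggestions combined: the
quoting fallback, stored under a 'wsgi.' prefixed key rather than an
upper-case one. It assumes Python 3 and a gateway that hands over
SCRIPT_NAME as a native string whose code points map one-to-one onto the
original bytes (Latin-1). The key name wsgi.script_name_raw and the helper
add_raw_script_name are made up for illustration and are not part of any
WSGI spec. Unlike the one-liner above, it leaves SCRIPT_NAME in place.

```python
from urllib.parse import quote

# Hypothetical key name, chosen only for this sketch.
RAW_KEY = "wsgi.script_name_raw"


def add_raw_script_name(app):
    """Wrap a WSGI app and record a re-quoted SCRIPT_NAME under a 'wsgi.' key."""

    def middleware(environ, start_response):
        script_name = environ.get("SCRIPT_NAME", "")
        # Re-quoting is lossy: an original "%2F" and a literal "/" both
        # arrive decoded as "/", so this is a best-effort reconstruction.
        # encoding="latin-1" assumes the gateway decoded the bytes as Latin-1.
        environ.setdefault(RAW_KEY, quote(script_name, encoding="latin-1"))
        return app(environ, start_response)

    return middleware
```

A gateway that still has the real REQUEST_URI could set the same key itself,
in which case setdefault() leaves the more accurate value alone.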

I just had to make up a name, but I agree with your suggestion for wsgi.X
(we already have wsgi.url_scheme, after all).

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From pje at telecommunity.com  Tue Sep 22 07:48:17 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Tue, 22 Sep 2009 01:48:17 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212130h14952357sc2ab7dfc5e5a4499@mail.gmail.com>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <20090922040936.GA30054@banane.novuscom.net>
 <88e286470909212130h14952357sc2ab7dfc5e5a4499@mail.gmail.com>
Message-ID: <20090922054818.145AE3A407A@sparrow.telecommunity.com>

At 02:30 PM 9/22/2009 +1000, Graham Dumpleton wrote:
>Someone did say something about being able to half make it work on
>Python 2.X. Can someone properly provide example code for Python 2.X.

The issue is that error handlers on encode are only allowed to provide
substitute unicode characters, not substitute bytes. That's why it can
only "half work" on 2.x.

>If we want uniformity in how interface works on Python 2.X and 3.X,
>then we have to be able to use same method without tricks. This is why
>wsgi.uri_encoding at the moment seems better, as not reliant on a
>feature only in Python 3.1+.

If we want uniformity in the interface, then we should continue using
latin-1, which already works today. Yes, it sucks, but it sucks
*uniformly*.

There really isn't going to be a solution that satisfies *all* of the
criteria we're batting around, for *all* the users. What's happening is
that the principals are focused on different scenarios, where all their
criteria can be met at the expense of others'.

I'm tending to flip-flop a bit myself, because my goal is that *nobody*
"wins", in the sense of having an advantaged framework, server,
programming paradigm, etc. relative to others. And that means there are
more ways of doing it that would be acceptable to me. For example, all
bytes, all latin-1, all surrogateescape... I don't care all that terribly
much between them, I just want it to be uniform for everybody
using/implementing the spec. (And that also means I want it uniform across
all keys, not just the URI ones; I don't want to have to remember which
ones are special cases.)

If some people need to do more code because of their particular codec
requirements, that's okay by me, as long as it's *unconditional* code that
doesn't depend on some sort of configuration rigamarole. That makes the
spec brittle, because nobody's going to test their edge cases, and then
the consumers of the code are gonna be the ones getting screwed over.

Frankly, 90% of WSGI code written will never even check the wsgi.version
number, so why would we think anybody's going to actually check
wsgi.uri_encoding? That's just building in the suck from day one. No
offense intended to the proposer of it; it's a fine solution for a single
project's API, but it's just not going to scale.

We already know this, because most WSGI code written is not to spec. The
ones of us here in the room talking about this are *not* good examples of
average WSGI programmers, because (hopefully) we've all at least studied
the spec and endeavored to fully grok and conform to it. (Hell, an
unfortunately large number of people think you're supposed to use write()
or yield to send *individual lines* of text.)

So you better believe that everybody else is going to copy the worst
available examples of other people's WSGI code and ignore any
documentation associated with it... and then they will expect it to work
on your server. ;-)

Thus, our target audience is people who will rotely copy... which means we
need an API they can either copy by rote, or know is wrong when they get
an error message.
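
A rough sketch of what "copy by rote" could look like under the all-latin-1
option, assuming Python 3 and a gateway that decodes every environ value as
latin-1. The helper name recode is made up for illustration, and the utf-8
default is an assumption about what the application wants, not something
WSGI promises.

```python
def recode(value, encoding="utf-8"):
    """Reinterpret a latin-1-decoded environ value in the app's own encoding."""
    # latin-1 maps code points 0-255 one-to-one onto bytes, so this recovers
    # the bytes that were on the wire (assuming the gateway used latin-1).
    raw = value.encode("latin-1")
    # errors="replace" keeps the step unconditional: undecodable input turns
    # into U+FFFD instead of raising something the caller must handle.
    return raw.decode(encoding, "replace")
```

The same two steps apply to any string-valued environ key, so there is no
per-key rule to remember and no configuration flag to check.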
Conditionals and error handling are too much to ask of them, as is
remembering different rules for different environ keys that all kind of
look alike. (There's a reason we required ALL_CAPS keys to be the same
type in the first spec.)

From mnot at mnot.net  Tue Sep 22 08:06:26 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 16:06:26 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com>
 <20090921151951.2BD973A403D@sparrow.telecommunity.com>
 <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
Message-ID:

OK, that's quite exhaustive.

For the benefit of those of us jumping in, could you summarise your
proposal in something like the following manner:

1. How the request method is made available to WSGI applications
2. How the request-uri is made available to WSGI applications -- in
   particular, whether any decoding of punycode and/or %-escapes happens
3. How request headers are made available to WSGI apps
4. How the request body is made available to WSGI apps
5. Likewise for how apps should expose the response status message,
   headers and body to WSGI implementations.

Cheers,

On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:

> 2009/9/22 Mark Nottingham :
>> Reference?
>
> See:
>
> http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html
>
> Anyone else jumping in on this conversation with their own opinions
> and who has not read it, should perhaps at least read that. Also read
> some of the earlier posts in the numerous discussions this spawned at:
>
> http://groups.google.com/group/python-web-sig?lnk=
>
> as the current thinking isn't exactly what I blogged about and has
> shifted a bit as the discussion has progressed.
>
> Graham
>
>> On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:
>>
>>> 2009/9/22 Mark Nottingham :
>>>>
>>>> Most things is not the Web. How will you handle serving images
>>>> through WSGI? Compressed content? PDFs?
>>>
>>> You are perhaps misunderstanding something. A WSGI application still
>>> should return bytes.
>>>
>>> The whole concept of any sort of fallback to allow unicode data to be
>>> returned for response content was purely so the canonical hello world
>>> application as per Python 2.X could still be used on Python 3.X.
>>>
>>> So, we aren't saying that the only thing WSGI applications can return
>>> is unicode strings for response content.
>>>
>>> Have you read my original blog post that triggered all this discussion
>>> this time around?
>>>
>>> Graham
>>>
>>>> On 22/09/2009, at 1:30 AM, René Dudfield wrote:
>>>>
>>>>> here is a summary:
>>>>> Apart from python3 compatibility(which should be good enough
>>>>> reason), utf-8 is what's used in http a lot these days. Most things
>>>>> layered on top of wsgi are using utf-8 (django etc), and lots of web
>>>>> clients are using utf-8 (firefox etc).
>>>>>
>>>>> Why not move to unicode?

--
Mark Nottingham     http://www.mnot.net/

From graham.dumpleton at gmail.com  Tue Sep 22 08:07:12 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 16:07:12 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090922054818.145AE3A407A@sparrow.telecommunity.com>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <20090922040936.GA30054@banane.novuscom.net>
 <88e286470909212130h14952357sc2ab7dfc5e5a4499@mail.gmail.com>
 <20090922054818.145AE3A407A@sparrow.telecommunity.com>
Message-ID: <88e286470909212307y4e1614b7gabd132100dbb509f@mail.gmail.com>

2009/9/22 P.J. Eby :
> I'm tending to flip-flop a bit myself

For the record, I am doing that as well.

Graham

From graham.dumpleton at gmail.com  Tue Sep 22 08:07:58 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 16:07:58 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com>
 <20090921151951.2BD973A403D@sparrow.telecommunity.com>
 <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
Message-ID: <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>

2009/9/22 Mark Nottingham :
> OK, that's quite exhaustive.
>
> For the benefit of those of us jumping in, could you summarise your proposal
> in something like the following manner:
>
> 1. How the request method is made available to WSGI applications
> 2. How the request-uri is made available to WSGI applications -- in
> particular, whether any decoding of punycode and/or %-escapes happens
> 3. How request headers are made available to WSGI apps
> 4. How the request body is made available to WSGI apps
> 5. Likewise for how apps should expose the response status message, headers
> and body to WSGI implementations.

Same as the WSGI PEP.

  http://www.python.org/dev/peps/pep-0333/

Nothing has changed in that respect.

Graham

From mnot at mnot.net  Tue Sep 22 08:25:02 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 16:25:02 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com>
 <20090921151951.2BD973A403D@sparrow.telecommunity.com>
 <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
Message-ID:

So, what advice do you propose about decoding bytes into strings for the
request-URI / method / request headers, and vice versa for response
headers and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8?
How are errors handled?

Are bodies still treated "as binary byte sequences", as per PEP 333?

Cheers,

--
Mark Nottingham     http://www.mnot.net/

From graham.dumpleton at gmail.com  Tue Sep 22 08:36:55 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 16:36:55 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com>
 <20090921151951.2BD973A403D@sparrow.telecommunity.com>
 <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
Message-ID: <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>

2009/9/22 Mark Nottingham :
> So, what advice do you propose about decoding bytes into strings for the
> request-URI / method / request headers, and vice versa for response headers
> and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8? How are
> errors handled?
>
> Are bodies still treated "as binary byte sequences", as per PEP 333?

I thought my blog post explained that reasonably well. Ensure you read
the numbered definitions.

If you can't work it out from the blog, point at the specific thing in
the blog you don't understand and I can help. I don't really want to go
explaining it all again.

Graham

From mnot at mnot.net  Tue Sep 22 08:41:29 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 16:41:29 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com>
 <20090921151951.2BD973A403D@sparrow.telecommunity.com>
 <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
 <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>
Message-ID: <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net>

That blog entry is eleven printed pages. Given that PEP 333 also prints as
eleven pages from my browser, I suspect there's some extraneous
information in there.

Could you please summarise? Requiring all comers to read such a voluminous
entry is a considerable (and somewhat arbitrary) bar to entry for the
discussion.

Thanks,

--
Mark Nottingham     http://www.mnot.net/

From ianb at colorstudy.com  Tue Sep 22 08:43:23 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 22 Sep 2009 01:43:23 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com>
 <20090921151951.2BD973A403D@sparrow.telecommunity.com>
 <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
Message-ID:

It's not a specific proposal, but here are my opinions on what a proposal
should be:

On Tue, Sep 22, 2009 at 1:06 AM, Mark Nottingham wrote:

> OK, that's quite exhaustive.
>
> For the benefit of those of us jumping in, could you summarise your
> proposal in something like the following manner:
>
> 1. How the request method is made available to WSGI applications
>

Graham talked about it as bytes/unicode/native, where native is unicode on
Python 3 and str on Python 2. For instance, I think there's general
consensus (though not really specifically discussed) that environ keys
should be native. I think method should be native.

> 2. How the request-uri is made available to WSGI applications -- in
> particular, whether any decoding of punycode and/or %-escapes happens
>

Hah, didn't even think about de-punycoding HTTP_HOST. That'd be a blast.

I think:

* scheme as native
* HTTP_HOST as native (no decoding of punycode)
* path as native (no URL decoding) - big break with WSGI 1 and CGI, but
  what the hell. I could easily waffle on this.
* query string as native - *should* be ASCII-safe currently.

Wow, that was easy!

Request headers, which you didn't split out... those I'm not sure. I'd
*like* them to be native. But damn, I'm just not sure quite how.
surrogateescape? Latin1? Latin1 as a kind of poor man's surrogateescape
isn't so bad. And the headers *should* be ASCII for sane requests, so it's
not a horrible compromise. I guess libraries could lazily transcode, just
like they currently lazily decode. But it'd be a bit obnoxious at the
library level. Transcoding middleware would be easier, but it adds the
question of how to record that the transcoding has taken place.

> 3. How request headers are made available to WSGI apps
>

Request handlers? I don't understand your terminology.

> 4. How the request body is made available to WSGI apps
>

Ugh. wsgi.input could remain. I think at least it should become a
file-like interface (i.e., giving an empty string when the content is
exhausted) and I might even ask that it implement .tell() (.seek() would
be nice of course, but optional). If there was some other idea, I think
there's room for improvement on wsgi.input and the file interface.

wsgi.input should definitely work with bytes only. I believe this is
consensus.

> 5. Likewise for how apps should expose the response status message, headers
> and body to WSGI implementations.
>

I believe there is consensus that the response body should remain an
iterator that yields bytes.

In one way, it'd be nice if we'd just say that status/headers should be
ASCII, because that's the reasonable choice.
But for proxying or representing "HTTP as it is", it's not always the
case. And I'm committed to keeping WSGI fully capable of representing
arbitrary requests and responses so long as they aren't entirely
diabolical.

But, an ASCII status is not unreasonable, especially since there's zero
semantic meaning to the reason. Which makes native strings perfectly fine.

So, headers... Well, Latin1 is easy enough. In theory, or at least
particular theories, headers can be Latin1. And you can represent
arbitrary bytes that way. So if you want to send crazy stuff to the
browser, you can do it that way. And if you want to stick to plain ASCII
then that's easy enough as well. So... native? str or unicode? I'm not
sure specifically for this one.

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

From graham.dumpleton at gmail.com  Tue Sep 22 08:44:55 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 16:44:55 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net>
References: <4AB628C6.1000208@active-4.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
 <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>
 <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net>
Message-ID: <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>

2009/9/22 Mark Nottingham :
> That blog entry is eleven printed pages. Given that PEP 333 also prints as
> eleven pages from my browser, I suspect there's some extraneous information
> in there.
>
> Could you please summarise? Requiring all comers to read such a voluminous
> entry is a considerable (and somewhat arbitrary) bar to entry for the
> discussion.
If you aren't willing to read the PEP to understand WSGI why are you
even wanting to participate in the discussion in the first place? This
is a quite detailed discussion about the future of the WSGI
specification and not an IRC channel manned by ticket monkeys. :-(

Graham

> Thanks,
>
>
> On 22/09/2009, at 4:36 PM, Graham Dumpleton wrote:
>
>> 2009/9/22 Mark Nottingham :
>>>
>>> So, what advice do you propose about decoding bytes into strings for the
>>> request-URI / method / request headers, and vice versa for response
>>> headers and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8?
>>> How are errors handled?
>>>
>>> Are bodies still treated "as binary byte sequences", as per PEP 333?
>>
>> I thought my blog post explained that reasonably well. Ensure you read
>> the numbered definitions.
>>
>> If you can't work it out from the blog, point at the specific thing in
>> the blog you don't understand and can help. Don't really want to go
>> explaining it all again.
>>
>> Graham
>>
>>> Cheers,
>>>
>>> On 22/09/2009, at 4:07 PM, Graham Dumpleton wrote:
>>>
>>>> 2009/9/22 Mark Nottingham :
>>>>>
>>>>> OK, that's quite exhaustive.
>>>>>
>>>>> For the benefit of those of us jumping in, could you summarise your
>>>>> proposal in something like the following manner:
>>>>>
>>>>> 1. How the request method is made available to WSGI applications
>>>>> 2. How the request-uri is made available to WSGI applications -- in
>>>>> particular, whether any decoding of punycode and/or %-escapes happens
>>>>> 3. How request headers are made available to WSGI apps
>>>>> 4. How the request body is made available to WSGI apps
>>>>> 5. Likewise for how apps should expose the response status message,
>>>>> headers and body to WSGI implementations.
>>>>
>>>> Same as the WSGI PEP.
>>>>
>>>> http://www.python.org/dev/peps/pep-0333/
>>>>
>>>> Nothing has changed in that respect.
>>>>
>>>> Graham
>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:
>>>>>
>>>>>> 2009/9/22 Mark Nottingham :
>>>>>>>
>>>>>>> Reference?
>>>>>>
>>>>>> See:
>>>>>>
>>>>>> http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html
>>>>>>
>>>>>> Anyone else jumping in on this conversation with their own opinions
>>>>>> and who has not read it, should perhaps at least read that. Also read
>>>>>> some of the earlier posts in the numerous discussions this spawned at:
>>>>>>
>>>>>> http://groups.google.com/group/python-web-sig?lnk=
>>>>>>
>>>>>> as the current thinking isn't exactly what I blogged about and has
>>>>>> shifted a bit as the discussion has progressed.
>>>>>>
>>>>>> Graham
>>>>>>
>>>>>>> On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:
>>>>>>>
>>>>>>>> 2009/9/22 Mark Nottingham :
>>>>>>>>>
>>>>>>>>> Most things is not the Web. How will you handle serving images
>>>>>>>>> through WSGI? Compressed content? PDFs?
>>>>>>>>
>>>>>>>> You are perhaps misunderstanding something. A WSGI application still
>>>>>>>> should return bytes.
>>>>>>>>
>>>>>>>> The whole concept of any sort of fallback to allow unicode data to
>>>>>>>> be returned for response content was purely so the canonical hello
>>>>>>>> world application as per Python 2.X could still be used on Python 3.X.
>>>>>>>>
>>>>>>>> So, we aren't saying that the only thing WSGI applications can
>>>>>>>> return is unicode strings for response content.
>>>>>>>>
>>>>>>>> Have you read my original blog post that triggered all this
>>>>>>>> discussion this time around?
>>>>>>>>
>>>>>>>> Graham
>>>>>>>>
>>>>>>>>> On 22/09/2009, at 1:30 AM, René Dudfield wrote:
>>>>>>>>>
>>>>>>>>>> here is a summary:
>>>>>>>>>> Apart from python3 compatibility(which should be good enough
>>>>>>>>>> reason), utf-8 is what's used in http a lot these days. Most
>>>>>>>>>> things layered on top of wsgi are using utf-8 (django etc), and
>>>>>>>>>> lots of web clients are using utf-8 (firefox etc).
>>>>>>>>>>
>>>>>>>>>> Why not move to unicode?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mark Nottingham     http://www.mnot.net/
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mark Nottingham     http://www.mnot.net/
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mark Nottingham     http://www.mnot.net/
>>>>>
>>>
>>>
>>> --
>>> Mark Nottingham     http://www.mnot.net/
>>>
>
>
> --
> Mark Nottingham     http://www.mnot.net/
>

From mnot at mnot.net Tue Sep 22 08:52:16 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 16:52:16 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com>
 <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
 <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>
 <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net>
 <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>
Message-ID: <4D7F73CD-223B-4F4B-B2ED-5F2D020A45A5@mnot.net>

You're twisting my words; nowhere did I say I wasn't willing to read the
PEP. What I did say was that a proposal can and should be made in less
than eleven pages; I'd like to give my feedback, both because I use Python
and because I have some interest in HTTP.
However, my time is limited, and I already have a stack of other things to
review on my desk.

He who writes the most words does not (hopefully, for the sake of the
Python community) win. I appreciate that you've taken the time to reason
out a proposal, but the minutiae of how you got to that place should not
obscure the proposal itself.

I'm not sure how to take your "ticket monkeys" comment, so I'll ignore it.

--
Mark Nottingham     http://www.mnot.net/

From alan at xhaus.com Tue Sep 22 09:00:13 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Tue, 22 Sep 2009 08:00:13 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
 <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>
Message-ID: <4a951aa00909220000q4570e90bs64b2e6ed2941ac7@mail.gmail.com>

[Ian]
>> OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO and
>> introduce two equivalent variables that hold the NOT url-decoded values.

[Graham]
> That may be fine for pure Python web servers where you control the
> split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place
> but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as
> that is done by the web server. Also, as pointed out in my blog,
> because of rewrites in web server, it may be difficult to try and map
> SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and
> reclaim original characters.
There is also the problem that often
> FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and
> manual overrides needed to tweak them.

This applies doubly under Java servlets, where different containers take
different approaches to solve these rather hard problems.

It is worth noting that they have to do so because the Java servlet spec,
even under the most recent 2.5, punts on *all* of the issues being
discussed here.

See here for how Tomcat does it. Or half does it, messily.

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

I know this is not helpful ;-)

Alan.

From alan at xhaus.com Tue Sep 22 09:05:43 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Tue, 22 Sep 2009 08:05:43 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <20090921205218.GA222@banane.novuscom.net>
 <20090921230952.GA13477@banane.novuscom.net>
 <88e286470909211616v6bdbe886u2e325d119274d8d9@mail.gmail.com>
 <88e286470909212221wc65fcbdl4b7fc3e586c3a611@mail.gmail.com>
Message-ID: <4a951aa00909220005r2ed3a611nabde772c93ac108e@mail.gmail.com>

[Ian]
> When things get messed up I recommend people use a middleware
> (paste.deploy.config.PrefixMiddleware, though I don't really care what they
> use) to fix up the request to be correct. Pulling it from REQUEST_URI would
> be fine.

That would be unworkable under Java servlet containers, since they each
take a different approach to addressing encoding issues, or fail to deal
with them entirely.

So there would probably have to be a special case for every single one of
these

http://en.wikipedia.org/wiki/List_of_Servlet_containers

Each of which has a number of different ways of being configured in
relation to these issues.

I don't know if it would even be possible to write such a middleware. And
retain all of one's hair.

Alan.
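
For the simple case Ian has in mind, a fix-up middleware might look
roughly like the sketch below. It is only an illustration under stated
assumptions: the class name RequestUriFixup and its prefix argument are
invented here, it is not how paste.deploy.config.PrefixMiddleware works,
REQUEST_URI is a server-specific key rather than part of PEP 333, and
decoding the unquoted bytes as Latin-1 is just one of the conventions under
discussion in this thread. It does nothing about the per-container
differences Alan describes.

```python
from urllib.parse import unquote_to_bytes


class RequestUriFixup:
    """Hypothetical middleware: rebuild SCRIPT_NAME/PATH_INFO from REQUEST_URI."""

    def __init__(self, app, prefix=""):
        self.app = app
        # Mount point of the application, e.g. "/app"; "" means the root.
        self.prefix = prefix.rstrip("/")

    def __call__(self, environ, start_response):
        raw = environ.get("REQUEST_URI")  # assumption: not every server sets this
        if raw is not None:
            # Assumes origin-form ("/path?query"), not an absolute URI.
            path = raw.split("?", 1)[0]
            if path == self.prefix or path.startswith(self.prefix + "/"):
                rest = path[len(self.prefix):]
                environ["SCRIPT_NAME"] = self._decode(self.prefix)
                environ["PATH_INFO"] = self._decode(rest)
        return self.app(environ, start_response)

    @staticmethod
    def _decode(quoted):
        # Undo %-escapes, then map each byte to one code point (Latin-1).
        return unquote_to_bytes(quoted).decode("latin-1")
```

Wrapping an application as RequestUriFixup(app, prefix="/app") leaves the
environ untouched whenever REQUEST_URI is missing or does not start with
the configured prefix, so a wrong guess about the mount point does no
further damage.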
From graham.dumpleton at gmail.com Tue Sep 22 09:10:03 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 22 Sep 2009 17:10:03 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4D7F73CD-223B-4F4B-B2ED-5F2D020A45A5@mnot.net>
References: <4AB628C6.1000208@active-4.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
 <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>
 <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net>
 <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>
 <4D7F73CD-223B-4F4B-B2ED-5F2D020A45A5@mnot.net>
Message-ID: <88e286470909220010l5d747bc1hdfc5700ed1e49aa6@mail.gmail.com>

2009/9/22 Mark Nottingham :
> You're twisting my words; nowhere did I say I wasn't willing to read the
> PEP. What I did say was that a proposal can and should be made in less than
> eleven pages; I'd like to give my feedback, both because I use Python and
> because I have some interest in HTTP. However, my time is limited, and I
> already have a stack of other things to review on my desk.
>
> He who writes the most words does not (hopefully, for the sake of the Python
> community) win. I appreciate that you've taken the time to reason out a
> proposal, but the minutiae of how you got to that place should not obscure
> the proposal itself.
>
> I'm not sure how to take your "ticket monkeys" comment, so I'll ignore it.

Sorry if I come across as being short.

None of us has time and this whole WSGI on Python 3.0 issue has been
going on since the start of last year. Many of us are quite tired of it
all. I also don't personally know who you are, not recollecting seeing
your name in any past discussions. I am told though you were involved
back at the time of the original WSGI specification drafting, so apologies.

The ticket monkeys reference is just the allusion to a help desk. I
always think of what happens when people jump on IRC as being worst
case.
That is, they treat people there like help desk staff who only
exist to serve them and not anyone else. So, you see people who have a
complex problem pose a question in a single line. They then expect an
even more complex solution to their problem, usually expressed in one
line again.

There is a book I have been meaning to read called the 'Trusted
Advisor' which apparently discusses providing assistance to others by
comparing the idea of being like a ticket monkey (help desk) with
building a relationship with people in order to understand their real
issues and provide better solutions. Obviously being an advisor rather
than a help desk is ultimately going to be better for the people needing
help, but if the customer has the frame of mind that you are just the
help desk and doesn't want to put any effort into the relationship, it
is hard to try and be that advisor.

So, I felt a bit like a help desk in the way I interpreted your comments.

Graham

From mnot at mnot.net Tue Sep 22 09:17:39 2009
From: mnot at mnot.net (Mark Nottingham)
Date: Tue, 22 Sep 2009 17:17:39 +1000
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909220010l5d747bc1hdfc5700ed1e49aa6@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com>
 <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
 <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com>
 <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com>
 <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net>
 <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>
 <4D7F73CD-223B-4F4B-B2ED-5F2D020A45A5@mnot.net>
 <88e286470909220010l5d747bc1hdfc5700ed1e49aa6@mail.gmail.com>
Message-ID:

No worries, and apologies for the manner of asking; I just wanted to
provide feedback from an HTTP perspective before it got too far down the
road.
I'm happy to wait a bit longer for it to bake if that's more helpful.

Cheers,

--
Mark Nottingham     http://www.mnot.net/

From armin.ronacher at active-4.com Tue Sep 22 09:51:32 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Tue, 22 Sep 2009 09:51:32 +0200
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <20090922043314.4F4D53A415F@sparrow.telecommunity.com>
References: <4AB628C6.1000208@active-4.com> <4AB7766E.2060703@doxdesk.com>
 <4AB77DDC.7030300@active-4.com>
 <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com>
 <20090921193128.18C623A407A@sparrow.telecommunity.com>
 <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net>
 <20090922043314.4F4D53A415F@sparrow.telecommunity.com>
Message-ID: <4AB88204.6080703@active-4.com>

Hi,

P.J. Eby wrote:
> Actually, latin-1 bytes encoding is the *simplest* thing that could
> possibly work, since it works already in e.g. Jython, and is actually
> in the spec already... and any framework that wants unicode URIs
> already has to decode them, so the code is already written.
Except that nobody implements that, and Jython has a standard Python 2.x
byte string.

Regards,
Armin

From armin.ronacher at active-4.com Tue Sep 22 10:11:57 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Tue, 22 Sep 2009 10:11:57 +0200
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <3A2DCF6D-B8A2-4855-B809-DCD10180A015@mnot.net>
References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> <4AB6413C.5030001@active-4.com> <3A2DCF6D-B8A2-4855-B809-DCD10180A015@mnot.net>
Message-ID: <4AB886CD.6010500@active-4.com>

Hi,

Mark Nottingham wrote:
> HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but
> HTTPbis currently takes the stance that they're ASCII, as in practice
> Latin-1 isn't used and may introduce interop problems.
In practice, non-ASCII data ends up in headers.

> What does it mean to "support non-ASCII headers"? As per above, the
> only sane thing to do is treat them as opaque data, because you can't
> be certain of their encoding unless you have knowledge of the header.
Here is what http.server does in Python 3 (actual code):

    def send_header(self, keyword, value):
        """Send a MIME header."""
        if self.request_version != 'HTTP/0.9':
            self.wfile.write(("%s: %s\r\n" % (keyword,
                value)).encode('ASCII', 'strict'))

        if keyword.lower() == 'connection':
            if value.lower() == 'close':
                self.close_connection = 1
            elif value.lower() == 'keep-alive':
                self.close_connection = 0

So it will give you a nice UnicodeEncodeError if you try to send
anything outside of the ASCII range as a header.
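
A minimal sketch of the failure described above, assuming Python 3. The
helper `encode_header_line` and its `charset` parameter are hypothetical
names used only for illustration; they are not part of http.server. The
helper formats and encodes one header line the same way the quoted
send_header does, so a non-ASCII value fails under strict ASCII but would
pass under latin-1:

```python
def encode_header_line(keyword, value, charset="ascii"):
    """Format one header line and encode it strictly.

    Hypothetical helper: mirrors the '%s: %s\\r\\n' formatting and the
    strict encode step of the quoted send_header, nothing more.
    """
    return ("%s: %s\r\n" % (keyword, value)).encode(charset, "strict")


# A pure-ASCII header encodes without trouble.
ascii_ok = encode_header_line("Content-Type", "text/plain")

# A value containing U+00E9 cannot be encoded as strict ASCII.
try:
    encode_header_line("X-Name", "Ren\u00e9")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same value fits in latin-1, one byte per character.
latin1_line = encode_header_line("X-Name", "Ren\u00e9", charset="latin-1")
```

Whether latin-1 is the right choice for headers is exactly what the thread
is debating; the sketch only shows where the strict ASCII step breaks.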
Regards, Armin From mnot at mnot.net Tue Sep 22 10:16:07 2009 From: mnot at mnot.net (Mark Nottingham) Date: Tue, 22 Sep 2009 18:16:07 +1000 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB886CD.6010500@active-4.com> References: <4AB628C6.1000208@active-4.com> <20090920144350.839F13A403D@sparrow.telecommunity.com> <4AB6413C.5030001@active-4.com> <3A2DCF6D-B8A2-4855-B809-DCD10180A015@mnot.net> <4AB886CD.6010500@active-4.com> Message-ID: <0C3D16B7-3E45-4CFF-8457-C6888A03D2F9@mnot.net> On 22/09/2009, at 6:11 PM, Armin Ronacher wrote: > Hi, > > Mark Nottingham schrieb: >> HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but >> HTTPbis currently takes the stance that they're ASCII, as in practice >> Latin-1 isn't used and may introduce interop problems. > In practise non-ascii data ends up in headers. Yes. However, it shouldn't be encouraged. > >> What does it mean to "support non-ASCII headers"? As per above, the >> only sane thing to do is treat them as opaque data, because you can't >> be certain of their encoding unless you have knowledge of the header. > Here what http.server does in Python 3 (actual code): > > def send_header(self, keyword, value): > """Send a MIME header.""" > if self.request_version != 'HTTP/0.9': > self.wfile.write(("%s: %s\r\n" % (keyword, > value)).encode('ASCII', 'strict')) > > if keyword.lower() == 'connection': > if value.lower() == 'close': > self.close_connection = 1 > elif value.lower() == 'keep-alive': > self.close_connection = 0 > > So it will give you a nice UnicodeEncodeError if you try to send > anything outside of the ASCII range as header. Ouch. 
--
Mark Nottingham http://www.mnot.net/

From armin.ronacher at active-4.com Tue Sep 22 10:16:30 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Tue, 22 Sep 2009 10:16:30 +0200
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To:
References: <4AB628C6.1000208@active-4.com> <4AB70044.8010204@plope.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com> <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com>
Message-ID: <4AB887DE.4010503@active-4.com>

Hi,

Ian Bicking wrote:
> Request headers, which you didn't split out... those I'm not sure. I'd
> *like* them to be native. But damn, I'm just not sure quite how.
> surrogateescape? Latin1? Latin1 as a kind of poor man's surrogateescape
> isn't so bad. And the headers *should* be ASCII for sane requests, so it's
> not a horrible compromise.
Except for cookie headers. Thanks to advertising and all the other
systems putting headers on your page, you can't even properly control that
one.

Another thing to consider: in Python 3.1, the HTTP server internally
decodes to latin1 and there is no simple way to change that, unless you
replace the implementation.

> Ugh. wsgi.input could remain. I think at least it should become a
> file-like interface (i.e., giving an empty string when the content is
> exhausted) and I might even ask that it implement .tell() (.seek() would be
> nice of course, but optional). If there was some other idea, I think
> there's room for improvement on wsgi.input and the file interface.
-1 on seek and tell. This could be impossible to implement, and what we
really want is to keep the data not in memory but on disk, or
wherever you put big-ass uploads. Also, it will be hard to test whether
seek is available, because even if it's a no-op, the method could be
there.

Regards,
Armin

From alan at xhaus.com Tue Sep 22 10:23:32 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Tue, 22 Sep 2009 09:23:32 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB88204.6080703@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com>
Message-ID: <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com>

[P.J. Eby]
>> Actually, latin-1 bytes encoding is the *simplest* thing that could
>> possibly work, since it works already in e.g. Jython, and is actually
>> in the spec already... and any framework that wants unicode URIs
>> already has to decode them, so the code is already written.

[Armin]
> Except that nobody implements that

So, if nobody implements that, then why are we trying to standardise it?

Is there a real need out there?

Or are all these discussions solely driven by the need/desire to have
only unicode strings in the WSGI dictionary under python 3?

Which is a worthy goal, IMHO. Java has been there since the very
start, since java strings have always been unicode. Take a look at the
java docs for HttpServlet: no methods return bytes/bytearrays.

http://java.sun.com/products/servlet/2.5/docs/servlet-2_5-mr2/javax/servlet/http/HttpServletRequest.html

But the java servlet spec still ignores *all* of the encoding concerns
being discussed here. Which means that mistakes/mojibake must happen
all the time. And it's up to the author of the individual java web
application to solve those problems, using a mechanism appropriate for
their needs and local environment.

Java programmers just tolerate this, although they may curse the
developers of the servlet spec for not having solved their specific
problem for them.

Alan.
From ianb at colorstudy.com Tue Sep 22 10:26:46 2009 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 22 Sep 2009 03:26:46 -0500 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB887DE.4010503@active-4.com> References: <4AB628C6.1000208@active-4.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com> <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com> <4AB887DE.4010503@active-4.com> Message-ID: On Tue, Sep 22, 2009 at 3:16 AM, Armin Ronacher wrote: > Hi, > > Ian Bicking schrieb: > > Request headers, which you didn't split out... those I'm not sure. I'd > > *like* them to be native. But damn, I'm just not sure quite how. > > surrogateescape? Latin1? Latin1 as a kind of poor man's surrogateescape > > isn't so bad. And the headers *should* be ASCII for sane requests, so > it's > > not a horrible compromise. > Except for cookie headers. Thanks to advertising and all the other > system putting headers on your page you can't even properly control that > one. > Yes, but it'd be relatively easy to handle this, especially since the raw header isn't very useful. So you just do environ['HTTP_COOKIE'].encode('latin1').decode('utf8', 'replace') before parsing. Another thing to consider: in Python 3.1, the HTTP server internally > decodes to latin1 and there is no simple way to change that, unless you > replace the implementation. > > > Ugh. wsgi.input could remain. I think at least it should become a > > file-like interface (i.e., giving an empty string when the content is > > exausted) and I might even ask that it implement .tell() (.seek() would > be > > nice of course, but optional). If there was some other idea, I think > > there's room for improvement on wsgi.input and the file interface. > -1 on seek and tell. 
This could be impossible to implement and what we > really want to do is to not have the data in memory but on disk or > whereever you put big-ass uploads. Also it will be hard to test for an > avaiable seek or not, because even if it's a noop, the method could be > there. > Tell doesn't have particular overhead except to keep track of how many bytes have been read. That would allow libraries to at least detect contention for wsgi.input. I wish seek were detectable, though I agree it shouldn't be required at all. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker -------------- next part -------------- An HTML attachment was scrubbed... URL: From armin.ronacher at active-4.com Tue Sep 22 10:29:33 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Tue, 22 Sep 2009 10:29:33 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> Message-ID: <4AB88AED.1010003@active-4.com> Hi, Alan Kennedy schrieb: > So, if nobody implements that, then why are we trying to standardise it? I think that was just one of the ideas that were discussed. Just to sum it up a bit where we already went: - my initial plan was going bytes everywhere. Turns out, on Python 3 this is nearly impossible to do because the majority of the standard library went an unicode path, even where bytes would be more appropriate (like cgi.FieldStorage, urllib.parse etc.) - Graham, Robert (and now me as well) try to get charset guessing for URLs going, decide on latin1 for the HTTP headers. 
latin1 could be re-decoded by the application if it really thinks it wanted utf-8 for instance. (Like cookie headers, only I guess only there) - One idea is enforcing unicode for all Python versions - One idea is going unicode for Python 3 and bytestrings for Python 2 - New (and old) discussions bring up the surrogate escapes. So it's quite hard to follow because different people talk about different ideas at the same time. And so far none of them looks really compelling. > Is there a real need out there? In python 3, yes. Because the stdlib no longer works with bytes and the bytes object has few string semantics left. > Which is a worthy goal, IMHO. Java has been there since the very > start, since java strings have always been unicode. Take a look at the > java docs for HttpServlet: no methods return bytes/bytearrays. And people appear to have problems with that, because what they are doing is using a specified charset that is by default iso-8859-1: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding > Java programmers just tolerate this, although they may curse the > developers of the servlet spec for not having solved their specific > problem for them. Many Java apps are also still using latin1 only or have all kinds of problems with charsets. Regards, Armin From armin.ronacher at active-4.com Tue Sep 22 10:31:18 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Tue, 22 Sep 2009 10:31:18 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: References: <4AB628C6.1000208@active-4.com> <20090921151951.2BD973A403D@sparrow.telecommunity.com> <64ddb72c0909210830v7d1aaa50re72c88b968c4bbff@mail.gmail.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com> <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com> <4AB887DE.4010503@active-4.com> Message-ID: <4AB88B56.9060204@active-4.com> Hi, Ian Bicking schrieb: > Tell doesn't have particular overhead except to keep track of how many bytes > have been read. 
> That would allow libraries to at least detect contention
> for wsgi.input. I wish seek were detectable, though I agree it shouldn't be
> required at all.
Ah right. I thought there was no use case left, but that sounds like a
good idea. +1 then. That is, however, something that could go directly into
the updated 0333 (WSGI 1.1).

Regards,
Armin

From alan at xhaus.com Tue Sep 22 11:06:59 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Tue, 22 Sep 2009 10:06:59 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB88AED.1010003@active-4.com>
References: <4AB628C6.1000208@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com>
Message-ID: <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com>

[Alan]
>> Is there a real need out there?

[Armin]
> In python 3, yes. Because the stdlib no longer works with bytes and the
> bytes object has few string semantics left.

Why can't we just do the same as the java servlet spec? I.e.

1. Ignore the encoding issues being discussed
2. Give the programmer (possibly mojibake) unicode strings in the WSGI
   environ anyway
3. And let them solve their problems themselves, using server
   configuration or bespoke middleware

[Alan]
>> Java programmers just tolerate this, although they may curse the
>> developers of the servlet spec for not having solved their specific
>> problem for them.

[Armin]
> Many Java apps are also still using latin1 only or have all kinds of
> problems with charsets.

My point exactly.

Many web developers simply never have to deal with these issues,
perhaps a majority.

The ones that do have to sort it out for themselves.
To do so, the publishers of the various containers give them (non-standard) options to control the decoding of the incoming request and all of its component parts: you cited the Tomcat approach above. Other containers do it differently. Which means that i18n knowledge is not portable between containers. It would be nice if we could avoid such a situation with i18n and WSGI. But I suppose I'm a little dubious that this group can out-do the enormous java community, and the enormous financial resources that Sun, IBM, Oracle, etc, etc, plough into it. And still failed to solve this complex problem satisfactorily. Alan. From armin.ronacher at active-4.com Tue Sep 22 11:28:59 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Tue, 22 Sep 2009 11:28:59 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> Message-ID: <4AB898DB.8080801@active-4.com> Hi, Alan Kennedy schrieb: > 2. Give the programmer (possibly mojibake) unicode strings in the WSGI > environ anyway > 3. And let them solve their problems themselves, using server > configuration or bespoke middleware Because that problem was solved a long ago in applications themselves. Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating on unicode. And the way they do that is straightforward. Now currently what we have to do on Python 3 is to encode the data again and decode it with the target charset. 
Unnecessary roundtrips that just slow the whole thing down. What for?

Regards,
Armin

From alan at xhaus.com Tue Sep 22 11:45:44 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Tue, 22 Sep 2009 10:45:44 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB898DB.8080801@active-4.com>
References: <4AB628C6.1000208@active-4.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com>
Message-ID: <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com>

[Armin]
> Because that problem was solved a long ago in applications themselves.
> Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating
> on unicode. And the way they do that is straightforward.

So what are we all discussing?

Those frameworks obviously have solved all of the problems of decoding
incoming request components, e.g.

1. SCRIPT_NAME
2. PATH_INFO
3. QUERY_STRING
4. Etc

from miscellaneous unknown character sets into unicode, without any
mistakes, under all possible WSGI environments, e.g.

1. Mod_wsgi
2. Modjy (java servlets)
3. IIS
4. CGI
5. FCGI
6. Etc

So why not just adopt one of those mechanisms, e.g. Django, and make
it the de-facto standard?

Since they all deliver unicode, python 3 is no longer a problem, since
it permits only unicode strings.

Alan.
From armin.ronacher at active-4.com Tue Sep 22 12:00:15 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Tue, 22 Sep 2009 12:00:15 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com> <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> Message-ID: <4AB8A02F.20506@active-4.com> Hi, Alan Kennedy schrieb: > from miscellaneous unknown character sets into unicode, with out any > mistakes, under all possible WSGI environments, e.g. No, they know the character sets. You tell them what character set you want to use. For example you can specify "utf-8", and they will decode/encode from/to utf-8. But there is no way for the application to send information to the server before they are invoked to tell the server what encoding they want to use. 
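
A small sketch of "you tell them what character set you want to use",
assuming Python 3.2 or later. It uses the standard library's
`urllib.parse.parse_qs`, not any particular framework's code; the helper
name `decode_query` is hypothetical. The same percent-encoded
QUERY_STRING yields different text depending on the charset the
application picks:

```python
from urllib.parse import parse_qs


def decode_query(environ, charset="utf-8"):
    """Parse QUERY_STRING using an application-chosen charset.

    Hypothetical helper: the charset is decided by the application,
    never guessed from the request.
    """
    query = environ.get("QUERY_STRING", "")
    return parse_qs(query, encoding=charset, errors="replace")


# 'é' sent by a browser as percent-encoded UTF-8 bytes.
environ = {"QUERY_STRING": "name=Ren%C3%A9"}

as_utf8 = decode_query(environ, "utf-8")      # application expects utf-8
as_latin1 = decode_query(environ, "latin-1")  # application expects latin-1
```

With utf-8 the value comes back as the intended name; with latin-1 the
two bytes become two separate characters, which is the mojibake discussed
earlier in the thread.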
Regards, Armin From alan at xhaus.com Tue Sep 22 12:30:34 2009 From: alan at xhaus.com (Alan Kennedy) Date: Tue, 22 Sep 2009 11:30:34 +0100 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4AB8A02F.20506@active-4.com> References: <4AB628C6.1000208@active-4.com> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com> <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> <4AB8A02F.20506@active-4.com> Message-ID: <4a951aa00909220330i45631b7cr6d8265107584b26d@mail.gmail.com> [Armin] > No, they know the character sets. Hmmm, define "know" ;-) [Armin] > You tell them what character set you > want to use. For example you can specify "utf-8", and they will > decode/encode from/to utf-8. But there is no way for the application to > send information to the server before they are invoked to tell the > server what encoding they want to use. I see this as being the same as Graham's suggested approach of a per-server configurable charset, which is then stored in the WSGI dictionary, so that applications that have problems, i.e. that detect mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo the faulty decoding by the server. Alan. 
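
As a rough sketch of how an application could "undo the faulty decoding
by the server", assuming the server decoded the raw bytes as latin-1 (as
suggested for headers elsewhere in this thread). The function `redecode`
and its parameter names are hypothetical; no WSGI environ key for the
server charset is defined in this discussion, so the charset is passed in
explicitly:

```python
def redecode(value, server_charset="latin-1", app_charset="utf-8"):
    """Recover the original bytes, then decode with the app's charset.

    Hypothetical helper. This is lossless only because latin-1 maps
    every byte to exactly one code point.
    """
    return value.encode(server_charset).decode(app_charset, "replace")


# Bytes on the wire: a UTF-8 encoded path.
raw_path = "/caf\u00e9".encode("utf-8")

# What a server that decodes everything as latin-1 would hand over.
environ = {"PATH_INFO": raw_path.decode("latin-1")}

# The application re-decodes with the charset it actually wants.
fixed = redecode(environ["PATH_INFO"])
```

This is the same encode/decode round trip as the cookie one-liner quoted
earlier, and it is the extra work Armin describes as an unnecessary
roundtrip.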

From renesd at gmail.com Tue Sep 22 11:26:30 2009
From: renesd at gmail.com (René Dudfield)
Date: Tue, 22 Sep 2009 10:26:30 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com>
Message-ID: <64ddb72c0909220226j3b3cab4dsc638fee312160bfc@mail.gmail.com>

On Tue, Sep 22, 2009 at 10:06 AM, Alan Kennedy wrote:
> [Alan]
>>> Is there a real need out there?
>
> [Armin]
>> In python 3, yes. Because the stdlib no longer works with bytes and the
>> bytes object has few string semantics left.
>
> Why can't we just do the same as the java servlet spec? I.E.
>
> 1. Ignore the encoding issues being discussed
> 2. Give the programmer (possibly mojibake) unicode strings in the WSGI
> environ anyway
> 3. And let them solve their problems themselves, using server
> configuration or bespoke middleware
>
> [Alan]
>>> Java programmers just tolerate this, although they may curse the
>>> developers of the servlet spec for not having solved their specific
>>> problem for them.
>
> [Armin]
>> Many Java apps are also still using latin1 only or have all kinds of
>> problems with charsets.
>
> My point exactly.
>
> Many web developers simply never have to deal with these issues,
> perhaps a majority.
>
> The ones that do have to sort it out for themselves.
>
> To do so, the publishers of the various containers give them
> (non-standard) options to control the decoding of the incoming request
> and all of its component parts: you cited the Tomcat approach above.
> Other containers do it differently.
Which means that i18n knowledge is > not portable between containers. > > It would be nice if we could avoid such a situation with i18n and WSGI. > > But I suppose I'm a little dubious that this group can out-do the > enormous java community, and the enormous financial resources that > Sun, IBM, Oracle, etc, etc, plough into it. And still failed to solve > this complex problem satisfactorily. > > Alan. I think it's worth discussing and working something out that's good (good in various ways). As this is a python group, I think most of us think python does a whole bunch of things better than java(maybe wrongly... but still) ;-) cu, From armin.ronacher at active-4.com Tue Sep 22 12:42:03 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Tue, 22 Sep 2009 12:42:03 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4a951aa00909220330i45631b7cr6d8265107584b26d@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com> <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> <4AB8A02F.20506@active-4.com> <4a951aa00909220330i45631b7cr6d8265107584b26d@mail.gmail.com> Message-ID: <4AB8A9FB.6010904@active-4.com> Hi, Alan Kennedy schrieb: > Hmmm, define "know" ;-) The charset of incoming data, the charset of URLs, the charset of outgoing data, the charset of whatever the application uses, is what the application decides it to be. Most new applications go with utf-8 for everything these days. > I see this as being the same as Graham's suggested approach of a > per-server configurable charset, which is then stored in the WSGI > dictionary. SCRIPT_NAME and PATH_INFO are different because URLs as entered by the user will always be utf-8 in modern browsers. 
Even if the application decides to have latin1 URLs.

Of course a server configuration variable would be a solution for many
of these problems, but I don't like the idea of changing application
behavior based on server configuration. At that point we will finally
have successfully killed the idea of nested WSGI applications, because
those could depend on different charsets.

Regards,
Armin

From alan at xhaus.com Tue Sep 22 13:00:06 2009
From: alan at xhaus.com (Alan Kennedy)
Date: Tue, 22 Sep 2009 12:00:06 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB8A9FB.6010904@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com> <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> <4AB8A02F.20506@active-4.com> <4a951aa00909220330i45631b7cr6d8265107584b26d@mail.gmail.com> <4AB8A9FB.6010904@active-4.com>
Message-ID: <4a951aa00909220400k415ed31obf51ef892072d8ac@mail.gmail.com>

[Armin]
> Of course a server configuration variable would be a solution for many
> of these problems, but I don't like the idea of changing application
> behavior based on server configuration.

So you don't like the way that Django, Werkzeug, WebOb, etc, do it
now, even though they appear to be mostly successful, and you're happy
to cite them as such?

From the application's point of view, a framework-level configuration
variable is the same as a server-level configuration variable.

> At that point we will finally
> have successfully killed the idea of nested WSGI applications, because
> those could depend on different charsets.

Wouldn't well-written applications depend on unicode?
The server configured charset is simply an explicit statement of the character set from which incoming requests are to be decoded, into unicode, and no other character set. Alan. From armin.ronacher at active-4.com Tue Sep 22 13:12:04 2009 From: armin.ronacher at active-4.com (Armin Ronacher) Date: Tue, 22 Sep 2009 13:12:04 +0200 Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes In-Reply-To: <4a951aa00909220400k415ed31obf51ef892072d8ac@mail.gmail.com> References: <4AB628C6.1000208@active-4.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com> <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> <4AB8A02F.20506@active-4.com> <4a951aa00909220330i45631b7cr6d8265107584b26d@mail.gmail.com> <4AB8A9FB.6010904@active-4.com> <4a951aa00909220400k415ed31obf51ef892072d8ac@mail.gmail.com> Message-ID: <4AB8B104.9050406@active-4.com> Hi, Alan Kennedy schrieb: > So you don't like the way that Django, Werkzeug, WebOb, etc, do it > now, even though they appear to be mostly successful, and you're happy > to cite them as such? Server != Application. > From the applications point of view, a framework-level configuration > variable is the same as a server-level configuration variable. It is not. I can configure my framework from within Python code, But I cannot change the webserver configuration from there. > Wouldn't well-written applications depend on unicode? Only internally. There is no such thing as Unicode in HTTP. 

Regards,
Armin

From renesd at gmail.com Tue Sep 22 13:18:24 2009
From: renesd at gmail.com (René Dudfield)
Date: Tue, 22 Sep 2009 12:18:24 +0100
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB8B104.9050406@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB88AED.1010003@active-4.com> <4a951aa00909220206x64eb24b3xf93645c03d1c1a88@mail.gmail.com> <4AB898DB.8080801@active-4.com> <4a951aa00909220245t6edcb702nd0e54818de1244a2@mail.gmail.com> <4AB8A02F.20506@active-4.com> <4a951aa00909220330i45631b7cr6d8265107584b26d@mail.gmail.com> <4AB8A9FB.6010904@active-4.com> <4a951aa00909220400k415ed31obf51ef892072d8ac@mail.gmail.com> <4AB8B104.9050406@active-4.com>
Message-ID: <64ddb72c0909220418r41460225k2fad1d37dd52652e@mail.gmail.com>

On Tue, Sep 22, 2009 at 12:12 PM, Armin Ronacher wrote:
> Hi,
>
> Alan Kennedy wrote:
>> So you don't like the way that Django, Werkzeug, WebOb, etc, do it
>> now, even though they appear to be mostly successful, and you're happy
>> to cite them as such?
> Server != Application.
>
>> From the applications point of view, a framework-level configuration
>> variable is the same as a server-level configuration variable.
> It is not. I can configure my framework from within Python code, But I
> cannot change the webserver configuration from there.
>
>> Wouldn't well-written applications depend on unicode?
> Only internally. There is no such thing as Unicode in HTTP.
>

hi,

other points I agree with... However, remember that there is unicode
in HTTP these days. As per previous conversation on RFCs stating so...
and real world use of unicode in HTTP.

cheers,

From mdipierro at cs.depaul.edu Tue Sep 22 15:43:53 2009
From: mdipierro at cs.depaul.edu (Massimo Di Pierro)
Date: Tue, 22 Sep 2009 08:43:53 -0500
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4AB88AED.1010003@active-4.com>
References: <4AB628C6.1000208@active-4.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com> <20090921193128.18C623A407A@sparrow.telecommunity.com> <47958425-91EE-44AD-9A17-8DB39E329FEE@mnot.net> <20090922043314.4F4D53A415F@sparrow.telecommunity.com> <4AB88204.6080703@active-4.com> <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com> <4AB88AED.1010003@active-4.com>
Message-ID:

Thank you Armin, this makes things clear to me (a newbie here).

On Sep 22, 2009, at 3:29 AM, Armin Ronacher wrote:

> - my initial plan was going bytes everywhere. Turns out, on Python 3
> this is nearly impossible to do because the majority of the standard
> library went an unicode path, even where bytes would be more
> appropriate (like cgi.FieldStorage, urllib.parse etc.)

I would have taken the same stand.

> - Graham, Robert (and now me as well) try to get charset guessing for
> URLs going, decide on latin1 for the HTTP headers. latin1 could be
> re-decoded by the application if it really thinks it wanted utf-8
> for instance. (Like cookie headers, only I guess only there)

If WSGI guesses the charset beforehand, will the application always be
able to derive the original strings?

> - One idea is enforcing unicode for all Python versions
>
> - One idea is going unicode for Python 3 and bytestrings for Python 2

For what it's worth, I prefer the latter option.

From pje at telecommunity.com Tue Sep 22 15:55:37 2009
From: pje at telecommunity.com (P.J.
Eby)
Date: Tue, 22 Sep 2009 09:55:37 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <88e286470909211907k6c27dff1jff5ee9c8eef4d4ea@mail.gmail.com> <88e286470909211926s7209dc20ye87128625eda2cc@mail.gmail.com> <88e286470909212307k5ae41b55o67de897e10d35458@mail.gmail.com> <88e286470909212336t35276dd1h2cc99dc9c45527a8@mail.gmail.com> <3DEC42D4-CE77-4C21-9895-B683FA0973F3@mnot.net> <88e286470909212344x30eb471ewcbde61daf1858b7d@mail.gmail.com>
Message-ID: <20090922135537.DC3823A4079@sparrow.telecommunity.com>

At 04:44 PM 9/22/2009 +1000, Graham Dumpleton wrote:
>2009/9/22 Mark Nottingham :
> > That blog entry is eleven printed pages. Given that PEP 333 also prints as
> > eleven pages from my browser, I suspect there's some extraneous information
> > in there.
> >
> > Could you please summarise? Requiring all comers to read such a voluminous
> > entry is a considerable (and somewhat arbitrary) bar to entry for the
> > discussion.
>
>If you aren't willing to read the PEP to understand WSGI why are you
>even wanting to participate in the discussion in the first place? This
>is a quite detailed discussion about the future of the WSGI
>specification and not an IRC channel manned by ticket monkeys. :-(

Um, Graham, Mark was a major contributor to the original PEP. See:

http://www.python.org/dev/peps/pep-0333/#acknowledgements

I assure you, he's read the PEP quite thoroughly. ;-)

From pje at telecommunity.com Tue Sep 22 16:01:07 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Tue, 22 Sep 2009 10:01:07 -0400
Subject: [Web-SIG] Request for Comments on upcoming WSGI Changes
In-Reply-To: <4a951aa00909220123x5846cb79y8db05c9589b7c78d@mail.gmail.com>
References: <4AB628C6.1000208@active-4.com> <4AB77DDC.7030300@active-4.com> <64ddb72c0909211057y1299b788p8b141d8ce84f16c4@mail.gmail.com>