From casey at zope.com  Mon Dec  1 15:18:33 2003
From: casey at zope.com (Casey Duncan)
Date: Mon Dec  1 15:21:47 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike [was:
	Re: Python version...]
In-Reply-To: <Pine.LNX.4.58.0311302036320.1486@alice>
References: <3FC4D804.70201@bath.ac.uk>
	<Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice> <E1AQY9A-0006Wp-00@giles>
	<Pine.LNX.4.58.0311302036320.1486@alice>
Message-ID: <20031201151833.4b9004fa.casey@zope.com>

On Sun, 30 Nov 2003 21:13:54 +0000 (GMT)
John J Lee <jjl@pobox.com> wrote:

> On Sun, 30 Nov 2003, Stuart Langridge wrote:
> > John J Lee spoo'd forth:
> [...]
> > > Is this aimed at the standard library?  xml.dom.ext.reader.HtmlLib?
> [...]
> > Um. What I was looking for was something that could parse HTML
> > (including invalid HTML) and give me a DOM tree. I tried Twisted's
> 
> Fine, but what we're talking about here is what should go into Python's
> standard library.
> 
> [...]
> > I think
> > that a DOM parser for HTML is pretty important, even if that parser
> > *actually* just does "convert broken HTML to valid XHTML and then feed
> > it to minidom" or something similar. Are there any others?
> 
> There are lots of XML DOM implementations for Python (only one HTML DOM
> implementation, though: 4DOM -- and that's out of date), including the one
> that's already in the standard library.  Parsing arbitrary HTML is hard,
> though (xml.dom.ext.reader.HtmlLib doesn't even manage to generate an HTML
> DOM from arbitrary *correct* HTML, and correct HTML is not often seen in
> the wild ;-).  tidylib is the only sane way I know of.  See below.

Hmmm, it sounds to me like implementing/updating the HTML parsing built into python is something worth considering if it blocks several other possible initiatives.

HTML may be on the way out, but I think we're stuck with it for the foreseeable future.

-Casey (running away ;^)

From jjl at pobox.com  Mon Dec  1 15:55:47 2003
From: jjl at pobox.com (John J Lee)
Date: Mon Dec  1 15:55:58 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031201151833.4b9004fa.casey@zope.com>
References: <3FC4D804.70201@bath.ac.uk>
	<Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice>
	<E1AQY9A-0006Wp-00@giles> <Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
Message-ID: <Pine.LNX.4.58.0312012052380.2217@alice>

On Mon, 1 Dec 2003, Casey Duncan wrote:
[...]
> Hmmm, it sounds to me like implementing/updating the HTML parsing built
> into python is something worth considering if it blocks several other
> possible initiatives.
[...]

Problems:

1. no volunteer to write a plain-old-C-API wrapper of tidylib

2. tidylib was still a moving target last time I looked (maybe by the time
   2.4 comes out, it will have settled down)...


John

From manfred.stienstra at dwerg.net  Mon Dec  1 17:28:41 2003
From: manfred.stienstra at dwerg.net (Manfred Stienstra)
Date: Mon Dec  1 17:29:48 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <Pine.LNX.4.58.0312012052380.2217@alice>
References: <3FC4D804.70201@bath.ac.uk>
	<Pine.LNX.4.58.0311270125360.3823@alice> <3FC61716.90909@bath.ac.uk>
	<E1AQOMM-0001A4-00@giles> <Pine.LNX.4.58.0311301418020.568@alice>
	<E1AQY9A-0006Wp-00@giles> <Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
Message-ID: <1070317721.2301.7.camel@ack.dwerg.net>

On Mon, 2003-12-01 at 21:55, John J Lee wrote:
> 1. no volunteer to write a plain-old-C-API wrapper of tidylib

Tidylib is written in C.

http://tidy.sourceforge.net/libintro.html

Manfred


From casey at zope.com  Mon Dec  1 17:36:27 2003
From: casey at zope.com (Casey Duncan)
Date: Mon Dec  1 17:39:46 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <Pine.LNX.4.58.0312012052380.2217@alice>
References: <3FC4D804.70201@bath.ac.uk>
	<Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice> <E1AQY9A-0006Wp-00@giles>
	<Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
Message-ID: <20031201173627.741b88f9.casey@zope.com>

On Mon, 1 Dec 2003 20:55:47 +0000 (GMT)
John J Lee <jjl@pobox.com> wrote:

> On Mon, 1 Dec 2003, Casey Duncan wrote:
> [...]
> > Hmmm, it sounds to me like implementing/updating the HTML parsing built
> > into python is something worth considering if it blocks several other
> > possible initiatives.
> [...]
> 
> Problems:
> 
> 1. no volunteer to write a plain-old-C-API wrapper of tidylib

I'll look into this, but I'll hold off volunteering until I see how big the API is. I suspect not very.

-Casey

From casey at zope.com  Tue Dec  2 00:11:12 2003
From: casey at zope.com (Casey Duncan)
Date: Tue Dec  2 00:12:23 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
References: <3FC4D804.70201@bath.ac.uk><Pine.LNX.4.58.0311270125360.3823@alice><3FC61716.90909@bath.ac.uk>
	<E1AQOMM-0001A4-00@giles><Pine.LNX.4.58.0311301418020.568@alice>
	<E1AQY9A-0006Wp-00@giles><Pine.LNX.4.58.0311302036320.1486@alice><20031201151833.4b9004fa.casey@zope.com><Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
Message-ID: <001601c3b892$b4a00db0$6401a8c0@khatru>

> On Mon, 1 Dec 2003 20:55:47 +0000 (GMT)
> John J Lee <jjl@pobox.com> wrote:
[snip]
> > Problems:
> >
> > 1. no volunteer to write a plain-old-C-API wrapper of tidylib
>
> I'll look into this, but I'll hold off volunteering until I see how big
> the API is. I suspect not very.

After looking at it I'd say it's certainly a non-trivial task to wrap (by
hand), depending on what the real needs are. Do we simply want a 1-to-1
(perhaps swigged) wrapper, do we want something pythonic, or what? The
latter is obviously more involved and would need much more discussion and
vetting, especially given its DOM-ish aspirations.

Perhaps the most reasonable approach would be to generate a simple low-level
wrapper first and then gradually develop a high-level interface to it,
mostly written in Python. That might also insulate us from future API
changes to tidy better.

-Casey


From sholden at holdenweb.com  Tue Dec  2 07:39:08 2003
From: sholden at holdenweb.com (Steve Holden)
Date: Tue Dec  2 07:43:51 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <001601c3b892$b4a00db0$6401a8c0@khatru>
Message-ID: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>

> -----Original Message-----
> From: web-sig-bounces+sholden=holdenweb.com@python.org
> [mailto:web-sig-bounces+sholden=holdenweb.com@python.org]On Behalf Of
> Casey Duncan
> Sent: Tuesday, December 02, 2003 12:11 AM
> To: Casey Duncan; John J Lee
> Cc: web-sig@python.org
> Subject: Re: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
>
>
> > On Mon, 1 Dec 2003 20:55:47 +0000 (GMT)
> > John J Lee <jjl@pobox.com> wrote:
> [snip]
> > > Problems:
> > >
> > > 1. no volunteer to write a plain-old-C-API wrapper of tidylib
> >
> > > I'll look into this, but I'll hold off volunteering until I see how
> > > big the API is. I suspect not very.
>
> After looking at it I'd say it's certainly a non-trivial task to wrap
> (by hand), depending on what the real needs are. Do we simply want a
> 1-to-1 (perhaps swigged) wrapper, do we want something pythonic, or
> what? The latter is obviously more involved and would need much more
> discussion and vetting, especially given its DOM-ish aspirations.
>
> Perhaps the most reasonable approach would be to generate a simple
> low-level wrapper first and then gradually develop a high-level
> interface to it, mostly written in Python. That might also insulate us
> from future API changes to tidy better.
>
I think we also want to consider seriously whether tidy is what we need.
Does it really provide a necessary function? And, even if it does, how
valuable would that function be? I wasn't impressed with tidy in either
of the two attempts I made to use it.

Then, of course, there's the question of prior art:

	http://www.lemburg.com/files/python/mxTidy.html

might be worth looking at before you go too much further ...

regards
--
Steve Holden          +1 703 278 8281        http://www.holdenweb.com/
Improve the Internet           http://vancouver-webpages.com/CacheNow/
Python Web Programming                http://pydish.holdenweb.com/pwp/
Interview with GvR August 14, 2003       http://www.onlamp.com/python/


From casey at zope.com  Tue Dec  2 09:16:52 2003
From: casey at zope.com (Casey Duncan)
Date: Tue Dec  2 09:19:29 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031201173627.741b88f9.casey@zope.com>
References: <3FC4D804.70201@bath.ac.uk>
	<Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice> <E1AQY9A-0006Wp-00@giles>
	<Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
Message-ID: <20031202091652.18f2daea.casey@zope.com>

On Mon, 1 Dec 2003 17:36:27 -0500
Casey Duncan <casey@zope.com> wrote:
[snip]
> I'll look into this, but I'll hold off volunteering until I see how big the API is. I suspect not very.

After looking at it I think it is reasonable to wrap. It looks to be designed with that in mind. Ironically it seems that mxTidy was an inspiration for tidylib, so wrapping it will bring it full circle.

I see the process going in two phases:

1. A low-level wrapper that exposes the C API directly, with only small pythonifications, like proper exception handling, simple type mapping, etc.

2. A high-level OO API specifically designed for use with Python.

I volunteer for phase 1. Actually I will do a phase 0 first which will just be a stupid wrapper that exposes the API and nothing else. From there we can discuss what needs to be done to complete phase 1.

This looks like a good job for SWIG, does anyone oppose using it?

-Casey

From aquarius-lists at kryogenix.org  Tue Dec  2 09:54:22 2003
From: aquarius-lists at kryogenix.org (Stuart Langridge)
Date: Tue Dec  2 09:52:07 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
Message-ID: <E1ARBuw-0001mW-00@giles>

Steve Holden spoo'd forth:
> I think we also want to consider seriously whether tidy is what we need.
> Does it really provide a necessary function? And, even if it does, how
> valuable would that function be? I wasn't impressed with tidy in either
> of the two attempts I made to use it.

I don't see that tidy's ability to tidy HTML per se is useful, but I
think that it's very useful in that it can take invalid HTML and
convert it to valid XHTML. That way, we can get a DOM tree from invalid
HTML, which is very useful...
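
The second half of that pipeline (XHTML in, DOM tree out) can be
sketched with the standard library alone. This is only a minimal
sketch: the xhtml string is hand-written to stand in for whatever a
cleanup step might emit (it is not real tidy output), and text_of is a
helper made up for the example.

```python
from xml.dom import minidom

# Hand-written stand-in for the output of an HTML-to-XHTML cleanup step
# (an assumption for illustration, not real tidy output).
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Quotes</title></head>"
    "<body>"
    '<p class="source">Guido van Rossum, 6 Dec 91</p>'
    '<p class="source">Tim Peters, 4 Mar 92</p>'
    "</body></html>"
)


def text_of(node):
    """Concatenate the text nodes directly under `node` (hypothetical helper)."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


# Because the input is well-formed XML, any XML DOM parser will do.
dom = minidom.parseString(xhtml)
sources = [
    text_of(p)
    for p in dom.getElementsByTagName("p")
    if p.getAttribute("class") == "source"
]
print(sources)
```

The parse only works because the input is already well-formed; feeding
the original tag soup to minidom would raise a parse error, which is
why the cleanup step matters.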

sil

-- 
"Willow hath gat hare off rede
 And doth geev soopurb heede.
 Buffy, as written by Geoffrey Chaucer, the dirty mediaeval git."
	   -- Andy Spencer, after Certic

From cs1spw at bath.ac.uk  Tue Dec  2 11:07:19 2003
From: cs1spw at bath.ac.uk (Simon Willison)
Date: Tue Dec  2 11:07:24 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <E1ARBuw-0001mW-00@giles>
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles>
Message-ID: <3FCCB8B7.8070102@bath.ac.uk>

Stuart Langridge wrote:
> I don't see that tidy's ability to tidy HTML per se is useful, but I
> think that it's very useful in that it can take invalid HTML and
> convert it to valid XHTML. That way, we can get a DOM tree from invalid
> HTML, which is very useful...

Is there any way we could get a DOM tree from invalid HTML using pure 
Python tools? The HTML tools in the Python standard library at the 
moment are all pure Python. Could we even use the existing sgmllib 
module (or an extension of it) to create our own DOM tree from invalid HTML?
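
To make the question concrete, here is roughly what a pure-Python
builder could look like. It is a sketch under assumptions, not a
proposal: it uses html.parser from current Python (the descendant of
the HTMLParser module) instead of sgmllib, the SoupToDom and
parse_soup names and the synthetic "root" element are invented for the
example, and the recovery rule (close back to the nearest matching
open element, drop stray end tags) is one simple choice among many.

```python
from html.parser import HTMLParser
from xml.dom import minidom


class SoupToDom(HTMLParser):
    """Build a minidom tree from HTML, tolerating mismatched tags."""

    # Elements that never have content, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        root = self.doc.createElement("root")  # synthetic container
        self.doc.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # an end tag that matches nothing open is silently dropped.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tagName == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_soup(html):
    builder = SoupToDom()
    builder.feed(html)
    builder.close()
    return builder.doc


dom = parse_soup("<p>one<b>bold<i>both</b> stray</i><p>two")
print(dom.documentElement.toxml())
```

Note what the sketch does not do: it has no knowledge of optional end
tags, so the second <p> ends up nested inside the first one instead of
becoming its sibling. Encoding that kind of HTML-specific knowledge is
the hard part.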


From aquarius-lists at kryogenix.org  Tue Dec  2 11:13:22 2003
From: aquarius-lists at kryogenix.org (Stuart Langridge)
Date: Tue Dec  2 11:11:07 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles> <3FCCB8B7.8070102@bath.ac.uk>
Message-ID: <E1ARD9O-0002Ev-00@giles>

Simon Willison spoo'd forth:
> Stuart Langridge wrote:
>> I don't see that tidy's ability to tidy HTML per se is useful, but I
>> think that it's very useful in that it can take invalid HTML and
>> convert it to valid XHTML. That way, we can get a DOM tree from invalid
>> HTML, which is very useful...
> 
> Is there any way we could get a DOM tree from invalid HTML using pure 
> Python tools? The HTML tools in the Python standard library at the 
> moment are all pure Python. Could we even use the existing sgmllib 
> module (or an extension of it) to create our own DOM tree from invalid HTML?

Presumably we could (the existing things, like HtmlLib or microdom do
it); I was just thinking of not having to implement it if we didn't have
to :)
I'm not all that hot on sgmllib, either -- parsing invalid HTML strikes
me as being pretty hard, since browsers have to try hard to do it. I
don't know, however, if the hard thing is *displaying* it right rather
than just *parsing* it.
Thought: Grail was a browser, so it might have done it?

sil

-- 
2. Make it halfway normal. I don't have any use for
laser-beam-shooting pocket combs, or non-existent existents existing
within their own existences, or ballpoint pens made out of lettuce.
	   -- CardinalT dictates rules for the raif Silly Game

From casey at zope.com  Tue Dec  2 11:58:46 2003
From: casey at zope.com (Casey Duncan)
Date: Tue Dec  2 12:02:12 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <3FCCB8B7.8070102@bath.ac.uk>
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles> <3FCCB8B7.8070102@bath.ac.uk>
Message-ID: <20031202115846.52a37f1a.casey@zope.com>

On Tue, 02 Dec 2003 10:07:19 -0600
Simon Willison <cs1spw@bath.ac.uk> wrote:

> Stuart Langridge wrote:
> > I don't see that tidy's ability to tidy HTML per se is useful, but I
> > think that it's very useful in that it can take invalid HTML and
> > convert it to valid XHTML. That way, we can get a DOM tree from invalid
> > HTML, which is very useful...
> 
> Is there any way we could get a DOM tree from invalid HTML using pure 
> Python tools? The HTML tools in the Python standard library at the 
> moment are all pure Python. Could we even use the existing sgmllib 
> module (or an extension of it) to create our own DOM tree from invalid HTML?

According to the docs, tidylib exposes a DOM-like interface for walking the document tree of documents it has parsed. My understanding is that this is designed to work for broken HTML up to valid XHTML. If it works as advertised, it could be a good engine to put behind a nice python api.

See: http://tidy.sourceforge.net/docs/api/group__Tree.html

The API gets a bit verbose in places (separate functions to test for each tag and attribute type). These look like complements to the generic functions, perhaps to avoid putting too much HTML knowledge directly in the user code.

Also, tidylib's memory allocation is hookable, in case we wanted to use Python's malloc/free (not sure whether we need to).

-Casey

From jjl at pobox.com  Tue Dec  2 14:35:36 2003
From: jjl at pobox.com (John J Lee)
Date: Tue Dec  2 14:35:44 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <3FCCB8B7.8070102@bath.ac.uk>
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles> <3FCCB8B7.8070102@bath.ac.uk>
Message-ID: <Pine.LNX.4.58.0312021933350.1391@alice>

On Tue, 2 Dec 2003, Simon Willison wrote:
[...]
> Is there any way we could get a DOM tree from invalid HTML using pure
> Python tools? The HTML tools in the Python standard library at the
[...]

No chance.  A lot of work has gone into HTMLTidy / tidylib; reimplementing
it would be a lot of work for little benefit.


John

From jjl at pobox.com  Tue Dec  2 14:37:45 2003
From: jjl at pobox.com (John J Lee)
Date: Tue Dec  2 14:37:52 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <E1ARD9O-0002Ev-00@giles>
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles> <3FCCB8B7.8070102@bath.ac.uk>
	<E1ARD9O-0002Ev-00@giles>
Message-ID: <Pine.LNX.4.58.0312021936030.1391@alice>

On Tue, 2 Dec 2003, Stuart Langridge wrote:

> Simon Willison spoo'd forth:
[...]
> > Is there any way we could get a DOM tree from invalid HTML using pure
> > Python tools? The HTML tools in the Python standard library at the
[...]
> Presumably we could (the existing things, like HtmlLib or microdom do
> it);
[...]

No, they don't.  There's a whole wonderful world <wink> of invalid HTML
out there, that sgmllib and xml.dom.ext.reader.HtmlLib know nothing about.


John

From jjl at pobox.com  Tue Dec  2 14:39:10 2003
From: jjl at pobox.com (John J Lee)
Date: Tue Dec  2 14:39:17 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
Message-ID: <Pine.LNX.4.58.0312021920590.1391@alice>

[...]
> > wrapper first and then gradually develop a high-level interface to it,
> > mostly written in Python. That might also insulate us from future API
> > changes to tidy better.
> >
> I think we also want to consider seriously whether tidy is what we need.
> Does it really provide a necessary function? And, even if it does, how
> valuable would that function be?

Parsing arbitrary (including broken) HTML reliably.  Processing that HTML
with XML tools.

Whether that's "necessary" or valuable is a matter for debate, obviously.


> I wasn't impressed with tidy in either
> of the two attempts I made to use it.
>
> Then, of course, there's the question of prior art:
>
> 	http://www.lemburg.com/files/python/mxTidy.html
>
> might be worth looking at before you go too much further ...

mxTidy and tidylib are based on the same code (HTMLTidy).  tidylib is
being actively maintained (though that may be a mixed blessing, depending
on the relative proportions of old and newly-introduced bugs).


John

From jjl at pobox.com  Tue Dec  2 14:44:10 2003
From: jjl at pobox.com (John J Lee)
Date: Tue Dec  2 14:44:16 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031202091652.18f2daea.casey@zope.com>
References: <3FC4D804.70201@bath.ac.uk>
	<Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice>
	<E1AQY9A-0006Wp-00@giles> <Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
	<20031202091652.18f2daea.casey@zope.com>
Message-ID: <Pine.LNX.4.58.0312021939230.1391@alice>

On Tue, 2 Dec 2003, Casey Duncan wrote:
[...]
> 1. A low-level wrapper that exposes the C API directly, with only small
> pythonifications, like proper exception handling, simple type mapping,
> etc.
>
> 2. A high-level OO API specifically designed for use with Python.
>
> I volunteer for phase 1. Actually I will do a phase 0 first which will
> just be stupid wrapper that exposes the API and nothing else. From there
> we can discuss what needs to be done to complete phase 1.

Great!

Maybe it's worth bouncing the idea off python-dev first, though, in case
it gets ruled out by the BDFL (unlikely, I suspect, but I don't know).
Unless you want it regardless of whether it's in the library, of course.


> This looks like a good job for SWIG, does anyone oppose using it?

That sounds like another question for python-dev.


John

From gward at python.net  Tue Dec  2 22:28:03 2003
From: gward at python.net (Greg Ward)
Date: Tue Dec  2 22:28:08 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031202091652.18f2daea.casey@zope.com>
References: <Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice> <E1AQY9A-0006Wp-00@giles>
	<Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
	<20031202091652.18f2daea.casey@zope.com>
Message-ID: <20031203032803.GA2473@cthulhu.gerg.ca>

On 02 December 2003, Casey Duncan said:
> I volunteer for phase 1. Actually I will do a phase 0 first which will
> just be stupid wrapper that exposes the API and nothing else. From
> there we can discuss what needs to be done to complete phase 1.
> 
> This looks like a good job for SWIG, does anyone oppose using it?

Note that the current Berkeley DB wrapper did not get into the standard
library until AMK rewrote it by hand with no hint of SWIG.  (And even
then, it took a year or two before the bsddb in 2.3 got in.)

As I recall, there were Serious Reservations about the quality of code
generated by SWIG.  Grovel through the python-dev archives for more.  If
SWIG has changed much since then, it might be worth revisiting -- but I
suspect you'd have a selling job to do to get SWIGged code past
python-dev.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
Don't hate yourself in the morning -- sleep till noon.

From casey at zope.com  Tue Dec  2 23:03:03 2003
From: casey at zope.com (Casey Duncan)
Date: Tue Dec  2 23:05:20 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031203032803.GA2473@cthulhu.gerg.ca>
References: <Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice> <E1AQY9A-0006Wp-00@giles>
	<Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
	<20031202091652.18f2daea.casey@zope.com>
	<20031203032803.GA2473@cthulhu.gerg.ca>
Message-ID: <20031202230303.0052c52e.casey@zope.com>

On Tue, 2 Dec 2003 22:28:03 -0500
Greg Ward <gward@python.net> wrote:

> On 02 December 2003, Casey Duncan said:
> > I volunteer for phase 1. Actually I will do a phase 0 first which will
> > just be stupid wrapper that exposes the API and nothing else. From
> > there we can discuss what needs to be done to complete phase 1.
> > 
> > This looks like a good job for SWIG, does anyone oppose using it?
> 
> Note that the current Berkeley DB wrapper did not get into the standard
> library until AMK rewrote it from hand with no hint of SWIG.  (And even
> then, it took a year or two before the bsddb in 2.3 got in.)

And it still seems to break often due to the API instabilities of bsddb itself. Oh well.

> As I recall, there were Serious Reservations about the quality of code
> generated by SWIG.  Grovel through the python-dev archives for more.  If
> SWIG has changed much since then, it might be worth revisiting -- but I
> suspect you'd have a selling job to do to get SWIGged code past
> python-dev.

Yup, I have reservations of my own about it. I definitely don't want to do it by hand (and maintain it) if it will see little use, so I think we should discuss a bit more exactly what our needs are.

From what I understand we want a DOM parser for real-world (aka broken) HTML code. From what I can see, tidylib will (or at least aspires to) do this. I think some testing is in order, now if only I could find some broken HTML code... ;^)

Now the DOM api from tidylib is not W3C compliant. If we were to use tidylib as a base for some new HTML DOM parser, would we desire a W3C compliant api? As much as I want to say no, it would probably help its credibility in terms of becoming part of the stdlib.

OTOH, if anyone has a better idea, I'm all ears. What kind of api do people want?

So a revised plan A will be to vet tidylib as the solution to the HTML parser problem. I will do this, but can anyone already speak more specifically about their experiences good and bad?

-Casey


From aquarius-lists at kryogenix.org  Wed Dec  3 05:01:34 2003
From: aquarius-lists at kryogenix.org (Stuart Langridge)
Date: Wed Dec  3 04:59:12 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles> <3FCCB8B7.8070102@bath.ac.uk>
	<E1ARD9O-0002Ev-00@giles> <Pine.LNX.4.58.0312021936030.1391@alice>
Message-ID: <E1ARTp8-0008T1-00@giles>

John J Lee spoo'd forth:
> On Tue, 2 Dec 2003, Stuart Langridge wrote:
>> Simon Willison spoo'd forth:
>> > Is there any way we could get a DOM tree from invalid HTML using pure
>> > Python tools? The HTML tools in the Python standard library at the
>> Presumably we could (the existing things, like HtmlLib or microdom do
>> it);
> 
> No, they don't.  There's a whole wonderful world <wink> of invalid HTML
> out there, that sgmllib and xml.dom.ext.reader.HtmlLib know nothing about.

Really? What sort of thing do they fail to parse?

sil

-- 
If hard data were the filtering criterion you could fit the entire
contents of the Internet on a floppy disk.
	   -- Cecil Adams

From jjl at pobox.com  Wed Dec  3 09:20:02 2003
From: jjl at pobox.com (John J Lee)
Date: Wed Dec  3 09:20:34 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <E1ARTp8-0008T1-00@giles>
References: <CGECIJPNNHIFAJKHOLMAOEMPJIAA.sholden@holdenweb.com>
	<E1ARBuw-0001mW-00@giles> <3FCCB8B7.8070102@bath.ac.uk>
	<E1ARD9O-0002Ev-00@giles>
	<Pine.LNX.4.58.0312021936030.1391@alice> <E1ARTp8-0008T1-00@giles>
Message-ID: <Pine.LNX.4.58.0312031322550.423@alice>

On Wed, 3 Dec 2003, Stuart Langridge wrote:
> John J Lee spoo'd forth:
> > On Tue, 2 Dec 2003, Stuart Langridge wrote:
> >> Simon Willison spoo'd forth:
> >> > Is there any way we could get a DOM tree from invalid HTML using pure
> >> > Python tools? The HTML tools in the Python standard library at the
> >> Presumably we could (the existing things, like HtmlLib or microdom do
> >> it);
> >
> > No, they don't.  There's a whole wonderful world <wink> of invalid HTML
> > out there, that sgmllib and xml.dom.ext.reader.HtmlLib know nothing about.
>
> Really? What sort of thing do they fail to parse?

Hmm, I thought microdom used tidylib, but it seems not.  Haven't tried
that yet.  The problem is that tidylib has had a lot of input over many
years from people reporting bugs (where "bug" is very widely defined to
include failing to understand all kinds of bad HTML that one wouldn't
imagine people would write or browsers would put up with).  microdom
hasn't.  But maybe it works well enough.  It's not a full DOM
implementation, though.

BTW, I had thought of tidylib simply as a way of transforming HTML into
valid HTML or XHTML, not as a DOM implementation.  You could just have a
single tidy() function (like mxTidy, IIRC).

Here's some valid HTML that xml.dom.ext.reader.HtmlLib (from PyXML, and
based on sgmlop) fails to parse.

#!/usr/bin/env python

# Example from Martin v. Loewis (PyXML SF bug 409605).
# The missing optional <body> tag is not inferred.
good_html = """
 <html>
 <P>I prefer (all things being equal)
 regularity/orthogonality and logical
 syntax/semantics in a language because there is less to
 have to remember.
 (Of course I <em>know</em> all things are NEVER really
 equal!)
 <P CLASS=source>Guido van Rossum, 6 Dec 91
 <P>The details of that silly code are irrelevant.
 <P CLASS=source>Tim Peters, 4 Mar 92
 &amp; &lt; &gt; &eacute; &ouml; &nbsp;
 </html>
 """

from xml.dom.ext.reader.HtmlLib import FromHtml
from xml.dom.ext import XHtmlPrettyPrint

dom = FromHtml(good_html)
XHtmlPrettyPrint(dom)


That could be fixed.  Nobody has, probably because there are better XML
DOM parsers.

IIRC HTMLParser still doesn't handle CDATA properly (this one has annoyed
a lot of people, but I don't think anybody has fixed it yet).

For invalid HTML, it's true that badly-matched tags tend to work OK with
HTMLParser, but of course that just gives you "bad callbacks" instead of
bad HTML, if you get what I mean -- if you want to build a DOM out of
that, for example, good luck.  I suppose this is really the most important
issue.
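
To show what "bad callbacks" means in practice, here is a small
sketch. It assumes current Python's html.parser (the descendant of the
HTMLParser module); the EventLog class is invented for the example and
simply records each callback in order.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record parser callbacks in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


log = EventLog()
log.feed("<b><i>text</b></i>")  # end tags arrive in the wrong order
log.close()
print(log.events)
```

The parser reports the end tags exactly as written, </b> before </i>,
with no attempt to rebalance them. Whatever consumes the events has to
decide what tree that was supposed to be.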

Browsers seem to be full of code to parse or ignore the weirdest stuff
that even the underlying parsers (HTMLParser, etc.) choke on: I've seen
things that look like SGML declarations <!...> but didn't even seem to be
valid SGML, let alone HTML (but I don't know SGML).


John

From jjl at pobox.com  Wed Dec  3 09:23:00 2003
From: jjl at pobox.com (John J Lee)
Date: Wed Dec  3 09:23:23 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031202230303.0052c52e.casey@zope.com>
References: <Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk>
	<E1AQOMM-0001A4-00@giles> <Pine.LNX.4.58.0311301418020.568@alice>
	<E1AQY9A-0006Wp-00@giles> <Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
	<20031202091652.18f2daea.casey@zope.com>
	<20031203032803.GA2473@cthulhu.gerg.ca>
	<20031202230303.0052c52e.casey@zope.com>
Message-ID: <Pine.LNX.4.58.0312031420390.423@alice>

On Tue, 2 Dec 2003, Casey Duncan wrote:
[...]
> OTOH, if anyone has a better idea, I'm all ears. What kind of api do people want?
[...]

from tidy import tidy
xhtml = tidy(html)


...plus some optional args.  mxTidy does this, more-or-less, I think (but
is based on the old HTMLTidy, not tidylib, of course).
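
A function with that shape can be sketched without any C wrapper by
shelling out to the command-line tidy binary. This is a sketch under
assumptions only: it presumes a tidy executable is installed and on
PATH, the executable parameter is made up for the example, and it is
written against current Python's subprocess module. It says nothing
about how mxTidy or a tidylib binding works internally.

```python
import shutil
import subprocess


def tidy(html, executable="tidy"):
    """Return `html` converted to XHTML by an external tidy binary."""
    if shutil.which(executable) is None:
        raise RuntimeError("no %r binary found on PATH" % executable)
    result = subprocess.run(
        [executable, "-quiet", "-asxhtml", "-utf8",
         "--show-warnings", "no", "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 when it issued warnings and 2 when it found
    # errors; with --force-output it still writes its best effort.
    if result.returncode not in (0, 1, 2):
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

Usage would then be xhtml = tidy(html), with the "optional args"
presumably mapping onto tidy's configuration options.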


John

From casey at zope.com  Wed Dec  3 10:08:59 2003
From: casey at zope.com (Casey Duncan)
Date: Wed Dec  3 10:12:26 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <Pine.LNX.4.58.0312031420390.423@alice>
References: <Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk> <E1AQOMM-0001A4-00@giles>
	<Pine.LNX.4.58.0311301418020.568@alice> <E1AQY9A-0006Wp-00@giles>
	<Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
	<20031202091652.18f2daea.casey@zope.com>
	<20031203032803.GA2473@cthulhu.gerg.ca>
	<20031202230303.0052c52e.casey@zope.com>
	<Pine.LNX.4.58.0312031420390.423@alice>
Message-ID: <20031203100859.5c748589.casey@zope.com>

On Wed, 3 Dec 2003 14:23:00 +0000 (GMT)
John J Lee <jjl@pobox.com> wrote:

> On Tue, 2 Dec 2003, Casey Duncan wrote:
> [...]
> > OTOH, if anyone has a better idea, I'm all ears. What kind of api do people want?
> [...]
> 
> from tidy import tidy
> xhtml = tidy(html)

That would be a pretty easy wrapper methinks. At first that was pretty much all I thought tidylib would do, but it exposes its object model in such a way that you could parse HTML directly to a DOM if you wanted to.

If you merely use tidy to create xhtml and then parse that, you are doing a DOM parse twice and not only is that inefficient, it's probably lossy (depending on how strict the conversion is). Cycles are cheap so I'm willing to live with inefficiency if it means forward progress in functionality. The loss part might not be so great.

So maybe the approach should be:

1. Expose the basic functionality that the tidy binary has as a python function and see how we like it. I think this is worthwhile regardless of whether it makes it into the stdlib.

2. Think about whether we want/need a direct HTML->DOM parser. And then decide how much we need it 8^)

3. Go get a beer and think about something entirely different.
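
A rough sketch of what step 1 might look like, assuming an HTML Tidy executable named ``tidy`` is on the PATH and shelling out to it instead of binding tidylib. The function name follows John's two-liner; the ``executable`` parameter and the error handling are hypothetical, and the code is written for current Python 3, not the Python 2.3 of this thread.

```python
import shutil
import subprocess


def tidy(html, executable="tidy"):
    """Return `html` converted to XHTML by an external tidy binary.

    Hypothetical wrapper: `executable` is assumed to be HTML Tidy.
    """
    if shutil.which(executable) is None:
        raise RuntimeError("tidy executable not found: %r" % executable)
    proc = subprocess.run(
        [executable, "-asxhtml", "-quiet", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if proc.returncode >= 2:
        raise ValueError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")
```

With that in place, ``xhtml = tidy(html)`` gives the API John asked for, at the cost of one process per document.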

-Casey


From jjl at pobox.com  Wed Dec  3 10:40:58 2003
From: jjl at pobox.com (John J Lee)
Date: Wed Dec  3 10:41:04 2003
Subject: [Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
In-Reply-To: <20031203100859.5c748589.casey@zope.com>
References: <Pine.LNX.4.58.0311270125360.3823@alice>
	<3FC61716.90909@bath.ac.uk>
	<E1AQOMM-0001A4-00@giles> <Pine.LNX.4.58.0311301418020.568@alice>
	<E1AQY9A-0006Wp-00@giles> <Pine.LNX.4.58.0311302036320.1486@alice>
	<20031201151833.4b9004fa.casey@zope.com>
	<Pine.LNX.4.58.0312012052380.2217@alice>
	<20031201173627.741b88f9.casey@zope.com>
	<20031202091652.18f2daea.casey@zope.com>
	<20031203032803.GA2473@cthulhu.gerg.ca>
	<20031202230303.0052c52e.casey@zope.com>
	<Pine.LNX.4.58.0312031420390.423@alice>
	<20031203100859.5c748589.casey@zope.com>
Message-ID: <Pine.LNX.4.58.0312031534040.1290@alice>

On Wed, 3 Dec 2003, Casey Duncan wrote:
> On Wed, 3 Dec 2003 14:23:00 +0000 (GMT) John J Lee <jjl@pobox.com> wrote:
[...]
> > from tidy import tidy
> > xhtml = tidy(html)
>
> That would be a pretty easy wrapper methinks. At first that was pretty
> much all I thought tidylib would do, but it exposes its object model in
> such a way that you could parse HTML directly to a DOM if you wanted to.

Loss is inevitable if you're tidying.  How could it be otherwise?

Usually you don't get huge DOMs from HTML documents, unlike XML, so that's
not a major problem -- I hope!  Marc-Andre's page talks about poor
performance from HTMLTidy due to character-based operation, but I don't
know how severe that is or whether it's been addressed in tidylib.

4DOM seems damn slow (I may be unfairly blaming 4DOM, since I'm using a
hacked version with JavaScript interpretation on top, so it could easily
be my fault, or the fault of the JS code I'm running), but of course there
are faster, more compliant implementations, so that shouldn't be a
problem.

Finally, DOM *processing* might well be faster using tidylib just as a
tidier than it would be as a DOM (especially if you wrap the tidy-DOM to
get a real, compliant, DOM).


John

From pje at telecommunity.com  Sun Dec  7 13:53:43 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Sun Dec  7 13:51:39 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
Message-ID: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>

Your comments and feedback are requested.  Thanks.


PEP: XXX
Title: Python Web Container Interface v1.0
Version: $Revision: 1.1 $
Last-Modified: $Date: 2003/12/07 13:29:50 $
Author: Phillip J. Eby <pje@telecommunity.com>
Discussions-To: Python Web-SIG <web-sig@python.org>
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 07-Dec-2003
Post-History: 07-Dec-2003


Abstract
========

This document specifies a proposed standard interface between
web applications and web application "containers" implemented
in Python, making it possible to use a variety of application
frameworks with a single container, and to use a variety of
containers with a single application.


Rationale
=========

Python currently boasts a wide variety of web frameworks,
such as Zope, Quixote, Webware, Skunkware, PSO, and Twisted --
to name just a few [1]_.

This wide range of available choices would not be a
problem, if only it weren't necessary to choose between them!
Because few Python web frameworks can interoperate in the same
process, users are generally forced to select one and only one
framework.

Making matters worse, not all frameworks support the same
launching mechanisms.  Some use an embedded webserver, others
use CGI, FastCGI, or some custom server-to-application
protocol.  But, it is quite rare for a single framework to
provide built-in support for all of these methods.

Thus, the launching mechanism, or "container", becomes a key
constraint for users selecting a web development tool.  They
are limited to the frameworks that support (or can be made to
support) their desired runtime environment.  This can narrow
the field of choices considerably.

This is a problem for framework authors as well as for
framework users.  For their framework to become popular,
the author must at least implement container mechanisms for
the most popular runtime environments.  Although container
implementation is not complex, it is tedious and sometimes
riddled with platform-specific issues.  Being able to
separate container development from framework development
would therefore benefit framework developers as well as users.

This PEP, therefore, proposes a simple and universal interface
between web "containers" and web "applications".  The proposed
interface is 100% framework neutral, and does not favor any
development style over any other.  Conformance to this interface
will permit framework-neutral containers to be developed,
independently of any application framework, and any application
framework will then be usable with any container (potentially
subject to certain environmental issues such as threading
support).

Finally, the interface also makes it potentially possible
to combine the use of multiple web framework tools in a single
application container.


Specification Overview
======================

A "container" is a mechanism for executing Python code in
response to a request made on a web server.  The mechanism
by which this occurs is specific to the container.  For example, a CGI
container would use the Common Gateway Interface, while a mod_python
container would use Apache's internal API.

An "application" is a Python object that does useful work in response
to a request made on a Web server.  A container invokes an application
by calling its ``runCGI`` method, whose signature is defined as
follows (the ``self`` argument is omitted for clarity)::

     def runCGI(input,output,errors,environ):
         pass

In other words, a container calls
``app.runCGI(input,output,errors,environ)`` to invoke the application.
The ``runCGI`` method should read from ``input``, if required, and
write its response to ``output``, using the ``environ`` dictionary
to obtain other information about the request.  Error messages or log
output may be written to ``errors``.  The return value of ``runCGI``
is ignored by the container.  The contents and format of ``input``,
``output``, and ``environ`` are defined by the Common Gateway
Interface [2]_.
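
As an illustration only, a minimal application object and a single invocation might look like the sketch below. The class name ``HelloApp`` and the use of in-memory text streams are assumptions made for the example; a real container would supply its own streams and a full CGI environment.

```python
import io


class HelloApp:
    """Hypothetical application object exposing the proposed runCGI method."""

    def runCGI(self, input, output, errors, environ):
        name = environ.get("QUERY_STRING") or "world"
        # CGI response: headers, a blank line, then the body.
        output.write("Content-Type: text/plain\r\n\r\n")
        output.write("Hello, %s!\n" % name)


# What a container would do for one request (text streams for brevity).
out, err = io.StringIO(), io.StringIO()
HelloApp().runCGI(
    io.StringIO(""), out, err,
    {"REQUEST_METHOD": "GET", "QUERY_STRING": "web-sig"},
)
```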

The application object *must* support repeated calls to
``runCGI``, as virtually all containers will make such repeated
requests.  Containers *should* trap and log exceptions raised by
applications, and *may* continue to execute, or attempt to shut down
gracefully.  Applications *should* avoid allowing exceptions to
escape their ``runCGI`` method, since the precise effect of this is
container-dependent.
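
One way a container could trap and log application exceptions is sketched below. The helper name ``invoke`` and its boolean return value are hypothetical; the specification itself only says what containers *should* and *may* do.

```python
import io
import traceback


def invoke(app, input, output, errors, environ):
    """Hypothetical container helper: run one request, trapping failures."""
    try:
        app.runCGI(input, output, errors, environ)
        return True
    except Exception:
        # Log to the errors stream and let the caller decide whether to
        # keep serving or to shut down gracefully.
        errors.write(traceback.format_exc())
        errors.flush()
        return False


class BrokenApp:
    """Stand-in application that always fails."""

    def runCGI(self, input, output, errors, environ):
        raise ValueError("boom")


log = io.StringIO()
ok = invoke(BrokenApp(), io.StringIO(), io.StringIO(), log, {})
```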

Thread support, or lack thereof, is also container-dependent.
Containers that can run multiple requests in parallel *should* also
provide the option of running an application in a single-threaded
fashion, so that applications or frameworks that are not thread-safe
may still be used.
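
A threaded container could offer the single-threaded option with a simple lock, roughly as sketched here. ``Serialized`` and ``FragileApp`` are hypothetical names; the specification does not prescribe any particular mechanism.

```python
import threading
import time


class Serialized:
    """Hypothetical wrapper admitting one thread at a time into an app."""

    def __init__(self, app):
        self._app = app
        self._lock = threading.Lock()

    def runCGI(self, input, output, errors, environ):
        with self._lock:
            return self._app.runCGI(input, output, errors, environ)


class FragileApp:
    """Stand-in for a non-thread-safe application; tracks overlap."""

    def __init__(self):
        self.active = 0
        self.max_active = 0

    def runCGI(self, input, output, errors, environ):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)  # widen the window for overlapping calls
        self.active -= 1


fragile = FragileApp()
safe = Serialized(fragile)
workers = [
    threading.Thread(target=safe.runCGI, args=(None, None, None, {}))
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```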

This specification does not define how a container selects or
obtains an application to invoke.  These and other configuration
options are highly container-specific matters.  It is expected that
container authors will document how to configure the container to
execute a particular application object, and with what options (such
as threading options, if applicable).

Framework authors, on the other hand, will document how to create an
application object that wraps their framework's functionality.  The
user, who has chosen both the container and the application framework,
must connect the two together.  However, since both the framework and
the container now have a common interface, this should now be merely
a mechanical matter, rather than a significant engineering effort.


Specification Details
=====================

The ``input``, ``output``, and ``errors`` objects supplied to the
``runCGI`` method must be "file-like" objects, while the ``environ``
object *must* be a Python dictionary.  The ``runCGI`` method is
allowed to modify the dictionary in-place, making it easier for
authors to create simple "routing" components that forward ``runCGI``
calls to other components.

The rationale for requiring a dictionary is to maximize portability
between containers.  The alternative would be to define here some
subset of a dictionary's methods as being the standard and portable
interface.  In practice, however, most containers will probably want
to use a simple dictionary anyway, and some frameworks may end up
relying upon the fact that most containers do this.  So, in the
interest of a simple specification, and because there is little need
for a custom type here anyway, a Python dictionary is mandatory for
communicating the CGI environment.

The "file-like" objects are another matter, though.  There is much
more diversity between containers as to how these "file-like objects"
are likely to be implemented.  They may be pipes, or sockets, or
buffered asynchronous communication objects of some kind.  Therefore,
we must define the following subset of file methods that containers
are required to provide, and urge framework authors to use these,
and only these methods:

===================  =====================  ========
Method               Files                  Notes
===================  =====================  ========
``close()``          All
``read(size)``       ``input``
``readline()``       ``input``              1
``readlines(hint)``  ``input``              2
``__iter__()``       ``input``
``flush()``          ``output``,``errors``  3
``write(str)``       ``output``,``errors``
``writelines(seq)``  ``output``,``errors``
===================  =====================  ========

The semantics of each method are as documented in the Python Library
Reference, except for these notes as listed in the table above:

1. The optional "size" argument to ``readline()`` is not supported, as
    it may be complex for container authors to implement, and is not
    often used in practice.

2. Note that the ``hint`` argument to ``readlines()`` is optional for
    both caller and implementer.  The application author is free not
    to supply it, and the container author is free to ignore it.

3. Since ``output`` and ``errors`` may not be rewound, a container is
    free to forward write operations immediately, without buffering.
    In this case, the ``flush()`` method may be a no-op.  Portable
    applications, however, cannot assume that output is unbuffered
    or that ``flush()`` is a no-op.  They must call ``flush()`` if they
    need to ensure that output has in fact been written.

    Luckily, the use of ``output.flush()`` is only an issue for
    applications performing "server push" operations, since closing
    ``output`` will also flush it.  Applications writing logs or other
    output to ``errors``, however, may wish to perform a flush after
    each complete item is output, to minimize intermingling of data
    from multiple processes writing to the same log.

The methods listed in the table above *must* be supported by all
containers conforming to this specification.  Applications conforming
to this specification *must not* use any other methods or attributes
of the ``input``, ``output``, or ``errors`` objects.
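
For testing portability, an ``input`` object restricted to the methods in the table could be sketched as follows. ``PortableInput`` is a hypothetical name, and wrapping an existing stream is only one possible approach.

```python
import io


class PortableInput:
    """Hypothetical input wrapper exposing only the methods in the table."""

    def __init__(self, stream):
        self._stream = stream

    def close(self):
        self._stream.close()

    def read(self, size=-1):
        return self._stream.read(size)

    def readline(self):
        # No size argument, per note 1.
        return self._stream.readline()

    def readlines(self, hint=-1):
        # The underlying stream is free to ignore the hint, per note 2.
        return self._stream.readlines(hint)

    def __iter__(self):
        return iter(self._stream)


wrapped = PortableInput(io.StringIO("a\nb\n"))
first = wrapped.readline()
rest = list(wrapped)
```

An application that runs correctly against such a wrapper is not relying on ``seek()``, ``tell()``, or other methods outside the required subset.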


Implementation and Application Notes
====================================

Proofs-of-concept of this specification are currently available in the
PEAK application framework [3]_.  PEAK includes a CGI container and
two FastCGI containers, as well as a sample non-framework application,
and a ``peak.web`` framework application.  Together, these components
demonstrate the ability to mix and match containers and
applications/frameworks by way of the interface specified here.
(Note: the containers and applications were implemented prior to the
creation of this specification, and so should not be taken as examples
of conforming implementations at this time.)

It is expected that future versions of Python will include updated
versions of current "containers" so that they can support this
interface.  For example, the Python standard library now contains
various web server implementations, and these could be modified to
allow invoking application objects that conform to this specification.

Widespread adoption of this specification would also make it possible
to implement simple "router" applications that forward ``runCGI``
calls to other application objects, using information in the
``environ`` to determine the recipient.

Because the CGI environment variables include both URL path
information and cookies, such "router" components could be very
sophisticated, if desired.  And, they would potentially allow more
than one framework to be used in the same application, permitting
Python developers to take the best from all possible worlds.
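
A router of this kind might be sketched as below. The ``Router`` and ``Recorder`` classes are hypothetical, and dispatching on the first ``PATH_INFO`` segment is just one plausible policy; note that ``environ`` is modified in place, which the specification explicitly permits.

```python
class Router:
    """Hypothetical router: dispatch on the first PATH_INFO segment."""

    def __init__(self, routes, default):
        self.routes = routes      # e.g. {"blog": blog_app}
        self.default = default

    def runCGI(self, input, output, errors, environ):
        path = environ.get("PATH_INFO", "")
        head, _, rest = path.lstrip("/").partition("/")
        app = self.routes.get(head)
        if app is None:
            return self.default.runCGI(input, output, errors, environ)
        # Shift the consumed segment from PATH_INFO to SCRIPT_NAME, in place.
        environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + "/" + head
        environ["PATH_INFO"] = "/" + rest if rest else ""
        return app.runCGI(input, output, errors, environ)


class Recorder:
    """Stand-in application that remembers the environ it was given."""

    def runCGI(self, input, output, errors, environ):
        self.seen = dict(environ)


blog, fallback = Recorder(), Recorder()
router = Router({"blog": blog}, default=fallback)
router.runCGI(None, None, None,
              {"SCRIPT_NAME": "", "PATH_INFO": "/blog/2003/12"})
```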

For load balancing and remote processing, it would also be possible
to write "bridge" applications, that forward a ``runCGI`` call over
a network.  Or, to add CGI capability to a Python webserver, one might
write a bridge that simply invoked another process in response to
``runCGI``.  Such bridges would again be usable in any container
conforming to this specification.
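
The second kind of bridge, one that invokes another process, could look roughly like this. ``ProcessBridge`` is a hypothetical name; the sketch assumes byte streams and merges the request variables into the inherited process environment so that the child interpreter can still start.

```python
import io
import os
import subprocess
import sys


class ProcessBridge:
    """Hypothetical bridge: serve each request by running a CGI program."""

    def __init__(self, argv):
        self.argv = list(argv)

    def runCGI(self, input, output, errors, environ):
        env = dict(os.environ)
        env.update(environ)  # request variables win over inherited ones
        proc = subprocess.run(
            self.argv,
            input=input.read(),
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        output.write(proc.stdout)
        errors.write(proc.stderr)


# Usage: the child is a one-line CGI script run by the current interpreter.
child = ("import os, sys; "
         "sys.stdout.write(os.environ['REQUEST_METHOD'] + ':' + sys.stdin.read())")
out, err = io.BytesIO(), io.BytesIO()
ProcessBridge([sys.executable, "-c", child]).runCGI(
    io.BytesIO(b"body"), out, err, {"REQUEST_METHOD": "POST"})
```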


References
==========

.. [1] The Python Wiki "Web Programming" topic
    (http://www.python.org/cgi-bin/moinmoin/WebProgramming)

.. [2] The Common Gateway Interface Specification
    (http://hoohoo.ncsa.uiuc.edu/cgi/interface.html)

.. [3] PEAK: The Python Enterprise Application Kit
    (http://peak.telecommunity.com/)


Copyright
=========

This document has been placed in the public domain.


..
    Local Variables:
    mode: indented-text
    indent-tabs-mode: nil
    sentence-end-double-space: t
    fill-column: 70
    End:


From amk at amk.ca  Sun Dec  7 16:35:21 2003
From: amk at amk.ca (A.M. Kuchling)
Date: Sun Dec  7 16:35:46 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
Message-ID: <20031207213521.GB19481@rogue.amk.ca>

On Sun, Dec 07, 2003 at 01:53:43PM -0500, Phillip J. Eby wrote:
> to a request made on a Web server.  A container invokes an application
> by calling its ``runCGI`` method, whose signature is defined as

Name nit: why include the irrelevant 'CGI' in the name?  Just 'run()' would
be fine.

> Containers that can run multiple requests in parallel, *should* also
> provide the option of running an application in a single-threaded
> fashion, so that applications or frameworks that are not thread-safe
> may still be used.

Should there also be an is_thread_safe() method that returns a Boolean,
so containers can serialize if necessary?

> The rationale for requiring a dictionary is to maximize portability
> between containers.  The alternative would be to define here some
> subset of a dictionary's methods as being the standard and portable
> interface.  In practice, however, most containers will probably want

Note that the UserDict.DictMixin class implements all of the other
dictionary methods as long as you implement __getitem__, __setitem__,
__delitem__, and keys().  It seems unpythonic to require a particular class
here.
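
A sketch of the mixin approach in current Python 3, where the successor to UserDict.DictMixin is collections.abc.MutableMapping (it asks for __iter__ and __len__ instead of keys()). The ``Environ`` class is hypothetical; it shows an environ that behaves like a mapping without being a dict.

```python
from collections.abc import MutableMapping


class Environ(MutableMapping):
    """Hypothetical environ: a full mapping that is not a dict subclass."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


env = Environ({"REQUEST_METHOD": "GET"})
env.update(PATH_INFO="/")  # get(), update(), items() etc. come from the mixin
```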

The spec looks very good, though -- simple, easy to implement, and useful.

--amk

From pje at telecommunity.com  Sun Dec  7 19:05:26 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Sun Dec  7 19:03:22 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031207213521.GB19481@rogue.amk.ca>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
Message-ID: <5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>

At 04:35 PM 12/7/03 -0500, A.M. Kuchling wrote:
>On Sun, Dec 07, 2003 at 01:53:43PM -0500, Phillip J. Eby wrote:
> > to a request made on a Web server.  A container invokes an application
> > by calling its ``runCGI`` method, whose signature is defined as
>
>Name nit: why include the irrelevant 'CGI' in the name?  Just 'run()' would
>be fine.

Well, if you're going to go that route, why not just make it a callable?  :)

My thought here was that many kinds of Python frameworks have objects with 
'run' methods, and they all have different signatures.  So, explicit being 
better than implicit, I chose a name that was midway between a nameless 
callable and, say, 'executeWebRequest'.  :)

I'm not too strongly attached to the name, but would like to keep it a bit 
more explicit than 'run()' or a bare callable.


> > Containers that can run multiple requests in parallel, *should* also
> > provide the option of running an application in a single-threaded
> > fashion, so that applications or frameworks that are not thread-safe
> > may still be used.
>
>Should there also be a is_thread_safe() method that returns a Boolean,
>so containers can serialize if necessary?

I thought about it.  But there are going to be more applications than 
containers, so why put extra burden on the app side to benefit the few 
containers that will be threaded?  My conclusion (which others might not 
share) was that such containers are going to need other per-app 
configuration settings anyway, like perhaps the path at which the app is 
located, how many threads maximum to use in a thread pool for that app, and 
of course how to get the app object in the first place.  Thus, there's 
little added burden for the container to require explicit configuration for 
threadedness.  It's also possible that what constitutes thread-safety might 
vary somewhat from container to container.

Second, if container configuration becomes complex, there's always the 
possibility to go back and create some kind of "deployment descriptor" 
spec, to make apps deployable in a variety of containers.  But I think that 
should wait until there's enough field experience with *this* spec, to know 
what's really needed for the deployment spec.

And last, but far from least, the more things there are in the spec, the 
more things there are for people to disagree with or have different 
interpretations of.  :)


> > The rationale for requiring a dictionary is to maximize portability
> > between containers.  The alternative would be to define here some
> > subset of a dictionary's methods as being the standard and portable
> > interface.  In practice, however, most containers will probably want
>
>Note that the UserDict.DictMixin class implements all of the other
>dictionary methods as long as you implement __getitem__, __setitem__,
>__delitem__, and keys().  It seems unpythonic to require a particular class
>here.

Maybe I'm overreacting to being burned by imperfect dictionary simulations 
in the past.  OTOH, I noticed you haven't actually given a use case for 
*not* using a dictionary.  :)

However, there is ample precedent in Python for requiring at least a 
*subclass* of dictionary, and perhaps we could compromise there.


>The spec looks very good, though -- simple, easy to implement, and useful.

Thanks.  I've often found this "plumbing" issue to be quite annoying.  I 
know that I personally would likely experiment with more web app 
frameworks, if I knew that I could plug them into a container I was already 
familiar with.  And, I recently finished developing a very nice 
multiprocess FastCGI container which I expect to be my main runtime 
environment for web applications in future.  I don't want it to only be 
useful for myself and other PEAK users, though.  Hence, the spec.


From stuart at stuartbishop.net  Sun Dec  7 22:20:28 2003
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Sun Dec  7 22:21:11 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
Message-ID: <7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On 08/12/2003, at 5:53 AM, Phillip J. Eby wrote:

> In other words, a container calls
> ``app.runCGI(input,output,errors,environ)`` to invoke the application.
> The ``runCGI`` method should read from ``input``, if required, and
> write its response to ``output``, using the ``environ`` dictionary
> to obtain other information about the request.  Error messages or log
> output may be written to ``errors``.  The return value of ``runCGI``
> is ignored by the container.  The contents and format of ``input``,
> ``output``, and ``environ`` are defined by the Common Gateway
> Interface [2]_.

Should environ['REMOTE_USER'] return '', None, or raise a KeyError if
the web server has performed no authentication on a request? Some keys
should always have valid values available (REQUEST_METHOD), but others
only for some requests (CONTENT_LENGTH, REMOTE_USER). We don't want
applications raising KeyError exceptions when moved to different
frameworks because of frameworks handling this differently. +1 for using
None for missing/meaningless values, and for guaranteeing that accessing
any variable defined at http://hoohoo.ncsa.uiuc.edu/cgi/env.html will
never raise a KeyError.

I don't think errors should be a file - we now have a logging package
so we might as well use it. We could pass in a Logger instance, although
I'd just scrap the argument and let the handler instantiate the Logger
if it wants one. The container could define a Handler that sends the
log messages to the 'standard' location (eg. CGI's Handler would just
be a StreamHandler that uses sys.stderr).
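
A sketch of that idea using the standard logging package. The helper name ``make_logger`` and the log format are hypothetical; a CGI container would pass ``sys.stderr`` as the stream.

```python
import io
import logging


def make_logger(stream, name="webapp"):
    """Hypothetical helper: a Logger whose output goes to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep messages out of the root logger
    return logger


buf = io.StringIO()
log = make_logger(buf, name="websig-example")
log.error("bad %s", "input")
```

Calling the helper twice with the same name would attach two handlers, so a container would do this once at startup.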

> Thread support, or lack thereof, is also container-dependent.
> Containers that can run multiple requests in parallel, *should* also
> provide the option of running an application in a single-threaded
> fashion, so that applications or frameworks that are not thread-safe
> may still be used.

A thread_safety method should be provided by the application. It should
be specified only once, rather than in every container that invokes the
application. The thread_safety value might be generated programmatically,
eg. by querying a DB-API database Connection's thread_safety attribute.
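
A sketch of how that could work. Note that DB-API 2.0 spells the setting ``threadsafety`` and defines it on the driver module, not on the Connection; the ``DbApp`` class and the ``thread_safety`` method name are hypothetical, and sqlite3 is used only because it ships with Python.

```python
import sqlite3


class DbApp:
    """Hypothetical application that reports how thread-safe it is."""

    def __init__(self, dbapi_module):
        self._db = dbapi_module

    def thread_safety(self):
        # DB-API 2.0 levels: 0 = threads may not share the module,
        # 1 = module only, 2 = module and connections, 3 = cursors too.
        return self._db.threadsafety

    def runCGI(self, input, output, errors, environ):
        pass  # request handling omitted


level = DbApp(sqlite3).thread_safety()
```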

> The rationale for requiring a dictionary is to maximize portability
> between containers.  The alternative would be to define here some
> subset of a dictionary's methods as being the standard and portable
> interface.  In practice, however, most containers will probably want
> to use a simple dictionary anyway, and some frameworks may end up
> relying upon the fact that most containers do this.  So, in the
> interest of a simple specification, and because there is little need
> for a custom type here anyway, a Python dictionary is mandatory for
> communicating the CGI environment.

Or just 'environment should be a standard mapping, or subclass of
``map`` or ``UserDict``'.


- --  
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)

iD8DBQE/0+4DAfqZj7rGN0oRAlphAJ9wEqZt835o4IDl2QjnBvTVT8X2BwCePFCG
qMqU+BCwk8aZKMNKBt5Qc3M=
=yAPk
-----END PGP SIGNATURE-----


From ngps at netmemetic.com  Sun Dec  7 22:36:40 2003
From: ngps at netmemetic.com (Ng Pheng Siong)
Date: Sun Dec  7 22:36:32 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
Message-ID: <20031208033640.GA825@vista.netmemetic.com>

On Mon, Dec 08, 2003 at 02:20:28PM +1100, Stuart Bishop wrote:
> Should environ['REMOTE_USER'] return '', None, or raise a KeyError if 
> the
> web server has performed no authentication on a request? 

+1 for None.

Zope is able to use REMOTE_USER if the web server sets it, e.g.,

- ZServerSSL sets it to the client certificate's subject DN when available
  and asked to.

- The RemoteUserFolder product was originally written to allow IIS to
  do Windows authentication.

> accessing any variable defined at
> http://hoohoo.ncsa.uiuc.edu/cgi/env.html will never raise a KeyError.

For HTTPS there are a bunch of additional variables. I suppose most people
might consider mod_ssl's list canonical; I looked at it and copped out:
ZServerSSL exports only SSL_CIPHER for now.


-- 
Ng Pheng Siong <ngps@netmemetic.com> 

http://firewall.rulemaker.net     -+- All Your Rulebase Are Belong To You[tm]
http://sandbox.rulemaker.net/ngps -+- Open Source Python Crypto & SSL

From stuart at stuartbishop.net  Sun Dec  7 22:50:22 2003
From: stuart at stuartbishop.net (Stuart Bishop)
Date: Sun Dec  7 22:50:53 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
Message-ID: <A636879C-2931-11D8-A22F-000A95A06FC6@stuartbishop.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On 08/12/2003, at 11:05 AM, Phillip J. Eby wrote:

> At 04:35 PM 12/7/03 -0500, A.M. Kuchling wrote:
>> On Sun, Dec 07, 2003 at 01:53:43PM -0500, Phillip J. Eby wrote:
>> > to a request made on a Web server.  A container invokes an application
>> > by calling its ``runCGI`` method, whose signature is defined as
>>
>> Name nit: why include the irrelevant 'CGI' in the name?  Just 'run()'
>> would be fine.
>
> Well, if you're going to go that route, why not just make it a 
> callable?  :)

Callable or something simpler and more obvious like 'run()' is good if
you expect objects to only talk to one protocol. runCGI is ok (I'd prefer
handle_CGI...) if you think a single object might also want to handle
other protocols (XMLRPC, FTP), as defined by future PEPs.

> I thought about it.  But there are going to be more applications than 
> containers, so why put extra burden on the app side to benefit the few 
> containers that will be threaded?  My conclusion (which others might 
> not share) was that such containers are going to need other per-app 
> configuration settings anyway, like perhaps the path at which the app 
> is located, how many threads maximum to use in a thread pool for that 
> app, and of course how to get the app object in the first place.  
> Thus, there's little added burden for the container to require 
> explicit configuration for threadedness.  It's also possible that what 
> constitutes thread-safety might vary somewhat from container to 
> container.

Although there will be more applications than containers, I doubt that
there will be many that actually implement the Web Container Interface -
sane people will simply subclass StandardWebContainer (to be defined),
since sane people generally don't want to rewrite header formatting,
response buffering, cookie decoding/encoding, POST and QUERY_STRING
decoding, gzip compression, i18n etc.

> And last, but far from least, the more things there are in the spec, 
> the more things there are for people to disagree with or have 
> different interpretations of.  :)

I think it is good to define a bare interface between request brokers
and applications, and CGI is a good common denominator to work from.
The real arguing will be from wanting to have python ship with
a higher level interface implementing this specification. I'm
sure cookies, response headers, streaming & buffering, QUERY_STRING
and POST decoding can all be agreed on without bloodshed, but getting
people to agree that standalone Zope Page Templates should go in too
might be more difficult :-)

- --  
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)

iD8DBQE/0/T+AfqZj7rGN0oRArAvAKCZ3FLT/kcdF7sKAYWd6e0C8+w8nACdFRw1
0kKa88u1VA8f110rJei6KPQ=
=YCkJ
-----END PGP SIGNATURE-----


From amk at amk.ca  Mon Dec  8 06:38:06 2003
From: amk at amk.ca (A.M. Kuchling)
Date: Mon Dec  8 06:38:32 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
Message-ID: <20031208113806.GB2689@rogue.amk.ca>

On Sun, Dec 07, 2003 at 07:05:26PM -0500, Phillip J. Eby wrote:
> Maybe I'm overreacting to being burned by imperfect dictionary simulations 
> in the past.  OTOH, I noticed you haven't actually given a use case for 
> *not* using a dictionary.  :)

os.environ is not a dictionary (nor a subclass of dict), so the simplest CGI
case would be runCGI(sys.stdin, sys.stdout, sys.stderr, os.environ.copy()).
Seems silly.
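
Spelled out as a sketch in current Python 3, where os.environ is still a mapping that is not a dict and copy() still returns a plain dict. The names ``run_as_cgi`` and ``Probe`` are hypothetical.

```python
import os
import sys


def run_as_cgi(app):
    """Hypothetical one-shot CGI container for an object with runCGI."""
    # os.environ is a mapping but not a dict; copy() yields a plain dict.
    environ = os.environ.copy()
    app.runCGI(sys.stdin, sys.stdout, sys.stderr, environ)


class Probe:
    """Stand-in application that records the environ it receives."""

    def runCGI(self, input, output, errors, environ):
        self.environ = environ


probe = Probe()
run_as_cgi(probe)
```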

--amk

From pje at telecommunity.com  Mon Dec  8 09:55:02 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec  8 09:53:02 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
Message-ID: <5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>

At 02:20 PM 12/8/03 +1100, Stuart Bishop wrote:

>Should environ['REMOTE_USER'] return '', None, or raise a KeyError if the
>web server has performed no authentication on a request? Some keys
>should always have valid values available (REQUEST_METHOD), but others
>only for some requests (CONTENT_LENGTH, REMOTE_USER). We don't want
>applications raising KeyError exceptions when moved to different
>frameworks because of frameworks handling this differently. +1 for using
>None for missing/meaningless value, and accessing any variable defined at
>http://hoohoo.ncsa.uiuc.edu/cgi/env.html will never raise a KeyError.

I'm -1 on it.  This interface is intended to support *existing* application 
frameworks with minimal glue.  For example, I've successfully run both Zope 
2 ZPublisher and Zope 3 zope.publisher under this gateway 
interface.  Putting in 'None' where a sane CGI environment lacks the 
variable is asking for trouble.
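
For what it's worth, an application that wants to be portable can cope with absent keys on its own, along these lines. The helper names are hypothetical; the point is only that ``environ`` values stay strings and optional variables are looked up with ``get()``.

```python
def remote_user(environ):
    """Return the authenticated user name, or None if the server set none."""
    # The key is simply absent; no None value is ever stored in environ.
    return environ.get("REMOTE_USER")


def content_length(environ):
    """Return the request body length, treating absent or empty as zero."""
    value = environ.get("CONTENT_LENGTH", "")
    return int(value) if value else 0
```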


>I don't think errors should be a file - we now have a logging package
>so we might as well use it. We could pass in a Logger instance, although
>I'd just scrap the argument and let the handler instantiate the Logger
>if it wants one. The container could define a Handler that sends the log 
>messages to the 'standard' location (eg. CGI's Handler would just be a
>StreamHandler that uses sys.stderr).

"errors" is intended to allow access to the web server's error log, as 
FastCGI and other protocols permit.  There are times when it is very useful 
to see application errors in the same context as server errors, so this is 
included for completeness.  A container is free to provide a different 
destination for the errors stream.

Also, this again is for the greatest possible compatibility with existing 
applications and containers.


>A thread_safety method should be provided by the application. It should
>be specified only once, rather than in every container that invokes the
>application. The thread_safety value might be generated programmatically, eg.
>by querying a DB-API database Connection's thread_safety attribute.

AFAIK, there are only maybe 2 or 3 threaded containers currently available, 
and I don't believe any of them have an option *not* to run threading.  So, 
this seems like a YAGNI to me.  I would prefer there to be actual field 
experience with the minimal spec, in order to decide what kind of threading 
categories would be appropriate.

For example, suppose that a threaded container wishes to configure, instead 
of one application object, a factory for returning new application objects, 
so that there is no threading problem?  I think that a premature attempt to 
define threading models in advance of experience/experimentation would not 
only hold up delivery of a usable spec, but could also close off fruitful 
lines of experimentation for container developers.  I'm similarly concerned 
about other forms of deployment parameterization.


>Or just 'environment should be a standard mapping, or subclass of ``map``
>or ``UserDict``'.

Dunno why everyone feels so strongly about that one, but if that's what it 
takes to get through, then perhaps we can decide on a small set of required 
methods.

Or, maybe we could simply require that environ.copy() must always *return* 
a dictionary, and then portable apps would only use the copy.  :)


From pje at telecommunity.com  Mon Dec  8 09:57:34 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec  8 09:55:32 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208033640.GA825@vista.netmemetic.com>
References: <7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
Message-ID: <5.1.0.14.0.20031208095517.01e63020@mail.telecommunity.com>

At 11:36 AM 12/8/03 +0800, Ng Pheng Siong wrote:
>On Mon, Dec 08, 2003 at 02:20:28PM +1100, Stuart Bishop wrote:
> > Should environ['REMOTE_USER'] return '', None, or raise a KeyError if the
> > web server has performed no authentication on a request?
>
>+1 for None.
>
>Zope is able to use REMOTE_USER if the web server sets it, e.g.,

But what does it do if it's set to 'None'?  And even if it's happy with 
this, will the fifty or so other existing application frameworks be happy 
with it?

Compatibility with the vast existing app framework code base demands that 
environ values *must* be strings, or else not present.  (Guess I should add 
that to the spec.)
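
A minimal sketch of that rule from the container's side, assuming the
container starts from some raw mapping in which unset variables show up
as None. The function name build_environ is hypothetical.

```python
def build_environ(raw):
    """Return a plain dict of CGI variables; unset ones are omitted."""
    environ = {}
    for key, value in raw.items():
        if value is None:
            # Leave the key out entirely rather than storing None or ''.
            continue
        # Every value that is present is stored as a string.
        environ[key] = str(value)
    return environ
```

With this shape, existing code that tests for a key's presence sees the
same thing it would see under an ordinary CGI environment.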


From amk at amk.ca  Mon Dec  8 10:05:12 2003
From: amk at amk.ca (A.M. Kuchling)
Date: Mon Dec  8 10:05:36 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
References: <5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
Message-ID: <20031208150512.GA3373@rogue.amk.ca>

On Mon, Dec 08, 2003 at 09:55:02AM -0500, Phillip J. Eby wrote:
> interface.  Putting in 'None' where a sane CGI environment lacks the 
> variable is asking for trouble.

Agreed; leave the environment alone, and leave stderr as a file.  If we
start defining logger objects, we're now building yet another framework.

Bonus: most frameworks probably have a method matching this signature already.
For example, in Quixote you could just add a 'runCGI = publish' assignment
to the Publisher class and voila, it's now compatible.
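
The kind of one-line adaptation described above might look like the
following. The Publisher class here is a hypothetical stand-in, not
Quixote's actual class, and the argument order is taken from the
runCGI(stdin, stdout, stderr, environ) call shown elsewhere in this
thread.

```python
class Publisher:
    """Hypothetical stand-in for a framework's existing publisher."""

    def publish(self, stdin, stdout, stderr, environ):
        # A real framework would dispatch the request here.
        stdout.write("Content-Type: text/plain\r\n\r\n")
        stdout.write("Hello from %s\n" % environ.get("SCRIPT_NAME", "/"))

    # The whole adaptation: expose the existing method under the name
    # the gateway interface expects.
    runCGI = publish
```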

> For example, suppose that a threaded container wishes to configure, instead 
> of one application object, a factory for returning new application objects, 
> so that there is no threading problem?  I think that a premature attempt to

Only the application knows if it can handle threads, though; if there's some
unthreaded global cache, creating new application objects is not going to
make everything threadsafe.  I don't use threads and think their use is
brain-damaged 95% of the time, so I don't really care if there's a
thread-safety mechanism in the spec or not.

--amk

From pje at telecommunity.com  Mon Dec  8 10:18:25 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec  8 10:16:25 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <A636879C-2931-11D8-A22F-000A95A06FC6@stuartbishop.net>
References: <5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
Message-ID: <5.1.0.14.0.20031208095809.03e80150@mail.telecommunity.com>

At 02:50 PM 12/8/03 +1100, Stuart Bishop wrote:
>Although there will be more applications than containers, I doubt that
>there will be many that actually implement the Web Container Interface -
>sane people will simply subclass StandardWebContainer (to be defined),

I presume you mean StandardWebApp, since the container is the component 
that *invokes* the proposed interface.


>since sane people generally don't want to rewrite header formatting,
>response buffering, cookie decoding/encoding, POST and QUERY_STRING
>decoding, gzip compression, i18n etc.

Right.  But, again, consider the existing fifty or so frameworks that do 
this stuff.  With the interface as specified, those framework authors can 
slap a few lines of code on top of their existing setup, and have instant 
conformance.  But, the framework author -- except in rare cases -- is 
probably *not* going to be able to specify thread compliance on behalf of 
the actual user application.  Thus, they're going to have to also design 
some way for the framework's user to specify the level of threading support 
to be flagged by the application object.  That's an unnecessary burden, 
when the container is already going to have to manage other kinds of 
configuration.

Also, consider this: only a very few containers will support 
threading.  mod_python on Apache 1.3 won't be threaded.  Most FastCGI 
implementations aren't.  CGI definitely isn't.  That pretty much leaves 
half-async webservers written in Python, like those belonging to Zope and 
Twisted.  And, it's not clear to me at this point if they will even *care* 
about this.  Even if they do, the thread pool models used by Twisted and 
Zope are probably different in interesting ways that are completely outside 
the scope of this proposal.

Easy backward compatibility is extremely important to this interface.  If 
users have to change their apps to make this work, it's not going to 
fly.  If, on the other hand, a framework developer puts a wrapper on their 
framework, then the app is portable.  What's not portable is configuration 
of the container.  It's one thing for a user to learn how to configure a 
container to run their existing app, and another thing to make them have to 
change the existing app to support a thread safety indicator.

What's more, I have the nightmare vision of an app needing to specify 
different thread safety levels for different containers, because of the way 
those containers handle different threading levels.  Explicit 
(configuration of the container) is better than implicit (funnelling a 
safety flag up from an app, through a framework to the interface, for the 
container to then interpret according to its own schema).


>>And last, but far from least, the more things there are in the spec, the 
>>more things there are for people to disagree with or have different 
>>interpretations of.  :)
>
>I think it is good to define a bare interface between request brokers
>and applications, and CGI is a good common denominator to work from.

Great.


>The real arguing will be from wanting to have python ship with
>a higher level interface implementing this specification.

You mean on the application side, I presume?  Containers in the standard 
library (or adapters from the existing containers to allow invocation of 
conforming apps) should be non-controversial.


>I'm
>sure cookies, response headers, streaming & buffering, QUERY_STRING
>and POST decoding can all be agreed on without bloodshed, but getting
>people to agree that standalone Zope Page Templates should go in too
>might be more difficult :-)

I'd assume that a trivial "CGIApp" class would be written so as to simply 
create a cgi.FieldStorage, and invoke an abstract method.  Anything more 
than that would be encroaching on highly disputed and disputable 
territory.  :)  (And perhaps not really needed, anyway.)
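
A sketch of what such a trivial class could look like. The message
suggests cgi.FieldStorage; this sketch swaps in urllib.parse.parse_qs
for QUERY_STRING only, because the cgi module is no longer in the
standard library as of Python 3.13. The method name handle() and the
Hello subclass are assumptions made for illustration.

```python
from urllib.parse import parse_qs


class CGIApp:
    """Minimal base class: parse the request, then call handle()."""

    def runCGI(self, stdin, stdout, stderr, environ):
        # Only QUERY_STRING is decoded here; POST bodies are left alone.
        fields = parse_qs(environ.get("QUERY_STRING", ""))
        self.handle(fields, stdin, stdout, stderr, environ)

    def handle(self, fields, stdin, stdout, stderr, environ):
        raise NotImplementedError("subclasses must implement handle()")


class Hello(CGIApp):
    def handle(self, fields, stdin, stdout, stderr, environ):
        # parse_qs maps each name to a list of values.
        name = fields.get("name", ["world"])[0]
        stdout.write("Content-Type: text/plain\r\n\r\n")
        stdout.write("Hello, %s\n" % name)
```

Anything beyond this (cookies, templating, sessions) is deliberately
left to the frameworks.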

But I care almost nothing about the stdlib effects of this proposal for the 
moment.  Anything that happens won't happen until 2.4.  But, if this 
becomes the "community standard" interface *now*, then framework developers 
can start splitting their container code from their framework code, and 
expand the reach of both their containers and their frameworks.  And they 
can do it with existing code, today, on older Pythons.  That, I think, is 
something worth working towards.


From pje at telecommunity.com  Mon Dec  8 10:29:40 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec  8 10:27:39 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208113806.GB2689@rogue.amk.ca>
References: <5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207184325.03123440@mail.telecommunity.com>
Message-ID: <5.1.0.14.0.20031208102419.03707ec0@mail.telecommunity.com>

At 06:38 AM 12/8/03 -0500, A.M. Kuchling wrote:
>On Sun, Dec 07, 2003 at 07:05:26PM -0500, Phillip J. Eby wrote:
> > Maybe I'm overreacting to being burned by imperfect dictionary simulations
> > in the past.  OTOH, I noticed you haven't actually given a use case for
> > *not* using a dictionary.  :)
>
>os.environ is not a dictionary (nor a subclass of dict), so the simplest CGI
>case would be runCGI(sys.stdin, sys.stdout, sys.stderr, os.environ.copy()).
>Seems silly.

The copy() in that case would arguably be necessary anyway.  Remember that 
the spec requires the caller to be allowed to *modify* environ in place.

Anyway, as per my response to Stuart, I suppose I could further compromise 
to having the spec require that the copy() method return a 
dictionary.  Then people who want to be sure their manipulations are 
portable, can simply take a copy of environ.  (Or, alternatively, .items() 
could be required, and the portable mechanism would be to use 
'dict(environ.items())'.)

But, given how simple it is for the container to use a dictionary in the 
first place, it seems silly to force every layer to do a copy "just in 
case" to be portable.  And, I think that os.environ really is the exception 
rather than the rule.  How many existing containers use non-dictionaries now?
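
The 'dict(environ.items())' fallback mentioned above can be sketched as
follows. EnvironLike is a hypothetical non-dict mapping standing in for
whatever a container might pass; portable_environ is likewise an
illustrative name.

```python
from collections.abc import Mapping


class EnvironLike(Mapping):
    """Hypothetical non-dict mapping a container might hand to an app."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def portable_environ(environ):
    """Return a real dict, whatever mapping type the container passed."""
    return dict(environ.items())
```

The copy can then be modified in place without touching the container's
own object, at the cost of one copy per layer.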


From pje at telecommunity.com  Mon Dec  8 10:35:47 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec  8 10:33:45 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208150512.GA3373@rogue.amk.ca>
References: <5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
Message-ID: <5.1.0.14.0.20031208103018.03e847b0@mail.telecommunity.com>

At 10:05 AM 12/8/03 -0500, A.M. Kuchling wrote:
>On Mon, Dec 08, 2003 at 09:55:02AM -0500, Phillip J. Eby wrote:
> > For example, suppose that a threaded container wishes to configure, instead
> > of one application object, a factory for returning new application objects,
> > so that there is no threading problem?  I think that a premature attempt to
>
>Only the application knows if it can handle threads, though; if there's some
>unthreaded global cache, creating new application objects is not going to
>make everything threadsafe.

My point is that no matter what, if you use a container, you have to 
configure it with a bunch of other facts about your application.  So, you 
might as well explicitly configure any threading-related settings in their 
*native form*.  That is, whatever threading settings the *container* has, 
whatever they might be.  Making the app or framework declare their safety 
through a narrow interface on the application object seen by the container 
incurs needlessly "lossy" transfer of information.

So, IMO, threading configuration should be part of container configuration, 
not part of the application interface.
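
One way to picture threading as container configuration, combining it
with the application-factory idea quoted above. ThreadedContainer and
its app_per_thread option are hypothetical; a real container would
expose whatever settings suit its own threading model. As noted in the
quoted text, separate application objects do nothing for global state
shared behind them.

```python
import threading


class ThreadedContainer:
    """Hypothetical container; threading policy is its own configuration."""

    def __init__(self, app_factory, app_per_thread=True):
        self.app_factory = app_factory
        self.app_per_thread = app_per_thread
        self._local = threading.local()
        # In shared mode a single application object serves every thread.
        self._shared = None if app_per_thread else app_factory()

    def get_app(self):
        if not self.app_per_thread:
            return self._shared
        # Lazily create one application object per worker thread.
        if not hasattr(self._local, "app"):
            self._local.app = self.app_factory()
        return self._local.app
```

The application object itself carries no thread-safety flag; the person
deploying it picks the mode when configuring the container.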


>   I don't use threads and think their use is
>brain-damaged 95% of the time,

+1.


>so I don't really care if there's a
>thread-safety mechanism in the spec or not.

And I actively *don't* want it, because it will interfere with the ability 
of container authors to add support for this interface, especially if they 
*do* support threading now.


From ngps at post1.com  Mon Dec  8 10:39:01 2003
From: ngps at post1.com (Ng Pheng Siong)
Date: Mon Dec  8 10:40:44 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031208095517.01e63020@mail.telecommunity.com>
References: <7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
	<5.1.0.14.0.20031208095517.01e63020@mail.telecommunity.com>
Message-ID: <20031208153901.GA367@vista.netmemetic.com>

On Mon, Dec 08, 2003 at 09:57:34AM -0500, Phillip J. Eby wrote:
> >+1 for None.
> >
> >Zope is able to use REMOTE_USER if the web server sets it, e.g.,
> 
> But what does it do if it's set to 'None'?  

Let's see...

  ~/pkg/zope262/lib/python/ZPublisher$ egrep -i remote_user *.py

  BaseRequest.py:    elif request.environ.has_key('REMOTE_USER'):
  BaseRequest.py:        name=request.environ['REMOTE_USER']
  HTTPRequest.py:        'REMOTE_USER' : 1,
  Publish.py:        if realm and not request.get('REMOTE_USER',None):

If it is None, Zope does nothing about it, I suppose.

ZServerSSL...

    def get_environment(self, request):
        env = zhttps0_handler.get_environment(self, request)
        peer = request.channel.get_peer_cert()
        if peer is not None:
            env['REMOTE_USER'] = str(peer.get_subject())
        return env

(Oh, ok, it's just a setter. I'd forgotten.)

ZServerSSL sets REMOTE_USER for RemoteUserFolder's consumption.


> Compatibility with the vast existing app framework code base demands that 
> environ values *must* be strings, or else not present.  (Guess I should add 
> that to the spec.)

Looking at RemoteUserFolder:

        name = request.environ.get('REMOTE_USER', None)
        name = self.normalizeName(name)
        #LOG('RemoteUserFolder', INFO, 'validate %s' % str(name) )        
        if name is None:
            ...

Well, the plural of `anecdote' is not `data' ;-), but it does seem to me
the following 2 styles will be dominant:

1)
    x = dict.get('XX', None)
    if x is None:
        ...

2)
    if dict.has_key('XX'):
        ...

So I'm guessing it is not terribly important whether it is '' or None.

Cheers.

-- 
Ng Pheng Siong <ngps@netmemetic.com> 

http://firewall.rulemaker.net     -+- All Your Rulebase Are Belong To You[tm]
http://sandbox.rulemaker.net/ngps -+- Open Source Python Crypto & SSL

From amk at amk.ca  Mon Dec  8 11:18:00 2003
From: amk at amk.ca (A.M. Kuchling)
Date: Mon Dec  8 11:18:25 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <5.1.0.14.0.20031208103018.03e847b0@mail.telecommunity.com>
References: <5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031208103018.03e847b0@mail.telecommunity.com>
Message-ID: <20031208161800.GA4146@rogue.amk.ca>

On Mon, Dec 08, 2003 at 10:35:47AM -0500, Phillip J. Eby wrote:
> whatever they might be.  Making the app or framework declare their safety 
> through a narrow interface on the application object seen by the container 
> incurs needlessly "lossy" transfer of information.

And that convinces me; forget about threading.

--amk

From pje at telecommunity.com  Mon Dec  8 12:09:17 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec  8 12:09:40 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208153901.GA367@vista.netmemetic.com>
References: <5.1.0.14.0.20031208095517.01e63020@mail.telecommunity.com>
	<7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<7904F90F-292D-11D8-A22F-000A95A06FC6@stuartbishop.net>
	<5.1.0.14.0.20031208095517.01e63020@mail.telecommunity.com>
Message-ID: <5.1.1.6.0.20031208120413.02a813b0@telecommunity.com>

At 11:39 PM 12/8/03 +0800, Ng Pheng Siong wrote:
>On Mon, Dec 08, 2003 at 09:57:34AM -0500, Phillip J. Eby wrote:
> > >+1 for None.
> > >
> > >Zope is able to use REMOTE_USER if the web server sets it, e.g.,
> >
> > But what does it do if it's set to 'None'?
>
>Let's see...
>
>   ~/pkg/zope262/lib/python/ZPublisher$ egrep -i remote_user *.py
>
>   BaseRequest.py:    elif request.environ.has_key('REMOTE_USER'):
>   BaseRequest.py:        name=request.environ['REMOTE_USER']
>   HTTPRequest.py:        'REMOTE_USER' : 1,
>   Publish.py:        if realm and not request.get('REMOTE_USER',None):
>
>If it is None, Zope does nothing about it, I suppose.

Did you trace every use of 'name' after it's set from REMOTE_USER, to be 
sure that it's okay for it to be None?

I'm not saying there's a problem, I'm saying that it's silly to force the 
authors of every framework to go hunt down every existing use of *every* 
environment variable to be sure they're safe with them being None.


>Well, the plural of `anecdote' is not `data' ;-), but it does seem to me

No kidding.  Even if "some" set of frameworks are okay with None, that's 
not the same as "all" frameworks.  OTOH, any currently correct code will 
work if we *don't* use None or '', making that approach immeasurably 
superior from an "immediate adoption ability" point of view.


>the following 2 styles will be dominant:
>
>1)
>     x = dict.get('XX', None)
>     if x is None:
>         ...
>
>2)
>     if dict.has_key('XX'):
>         ...
>
>So I'm guessing it is not terribly important whether it is '' or None.

Actually, you've just given evidence that it is VERY important.  Code that 
currently uses 'has_key' (or 'in') will BREAK if we put None OR '' for 
non-existent keys.  Non-existent keys are clearly critical for backward 
compatibility with the second style you show above.
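
The difference between the two styles can be shown with a small sketch.
The function names are made up for illustration, and 'in' is used as the
present-day spelling of has_key.

```python
def user_by_get(environ):
    # Style 1: treat None as "not authenticated".
    name = environ.get("REMOTE_USER", None)
    return "anonymous" if name is None else name


def user_by_membership(environ):
    # Style 2: presence of the key means "authenticated".
    if "REMOTE_USER" in environ:
        return environ["REMOTE_USER"]
    return "anonymous"
```

With the key absent, both functions return "anonymous". With the key
set to None, the second style hands None back as the user name; with
the key set to '', both styles return an empty user name. Only the
absent key behaves the same way under both styles.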


From gstein at lyra.org  Mon Dec  8 19:54:00 2003
From: gstein at lyra.org (Greg Stein)
Date: Mon Dec  8 19:56:55 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208161800.GA4146@rogue.amk.ca>;
	from amk@amk.ca on Mon, Dec 08, 2003 at 11:18:00AM -0500
References: <5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031208103018.03e847b0@mail.telecommunity.com>
	<20031208161800.GA4146@rogue.amk.ca>
Message-ID: <20031208165400.G15042@lyra.org>

On Mon, Dec 08, 2003 at 11:18:00AM -0500, A.M. Kuchling wrote:
> On Mon, Dec 08, 2003 at 10:35:47AM -0500, Phillip J. Eby wrote:
> > whatever they might be.  Making the app or framework declare their safety 
> > through a narrow interface on the application object seen by the container 
> > incurs needlessly "lossy" transfer of information.
> 
> And that convinces me; forget about threading.

I'm not convinced. If an application is designed with a per-process model
in mind (e.g. CGI), and then you drop it into a threaded model... BOOM!

The application needs to declare whether it is thread-safe. The container
can then verify whether that application can be run within the container
and the container's current configuration.

For example, if you drop a non-thread-safe app into a threaded mod_python,
then I would expect an error to be thrown, and the app to *not* be loaded.

The simple fact is that threading (and the execution model, in general) is
part of the environment. You can't limit it to just the three streams plus
some "environ" dictionary. There is a *very* real impact on the
application, based on how the container is executing those apps.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

From titus at caltech.edu  Mon Dec  8 20:12:02 2003
From: titus at caltech.edu (Titus Brown)
Date: Mon Dec  8 20:12:05 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208165400.G15042@lyra.org>
References: <5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031208103018.03e847b0@mail.telecommunity.com>
	<20031208161800.GA4146@rogue.amk.ca>
	<20031208165400.G15042@lyra.org>
Message-ID: <20031209011202.GA1822@caltech.edu>

-> > > whatever they might be.  Making the app or framework declare their safety 
-> > > through a narrow interface on the application object seen by the container 
-> > > incurs needlessly "lossy" transfer of information.
-> > 
-> > And that convinces me; forget about threading.
-> 
-> I'm not convinced. If an application is designed with a per-process model
-> in mind (e.g. CGI), and then you drop it into a threaded model... BOOM!
-> 
-> The application needs to declare whether it is thread-safe. The container
-> can then verify whether that application can be run within the container
-> and the container's current configuration.
-> 
-> For example, if you drop a non-thread-safe app into a threaded mod_python,
-> then I would expect an error to be thrown, and the app to *not* be loaded.
-> 
-> The simple fact is that threading (and the execution model, in general) is
-> part of the environment. You can't limit it to just the three streams plus
-> some "environ" dictionary. There is a *very* real impact on the
-> application, based on how the container is executing those apps.

I agree; often only a little bit of thought is needed to make sure
something is thread safe, but that thought should be added into
the framework ahead of time.

cheers,
--titus

From grisha at modpython.org  Tue Dec  9 12:29:58 2003
From: grisha at modpython.org (Gregory (Grisha) Trubetskoy)
Date: Tue Dec  9 12:30:04 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
Message-ID: <20031209122505.T62979@onyx.ispol.com>


I must say this is very well written, and seeing such a level of
thoroughness on this list makes me very hopeful, because it produces a
substantive discussion. My hat's off to Phillip for making the effort to
write this.

Having said that, I'm -1 on this PEP. I think it does a very good job of
stating the problem (and this in itself is immensely valuable), but I do
not agree with the solution. I think it trades efficiency for simplicity,
and to paraphrase Franklin, if you give up one for the other you will get
neither :-)

The approach this spec takes is modeled after CGI, which was designed with
shell scripts in mind and condenses things down to the UNIX primitives of
stdin, stdout, stderr, environ (and cwd).

On the surface this appears fine, but consider setting an HTTP header.
Headers do not fit into the above-mentioned primitives, so CGI requires
the application to send them to stdout. Writing headers to stdout is much
more cumbersome than passing them in a mapping object of some sort. And
most web servers' CGI implementations do not pass the header portion of
stdout straight to the client. They actually parse those headers,
optionally alter them and adjust their own behavior based on the header
information, then add the resulting data to the server header structure
(e.g. headers_out table in case of Apache). This is inefficient, and ugly.
And it is a direct consequence of the way CGI is specified.
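
To make the round trip concrete, here is a rough sketch of the parsing
step a server has to perform on CGI output before it can fill in its
own header structure. The function name split_cgi_output is
hypothetical, and the sketch ignores details such as continuation lines.

```python
import re


def split_cgi_output(raw):
    """Split CGI-style output into (list of (name, value), body)."""
    # The header block ends at the first blank line (CRLF or bare LF).
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    head = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    headers = []
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers.append((name.strip(), value.strip()))
    return headers, body
```

The application formats its headers as text, and the server immediately
turns that text back into name/value pairs.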

It is understandable why CGI does it, given that CGI was meant for running
executables in a separate process on UNIX to serve a request. But there is
no reason why such limitations should be carried over to environments that
do not have the constraints of CGI. Especially considering that the whole
idea of running an executable to serve an HTTP request looks pretty weird
as a way to develop web applications these days.

Whatever spec we come up with, IMO should deal in terms of the HTTP
protocol request, headers, body, etc. Trying to narrow it down to input,
output and environment is fitting a square peg into a round hole.

Three other notes:

1. On the threading point - aside from thread-safety there is another big
issue, it's the shared memory space. Some frameworks assume that they are
running in one process and take it for granted that making something
global will make it available to all other requests, which obviously isn't
going to work on per-process servers.

2. If we're going to refer to a CGI specification, then we should rely on
the RFC draft at http://cgi-spec.golux.com/. The stuff at NCSA's hoohoo
page is more of a joke than a spec.

3. Mod_python *can* be threaded on apache 1.3, because 1.3 is threaded on
Windows. Considering that Apache, IIS and iPlanet (or whatever it's called
now) account for the vast majority of the web servers out there, there are
likely more threaded servers than not threaded, so I wouldn't throw out
thread-safety as a non-consideration.

Grisha


From pje at telecommunity.com  Tue Dec  9 14:32:06 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Tue Dec  9 14:32:17 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031209122505.T62979@onyx.ispol.com>
Message-ID: <5.1.1.6.0.20031209134549.02a98ec0@telecommunity.com>

At 12:29 PM 12/9/03 -0500, Gregory (Grisha) Trubetskoy wrote:
>On the surface this appears fine, but consider setting an HTTP header.
>Headers do not fit into the above-mentioned primitives, so CGI requires
>the application to send them to stdout. Writing headers to stdout is much
>more cumbersome than passing them in a mapping object of some sort.

This is a non-problem.  You can't wave your hand without hitting at least 
half a dozen *already written*, documented, even supported libraries that 
handle this in as many ways as one might like.  And plenty of people 
obviously find them to be of adequate performance and usability.


>  And
>most web servers' CGI implementations do not pass the header portion of
>stdout straight to the client. They actually parse those headers,
>optionally alter them and adjust their own behavior based on the header
>information, then add the resulting data to the server header structure
>(e.g. headers_out table in case of Apache). This is inefficient, and ugly.

...and implemented, and documented, and portable, and highly available, and 
widely accepted.

Practicality beats purity.


>Whatever spec we come up with, IMO should deal in terms of the HTTP
>protocol request, headers, body, etc. Trying to narrow it down to input,
>output and environment is fitting a square peg into a round hole.

I think perhaps there's some confusion about the PEP's goals here.  It is 
in no way intended to be an ideal spec, a pure spec, or an efficient 
spec.  It's *absolutely* not trying to be another framework.  It is aimed 
solely at being an *implemented* and *universally available* spec -- right 
now, today, without waiting for another version of Python or trying to 
convince people to use it *in place of* their existing working 
tools.  Rather, the spec should enable people to use other tools in 
*addition* to their existing ones.

Note that this does not preclude the existence of other specifications for 
more advanced capabilities.  However, such specs will naturally be less 
frequently available or implemented.  Meanwhile, there's scarcely a server 
in existence that doesn't support CGI.


>Three other notes:
>
>1. On the threading point - aside from thread-safety there is another big
>issue, it's the shared memory space. Some frameworks assume that they are
>running in one process and take it for granted that making something
>global will make it available to all other requests, which obviously isn't
>going to work on per-process servers.

That's true.  Such frameworks, however, will need to document that they 
will only work in single-process containers.  Users will then correctly 
perceive this as a limitation of the framework.

But it's an important point to add to the spec.  Thanks for pointing it out.


>2. If we're going to refer to a CGI specification, then we should rely on
>the RFC draft at http://cgi-spec.golux.com/. The stuff at NCSA's hoohoo
>page is more of a joke than a spec.

Thanks for the reference; I took the first thing that came up in Google 
that seemed informative.  :)


>3. Mod_python *can* be threaded on apache 1.3, because 1.3 is threaded on
>Windows.

My mistake, sorry.


>Considering that Apache, IIS and iPlanet (or whatever it's called
>now) account for the vast majority of the web servers out there, there are
>likely more threaded servers than not threaded, so I wouldn't throw out
>thread-safety as a non-consideration.

I was referring to existing, available containers written in/for Python 
code, but I can certainly see that might be the case.  But it still doesn't 
change the possibility of containers having different threading 
models.  For example, some servers may use dedicated per-application thread 
pools.  Others might have a generic thread pool.  Some might pre-allocate 
application objects, others might allocate on demand.

Whatever the model, these are things that a container must 
configure.  Explicit being better than implicit, it would be better to 
configure these things in the container.  And because "in the face of 
ambiguity, refuse the temptation to guess," I don't want to guess what 
threading settings can or should exist and define a spec for them, nor 
should containers try to guess their settings from an ambiguous "I'm (not) 
threadsafe" flag.

Thus, the intent is merely to provide a transport conduit, rather than a 
configuration mechanism.  I'd prefer to leave a threading spec to version 
2.0, *after* there's widespread adoption of the 1.0 spec, and therefore 
widespread experience with the needs of containers and the issues of 
applications operating under it.


From pje at telecommunity.com  Tue Dec  9 14:43:59 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Tue Dec  9 14:44:05 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031208165400.G15042@lyra.org>
References: <20031208161800.GA4146@rogue.amk.ca>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031207133404.023e1070@mail.telecommunity.com>
	<5.1.0.14.0.20031208094130.03e80bd0@mail.telecommunity.com>
	<5.1.0.14.0.20031208103018.03e847b0@mail.telecommunity.com>
	<20031208161800.GA4146@rogue.amk.ca>
Message-ID: <5.1.1.6.0.20031209143316.00aba080@telecommunity.com>

At 04:54 PM 12/8/03 -0800, Greg Stein wrote:

>I'm not convinced. If an application is designed with a per-process model
>in mind (e.g. CGI), and then you drop it into a threaded model... BOOM!
>
>The application needs to declare whether it is thread-safe. The container
>can then verify whether that application can be run within the container
>and the container's current configuration.
>
>For example, if you drop a non-thread-safe app into a threaded mod_python,
>then I would expect an error to be thrown, and the app to *not* be loaded.
>
>The simple fact is that threading (and the execution model, in general) is
>part of the environment. You can't limit it to just the three streams plus
>some "environ" dictionary. There is a *very* real impact on the
>application, based on how the container is executing those apps.

I'll add a "Threading and Process Issues" section to the PEP, explicitly 
addressing the types of issues that could occur (that we know of at 
present), and recommending what framework and container authors should 
document about their framework or container's requirements or 
capabilities.  However, I think that trying to establish a metadata 
standard for these issues is premature, and should be left to a version 
2.0, similar to the way the DBAPI 2.0 added a threading metadata 
specification, after driver authors and users had some experience with what 
kinds of issues existed.
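
For reference, the DBAPI 2.0 metadata mentioned above is a module-level integer named `threadsafety`. The sketch below first reads that real attribute from the `sqlite3` module, then shows what an analogous container-side check might look like. The `thread_safe` attribute and the `check_loadable` helper are hypothetical names invented for illustration; nothing like them is part of the 1.0 proposal.

```python
import sqlite3

# DB-API 2.0 modules publish an integer `threadsafety` level (0-3):
# 0 = threads may not share the module, 1 = module only,
# 2 = module and connections, 3 = module, connections and cursors.
print("sqlite3 threadsafety level:", sqlite3.threadsafety)


class ExampleApp:
    # Hypothetical declaration, NOT defined by the 1.0 proposal.
    thread_safe = False

    def runCGI(self, input, output, errors, environ):
        output.write("Content-Type: text/plain\r\n\r\nhello\n")


def check_loadable(app, container_is_threaded):
    """Refuse to load an app that does not declare itself thread-safe."""
    if container_is_threaded and not getattr(app, "thread_safe", False):
        raise RuntimeError("application is not declared thread-safe")
    return app
```

A single boolean is almost certainly too coarse for the cases raised so far, which is part of the argument for waiting until there is more experience.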

(Note that although lots of people have so far said "threading is important 
and should be in the spec", nobody has said, "this is what the spec should 
say about it."  I'm taking this as an indication that nobody really knows 
what it should say, and that it's therefore premature to specify it.)

Anyway, attempting to summarize the issues raised so far:

* A framework that uses globals for inter-request communication will fail 
in a multi-process container
* A framework that uses files or shared memory as an IPC mechanism will 
fail in a multi-server cluster container
* Some frameworks are not thread-safe unless multiple application objects 
are created
* Some frameworks are not thread-safe *even if* multiple application 
objects are created
* Some frameworks may require explicit flagging, or other special coding 
practices in order to be thread-safe

Have I missed anything?
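
As an illustration of the first and last items in that list, here is a minimal sketch of an application that keeps inter-request state in a module global. `CounterApp`, `_hits` and `_hits_lock` are made-up names; only the `runCGI(input, output, errors, environ)` signature comes from the proposal. The lock makes the counter safe between threads, but each process still gets its own copy of `_hits`, so a multi-process container would report a different count from each worker.

```python
import threading

_hits = 0                      # module-level state: one copy per process
_hits_lock = threading.Lock()  # guards _hits between threads


class CounterApp:
    def runCGI(self, input, output, errors, environ):
        global _hits
        # Without the lock, concurrent requests could lose updates.
        with _hits_lock:
            _hits += 1
            count = _hits
        output.write("Content-Type: text/plain\r\n\r\n")
        output.write("hits in this process: %d\n" % count)
```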


From paul.boddie at ementor.no  Wed Dec 10 10:48:39 2003
From: paul.boddie at ementor.no (Paul Boddie)
Date: Wed Dec 10 10:48:44 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
Message-ID: <FD72AF7813F1294C95279EC6D9784A2F4C7CE1@100NOOSLMSG004.common.alpharoot.net>

Gregory (Grisha) Trubetskoy wrote:
>
> The approach this spec takes is modeled after CGI, which was designed with
> shell scripts in mind and condenses things down to the UNIX primitives of
> stdin, stdout, stderr, environ (and cwd).

I would have thought that this kind of interface would have been more
suitable between the environment and the container, or possibly between
components within the container.

> On the surface this appears fine, but consider setting an HTTP header.
> Headers do not fit into the above-mentioned primitives, so CGI requires
> the application to send them to stdout. Writing headers to stdout is much
> more cumbersome than passing them in a mapping object of some sort.

And I can imagine that for many applications in many of the current
frameworks, they would need some kind of "insulating wrapper" to comply with
this interface. Certainly, I don't recall Webware, mod_python, Twisted or
Zope applications sending headers to the same output stream as the data (or
even using an output stream for the headers at all).
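
A rough sketch of such an "insulating wrapper", assuming a framework that keeps its headers in a mapping: the helper below (the name `write_cgi_response` is invented here) serializes the mapping onto the output stream in the CGI style, that is, header lines, a blank line, then the body.

```python
import io


def write_cgi_response(output, headers, body):
    """Emit a CGI-style response: header lines, a blank line, then the body."""
    for name, value in headers.items():
        output.write("%s: %s\r\n" % (name, value))
    output.write("\r\n")  # blank line separates headers from body
    output.write(body)


out = io.StringIO()
write_cgi_response(
    out, {"Content-Type": "text/plain", "Status": "200 OK"}, "hello\n"
)
print(out.getvalue())
```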

[...]

> Whatever spec we come up with, IMO should deal in terms of the HTTP
> protocol request, headers, body, etc. Trying to narrow it down to input,
> output and environment is fitting a square peg into a round hole.

Agreed. I think we also need to consider where this interface "surfaces" in
the application or framework; i.e., where you would expect to find it, and
what might sit on top. As I noted above, right now, many applications would
need a few framework calls between the invocation of the runCGI function and
an actual entry point into the application itself.

This pre-PEP seems to serve an important purpose: it attempts to make a
certain part of the Web request handling "stack" explicit. I'd certainly be
interested in trying to make other parts of that "stack" more obvious, too.
For example, it would be nice to consider the resolution of requests
according to information contained within them, and the dispatching of such
requests to resources. Right now, each framework seems to have its own
ideology which states that requests using particular paths must get resolved
in a particular way - it would be great if an API appeared that let
developers rewire frameworks without resorting to external hacks to get the
desired behaviour.

Paul

From jjl at pobox.com  Wed Dec 10 12:15:35 2003
From: jjl at pobox.com (John J Lee)
Date: Wed Dec 10 12:16:01 2003
Subject: [Web-SIG] [Python-Dev] PEP 292 and templating (fwd)
Message-ID: <Pine.LNX.4.58.0312100311180.17714@alice>

---------- Forwarded message ----------
Date: Tue, 9 Dec 2003 21:55:55 -0500
From: Raymond Hettinger <raymond.hettinger@verizon.net>
Reply-To: python@rcn.com
To: python-dev@python.org
Subject: [Python-Dev] PEP 292 and templating

Is there interest in having a templating module with two functions, one
for simple substitutions and the other with more tools?

The first would be Barry's simple substitutions using only $name or
${name} for templates exposed to the user.

The second would extend the first with Cheetah style dotted names for
more advanced templates controlled by the programmer.
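
For readers coming to this later: PEP 292's simple substitution was subsequently added to the standard library as `string.Template` (Python 2.4). The first half of the sketch below uses that real class. The second half is only a guess at what "dotted names" could look like: `DottedTemplate` and `AttrLookup` are hypothetical helpers written for this illustration, not Cheetah's behaviour and not anything proposed in the message.

```python
from string import Template

# Simple $name / ${name} substitution, as described in PEP 292.
simple = Template("Hello, $name! You have ${count} new messages.")
print(simple.substitute(name="Ada", count=3))


class DottedTemplate(Template):
    # Allow dotted identifiers such as $user.name (illustrative only).
    idpattern = r"[_a-z][_a-z0-9]*(?:\.[_a-z][_a-z0-9]*)*"


class AttrLookup:
    """Mapping-like helper: resolves 'a.b.c' by item, then attribute access."""

    def __init__(self, namespace):
        self.namespace = namespace

    def __getitem__(self, dotted):
        first, *rest = dotted.split(".")
        value = self.namespace[first]
        for part in rest:
            value = getattr(value, part)
        return value
```

Usage would be along the lines of `DottedTemplate("Dear $user.name").substitute(AttrLookup({"user": some_object}))`.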


Raymond Hettinger


From pje at telecommunity.com  Wed Dec 10 13:17:36 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Wed Dec 10 13:17:49 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <FD72AF7813F1294C95279EC6D9784A2F4C7CE1@100NOOSLMSG004.common.alpharoot.net>
Message-ID: <5.1.1.6.0.20031210113456.02b183c0@telecommunity.com>

At 04:48 PM 12/10/03 +0100, Paul Boddie wrote:
>Gregory (Grisha) Trubetskoy wrote:
> >
> > The approach this spec takes is modeled after CGI, which was designed with
> > shell scripts in mind and condenses things down to the UNIX primitives of
> > stdin, stdout, stderr, environ (and cwd).
>
>I would have thought that this kind of interface would have been more
>suitable between the environment and the container, or possibly between
>components within the container.

I'm guessing that for mod_python, the proposed interface isn't as suitable, 
doubtless prompting some of Grisha's concerns.  From a mod_python point of 
view, the proposed interface is "lossy", at least from a performance point 
of view, and probably also from a power/flexibility point of view.

But the flip side is that if this "lossy" interface were available in 
mod_python, it would actually bring many more users to mod_python, since 
they'd be able to use a wider variety of frameworks with it.  If those 
users then came to want things that weren't available through the narrow 
"runCGI" interface, then they could consider doing additional work to use 
mod_python's native interface.

I know that I, for one, would be more likely to experiment with other 
mod_python capabilities once I had my "foot in the door" via the simple 
interface.


> > On the surface this appears fine, but consider setting an HTTP header.
> > Headers do not fit into the above-mentioned primitives, so CGI requires
> > the application to send them to stdout. Writing headers to stdout is much
> > more cumbersome than passing them in a mapping object of some sort.
>
>And I can imagine that for many applications in many of the current
>frameworks, they would need some kind of "insulating wrapper" to comply with
>this interface. Certainly, I don't recall Webware, mod_python, Twisted or
>Zope applications sending headers to the same output stream as the data (or
>even using an output stream for the headers at all).

Zope definitely does, and from Andrew's comments, so does Quixote.  Twisted 
is a web server, so it won't, but I believe it already has a CGI interface 
for running external programs, that could be used for this purpose 
(presumably by running the application in a separate thread).

Here are example wrappers for Zope 2 and Zope 3 (untested, but based on 
existing code I use in production (Z2) and dev (Z3)):

class Zope2App:

     def __init__(self, modulename):
         self.moduleToPublish = modulename

     def runCGI(self,input,output,errors,environ):
         from ZPublisher.Publish import publish_module
         publish_module(
             self.moduleToPublish, stdin=input, stdout=output,
             stderr=errors, environ=environ
         )

class Zope3App:

     # HTTP methods that are handled as browser (or XML-RPC) requests
     _browser_methods = 'GET','HEAD','POST'

     def __init__(self, publication):
         self.policy = publication

     def runCGI(self,input,output,errors,environ):

         from zope.publisher import http, browser, xmlrpc, publish

         method = environ.get('REQUEST_METHOD', 'GET').upper()

         # Pick the request class from the method and content type
         if method in self._browser_methods:
             if (method == 'POST' and
                 environ.get('CONTENT_TYPE', '').lower().startswith('text/xml')
                 ):
                 request_type = xmlrpc.XMLRPCRequest
             else:
                 request_type = browser.BrowserRequest
         else:
             request_type = http.HTTPRequest

         request = request_type(input, output, environ)

         request.setPublication(self.policy)
         publish.publish(request)


>This pre-PEP seems to serve an important purpose: it attempts to make a
>certain part of the Web request handling "stack" explicit. I'd certainly be
>interested in trying to make other parts of that "stack" more obvious, too.
>For example, it would be nice to consider the resolution of requests
>according to information contained within them, and the dispatching of such
>requests to resources. Right now, each framework seems to have its own
>ideology which states that requests using particular paths must get resolved
>in a particular way - it would be great if an API appeared that let
>developers rewire frameworks without resorting to external hacks to get the
>desired behaviour.

The proposed interface actually allows that too; in fact, it's why environ 
must be modifiable by the "app".  It should be easy to create a "router" 
app that accepts a runCGI call and forwards it to other application objects 
implementing the interface.  Thus, multiple frameworks, apps, or other 
objects can be "mounted" within a container even at a single virtual "mount 
point".
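
A minimal sketch of such a router, under stated assumptions: the `Router` and `EchoApp` classes and the prefix-matching rule are invented for illustration, and the use of `SCRIPT_NAME`/`PATH_INFO` simply follows the usual CGI convention. Only the `runCGI(input, output, errors, environ)` signature is taken from the proposal.

```python
class Router:
    """Hypothetical "router" app: forwards runCGI to a mounted app."""

    def __init__(self):
        self.mounts = []  # list of (prefix, app) pairs

    def mount(self, prefix, app):
        self.mounts.append((prefix.rstrip("/"), app))
        # Try longer prefixes first so "/blog/admin" wins over "/blog".
        self.mounts.sort(key=lambda pair: len(pair[0]), reverse=True)

    def runCGI(self, input, output, errors, environ):
        path = environ.get("PATH_INFO", "")
        for prefix, app in self.mounts:
            if path == prefix or path.startswith(prefix + "/"):
                # Shift the matched prefix from PATH_INFO onto SCRIPT_NAME;
                # this is why environ has to be modifiable.
                environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + prefix
                environ["PATH_INFO"] = path[len(prefix):]
                return app.runCGI(input, output, errors, environ)
        output.write("Status: 404 Not Found\r\n")
        output.write("Content-Type: text/plain\r\n\r\nNot found\n")


class EchoApp:
    """Toy app that reports the path information it was handed."""

    def runCGI(self, input, output, errors, environ):
        output.write("Content-Type: text/plain\r\n\r\n")
        output.write("%s|%s\n" % (environ["SCRIPT_NAME"], environ["PATH_INFO"]))
```

Because the router itself exposes `runCGI`, routers can be nested, and each mounted object sees a `PATH_INFO` relative to its own mount point.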


From neel at mediapulse.com  Wed Dec 10 14:15:10 2003
From: neel at mediapulse.com (Michael C. Neel)
Date: Wed Dec 10 14:15:16 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
Message-ID: <C0FC22C08B82074A88B500617641577787A7E4@johnson.mediapulse.net>

> I'm guessing that for mod_python, the proposed interface 
> isn't as suitable, 
> doubtless prompting some of Grisha's concerns.  From a 
> mod_python point of 
> view, the proposed interface is "lossy", at least from a 
> performance point 
> of view, and probably also from a power/flexibility point of view.
> 
> But the flip side is that if this "lossy" interface were available in 
> mod_python, it would actually bring many more users to 
> mod_python, since 
> they'd be able to use a wider variety of frameworks with it.  
> If those 
> users then came to want things that weren't available through 
> the narrow 
> "runCGI" interface, then they could consider doing additional 
> work to use 
> mod_python's native interface.
> 
> I know that I, for one, would be more likely to experiment with other 
> mod_python capabilities once I had my "foot in the door" via 
> the simple 
> interface.

This highlights my two concerns with this PEP, whose thread I've been
following.

One is how willing are developers of the current systems to rewrite or
provide a wrapper for this new one?  Off the top of my head I know
mod_python has for it: (its own) PSP and Publisher, Albatross, Spyce,
and Draco.  Can we really expect all of these to update to use this new
standard?  Or do we just want mod_python to expose another interface?

Which leads to my other concern; should this even be a concern?  The
goal here is to update/add to the stdlib.  Since the odds of mod_python
becoming part of the stdlib are nil, should we even worry about a spec
for things like mod_python and Zope?

I freely admit I don't "get it" yet, and may be missing the bigger
picture.  This sounds to me like a Java server type of thing - a framework
generic enough that I can take my app from one system to another with
no changes needed.  While I need my client side to be as flexible as
possible, it's extremely rare that in practice it's needed at the server
side, because it's rare that the whole platform changes (and usually when
it does, it comes along with a rewrite/upgrade to the app anyway, making
keeping the code even less useful).

That said, I want anything in the stdlib to jibe, so that if I change
from one class to another (for the same role), they both expose the same
interface.  So in that scope, I see something like this being very
helpful.

Mike

From amk at amk.ca  Wed Dec 10 17:50:57 2003
From: amk at amk.ca (A.M. Kuchling)
Date: Wed Dec 10 17:51:28 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <C0FC22C08B82074A88B500617641577787A7E4@johnson.mediapulse.net>
References: <C0FC22C08B82074A88B500617641577787A7E4@johnson.mediapulse.net>
Message-ID: <20031210225057.GA13911@rogue.amk.ca>

On Wed, Dec 10, 2003 at 02:15:10PM -0500, Michael C. Neel wrote:
> Which leads to my other concern; should this even be a concern?  The
> goal here is to update/add to the stdlib.  Since the odds of mod_python
> becoming part of the stdlib are nil, should we even worry about a spec
> for things like mod_python and Zope?

No, adding to the stdlib is not necessarily the goal.  The DB-API isn't
represented in the stdlib either, yet it's still useful for ensuring a
certain amount of consistency between database modules.  Authors of modules
can follow the API or not, and they're only responsible to their users  
about whether they do.

Similarly, this PEP is an informational document describing a certain
convention that web frameworks can follow or not, as they see fit.  And it
helps alleviate the O(n**2) problem of connecting various publishing schemes
together.  Want to run Quixote under Twisted?  Go write an adapter.  Want to
run Webware under SCGI?  Go write an adapter.  If each piece supported this
interface, at least it would be fairly easy to combine tools without having
to write a different chunk of adapter code for each possible pair.

--amk

From pje at telecommunity.com  Wed Dec 10 18:42:39 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Wed Dec 10 18:42:46 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <C0FC22C08B82074A88B500617641577787A7E4@johnson.mediapulse.net>
Message-ID: <5.1.1.6.0.20031210181755.02a64ca0@telecommunity.com>

At 02:15 PM 12/10/03 -0500, Michael C. Neel wrote:
>One is how willing are developers of the current systems to rewrite or
>provide a wrapper for this new one?  Off the top of my head I know
>mod_python has for it: (its own) PSP and Publisher, Albatross, Spyce,
>and Draco.  Can we really expect all of these to update to use this new
>standard?  Or do we just want mod_python to expose another interface?

Yes.  But note that it's not necessarily the authors of mod_python that 
have to provide it.  Somebody that wants to run PyWCI apps under mod_python 
could write a PyWCI container that runs under the existing mod_python 
API.  However, somebody would only need to write this once, for everybody 
to be able to take advantage of it under mod_python.  And, other frameworks 
would need only to expose a PyWCI-compliant 'runCGI' method, to be able to 
run in that container (assuming that their process model was compatible).
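
To give a feel for how small such a container can be, here is a sketch of one that runs an application as an ordinary CGI script. It is an assumption-laden illustration, not the mod_python container described above: `run_as_cgi` and `HelloApp` are invented names, and only the `runCGI` signature comes from the proposal.

```python
import os
import sys


def run_as_cgi(app):
    """Hypothetical container: run a runCGI-style app as a plain CGI script."""
    environ = dict(os.environ)  # copy, so the app may modify it freely
    app.runCGI(sys.stdin, sys.stdout, sys.stderr, environ)
    sys.stdout.flush()


class HelloApp:
    def runCGI(self, input, output, errors, environ):
        output.write("Content-Type: text/plain\r\n\r\n")
        output.write("Hello from %s\n" % environ.get("SERVER_NAME", "nowhere"))


if __name__ == "__main__":
    run_as_cgi(HelloApp())
```

A container for a threaded or long-running server would differ mainly in where the streams and `environ` come from; the application side stays the same.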


>Which leads to my other concern; should this even be a concern?  The
>goal here is to update/add to the stdlib.

That's a minor and mostly tangential concern for the proposal as such.  I 
posted the proposal here before putting it out in the wider world of 
python-list, because:

1) the proposal offers some direction for an interface between any new 
stdlib container pieces and any application-like pieces

2) There are lots of web framework and container authors here, who presumably 
have some interest in Python "web standards".  So, I assumed that the best 
peer review for early feedback would be found here.

So, my goals for the proposal are really orthogonal to the standard library 
goals of the Web-SIG, but are nonetheless of interest to the Web-SIG 
membership, if that makes sense.


>I freely admit I don't "get it" yet, and may be missing the bigger
>picture.  This sounds to me like a Java server type of thing - a framework
>generic enough that I can take my app from one system to another with
>no changes needed.

Assuming that your threading and/or process model are compatible, yes, you 
should have your choice of containers for physical deployment of the 
app.  But there are bigger gains than that to be had.  See below.


>   While I need my client side to be as flexible as
>possible, it's extremely rare that in practice it's needed at the server
>side, because it's rare that the whole platform changes (and usually when
>it does, it comes along with a rewrite/upgrade to the app anyway, making
>keeping the code even less useful).

That's all true, but not the point of the proposal.  The issue is user 
choice when initially *selecting* the container.  Right now, your runtime 
platform needs can drastically affect your options for what kind of 
framework you can use, because what frameworks you can use depends heavily 
on what kind of runtime container you need to support.

With widespread adoption of PyWCI, your container choice would not 
significantly narrow your framework choice, and you would also have the 
option of mixing frameworks by using a PyWCI-based request router.

So, it's not so much about being able to *move* your application (although 
it's nice to know you can "move up" or "move sideways" as needed), as it is 
about being able to have more choices in the first place.

The thing that creates user uncertainty about Python web programming right 
now is *not* that there are dozens of choices.  It's that you have to pick 
*one*, and then you're probably stuck with it.  And *none* of your learning 
or runtime environment may stay with you if you switch.  The mere 
*existence* of a widely-supported container interface will be a significant 
peace-of-mind booster for PHBs and developers alike.


>That said, I want anything in the stdlib to jibe, so that if I change
>from one class to another (for the same role), they both expose the same
>interface.  So in that scope, I see something like this being very
>helpful.

Yes, and this ties into my point about having a widely-supported 
"standard".  But, my intent is to bootstrap the standard into widespread 
use, without necessarily going through the stdlib first.

In the past, Guido has seemed to me to prefer to base the stdlib on "de 
facto" standards representing community experience, over "de jure" 
standards representing what people think might be a good idea.  Thus, if 
PyWCI were widely implemented, that would be in itself a justification for 
its use in the standard library, and thus beneficial to the Web-SIG's 
efforts in that regard.


From grisha at modpython.org  Wed Dec 10 23:52:45 2003
From: grisha at modpython.org (Gregory (Grisha) Trubetskoy)
Date: Wed Dec 10 23:52:48 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031210225057.GA13911@rogue.amk.ca>
References: <C0FC22C08B82074A88B500617641577787A7E4@johnson.mediapulse.net>
	<20031210225057.GA13911@rogue.amk.ca>
Message-ID: <20031210233452.D92881@onyx.ispol.com>


On Wed, 10 Dec 2003, A.M. Kuchling wrote:

> Similarly, this PEP is an informational document describing a certain
> convention that web frameworks can follow or not, as they see fit.  And it
> helps alleviate the O(n**2) problem of connecting various publishing schemes
> together.  Want to run Quixote under Twisted?  Go write an adapter.  Want to
> run Webware under SCGI?  Go write an adapter.  If each piece supported this
> interface, at least it would be fairly easy to combine tools without having
> to write a different chunk of adapter code for each possible pair.

The PEP will help with this problem, and as such I'm willing to support
it, but at the same time I won't, in all honesty, be able to say "problem
solved" in the best possible way, or even that we are moving in that
direction. (But I think Phillip and I agree on this.)

I really liked the problem statement in the PEP; perhaps we can add a note
to it that the problem can have a much more comprehensive solution and
that the solution described, although simple, isn't the most efficient and
is in many ways deficient. This will shut up people like me who will read
the PEP and say "But this is just the old lame CGI?".

The real solution IMHO is going to be something similar to the Java
Servlet specification. It's a pretty complex issue, probably enough so to
start a whole separate SIG on.

Grisha


From pje at telecommunity.com  Thu Dec 11 00:10:31 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Thu Dec 11 00:08:45 2003
Subject: [Web-SIG] Pre-PEP: Python Web Container Interface v1.0
In-Reply-To: <20031210233452.D92881@onyx.ispol.com>
References: <20031210225057.GA13911@rogue.amk.ca>
	<C0FC22C08B82074A88B500617641577787A7E4@johnson.mediapulse.net>
	<20031210225057.GA13911@rogue.amk.ca>
Message-ID: <5.1.0.14.0.20031210235844.03b8aec0@mail.telecommunity.com>

At 11:52 PM 12/10/03 -0500, Gregory (Grisha) Trubetskoy wrote:

>I really liked the problem statement in the PEP; perhaps we can add a note
>to it that the problem can have a much more comprehensive solution and
>that the solution described, although simple, isn't the most efficient and
>is in many ways deficient. This will shut up people like me who will read
>the PEP and say "But this is just the old lame CGI?".

I'll make sure this viewpoint is included when I do the next draft 
(probably this weekend).

It will probably be by saying something like, "this spec doesn't give the 
application any direct control over a container, and so may be 
unsatisfactory for some more-demanding applications.  In practice, such 
applications today must interact directly with a web server, as via 
mod_python, or via the internal API of a web server written in Python.  It 
is possible that future versions of this specification, or another 
specification, will address these more demanding needs.

"However, in the interests of providing the greatest good to the greatest 
number as soon as practical, this version of the specification will focus 
on simplicity and ease of implementation (to encourage rapid adoption), and 
high portability (to encourage widespread adoption).  Once this occurs, 
container and application/framework developers will be in a better position 
to define requirements for a complementary application-to-container 
interface to supplement this container-to-application interface."

Something like that, anyway.  I'll probably work that in with some of the 
threading stuff.

So far, there's going to be a new Goals/Scope section that'll deal with 
these and other scope issues that people found confusing.  There'll be a 
section added on threading and process model issues.  There'll need to be 
an expanded rationale regarding the whole dictionary thing.  And, I'll add 
a "Discussion and Dissention" section to cover the positive and negative 
feedback so far.