[Web-SIG] A Python Web Application Package and Format

Alice Bevan–McGregor alice at gothcandy.com
Fri Apr 15 21:05:32 CEST 2011


On 2011-04-14 10:34:59 -0700, Ian Bicking said:

> I think there's a general concept we should have, which I'll call a 
> "script" -- but basically it's a script to run (__main__-style), a 
> callable to call (module:name), or a URL to fetch internally.

Agreed.  The reference notation I mentioned in my reply to Graham, with 
the addition of URI syntax, covers all of those options.
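A rough sketch of how a container might classify the three forms (the function name and return values are illustrative only, not part of any proposed spec):

```python
# Hypothetical sketch: classifying the three reference forms discussed
# above -- a __main__-style script, a module:name callable, or a URL
# to fetch internally.  Not part of any spec.
from urllib.parse import urlparse

def classify_reference(ref):
    """Return 'url', 'callable', or 'script' for a reference string."""
    parsed = urlparse(ref)
    if parsed.scheme in ('http', 'https'):
        return 'url'           # fetched internally over HTTP
    if ':' in ref:
        return 'callable'      # e.g. "myapp.setup:post_install"
    return 'script'            # e.g. "scripts/post-install.py"
```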

> I want to keep this distinct from anything long-running, which is a 
> much more complex deal.

The primary application is only potentially long-running.  (You could, 
in theory, deploy an app as CGI, but that way lies madness.)  However, 
the reference syntax mentioned (excepting URL) works well for 
identifying this.

> I think given the three options, and for general simplicity, the script 
> can be successful or have an error (for Python code: exception or no; 
> for __main__: zero exit code or no; for a URL: 2xx code or no), and can 
> return some text (which may only be informational, not structured?)

For the simple cases (script / callable), it's pretty easy to trap 
STDOUT and STDERR, deliver INFO log messages to STDOUT, everything else 
to STDERR, then display that to the administrator in some form.  Same 
for HTTP, except that it can include full HTML formatting information.
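The stream split described above can be set up with stock logging handlers and a filter; a minimal sketch (handler arrangement is illustrative, not a spec):

```python
import logging
import sys

# Hypothetical sketch of the split described above: INFO-and-below to
# STDOUT, WARNING-and-above to STDERR, so the container can capture and
# display each stream to the administrator appropriately.
def configure_script_logging():
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    out = logging.StreamHandler(sys.stdout)
    out.setLevel(logging.DEBUG)
    # Reject WARNING and above so they only appear on STDERR.
    out.addFilter(lambda record: record.levelno < logging.WARNING)

    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)

    root.addHandler(out)
    root.addHandler(err)
    return out, err
```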

> An application configuration could refer to scripts under different 
> names, to be invoked at different stages.

A la the already mentioned post-install, pre-upgrade, post-upgrade, 
pre-removal, and cron-like.  Any others?
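Using the reference notation and YAML, those stages might be declared something like this (key names and values are purely illustrative, not a proposed schema):

```yaml
# Hypothetical sketch only -- key names are illustrative, not a spec.
scripts:
  post-install: myapp.setup:create_tables
  pre-upgrade: scripts/backup.py
  post-upgrade: myapp.setup:migrate
  pre-removal: myapp.setup:cleanup
  cron:
    - {every: "1h", run: myapp.tasks:expire_sessions}
```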

> There could be an optional self-test script, where the application 
> could do a last self-check -- import whatever it wanted, check db 
> settings, etc.  Of course we'd want to know what it needed *before* the 
> self-check to try to provide it, but double-checking is of course good 
> too.

Unit and functional tests are the most obvious.  In which case we'll 
need to be able to provide a localhost-only 'mounted' location for the 
application even though it hasn't been installed yet.

> One advantage to a separate script instead of just one 
> script-on-install is that you can more easily indicate *why* the 
> installation failed.  For instance, script-on-install might fail 
> because it can't create the database tables it needs, which is a 
> different kind of error than a library not being installed, or being 
> fundamentally incompatible with the container it is in.  In some sense 
> maybe that's because we aren't proposing a rich error system -- but 
> realistically a lot of these errors will be TypeError, ImportError, 
> etc., and trying to normalize those errors to some richer meaning is 
> unlikely to be done effectively (especially since error cases are hard 
> to test, since they are the things you weren't expecting).

Humans are potentially better at reading tracebacks than machines are, 
so my previous logging idea (script output stored and displayed to the 
administrator in a readable form) combined with a modicum of reasonable 
exception handling within the script should lead to fairly clear errors.

> Categorizing services seems unnecessary.

The descriptions of the different database options were for 
illustration, not actual separation and categorization.

> I'd like to see maybe an | operator, and a distinction between required 
> and optional services.  E.g.:

No need for a new operator; YAML already supports lists.

services:
  - [mysql, postgresql, dburl]

Or:

services:
  required:
    - files

  optional:
    - [mysql, postgresql]

> And then there's a lot more you could do... which one do you prefer, 
> for instance.

The order of services within one of these lists would indicate 
preference, thus MySQL is preferred over PostgreSQL in the second 
example, above.
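An app server could resolve that ordered-preference rule with a few lines; a sketch (function name and return shape are my own, not a spec):

```python
# Hypothetical sketch of preference resolution: each declared entry is
# either a single service name or an ordered list of acceptable
# alternatives; the first alternative the host actually provides wins.
def resolve_services(declared, available):
    """Return the first available alternative per entry (None if unmet)."""
    choices = []
    for entry in declared:
        alternatives = entry if isinstance(entry, list) else [entry]
        choices.append(next((a for a in alternatives if a in available), None))
    return choices
```

With the second example above, a host offering PostgreSQL but not MySQL would satisfy the optional entry with postgresql.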

> Tricky things:
> - You need something funny like multiple databases.  This is very 
> service-specific anyway, and there might sometimes need to be a way to 
> configure the service.  It's also a fairly obscure need.

I'm not convinced that connecting to a legacy database /and/ current 
database is that obscure.  It's also not as hard as Django makes it 
look (with a 1M SLoC change to add support)… WebCore added support in 
three lines.

> - You need multiple applications to share data.  This is hard, not sure 
> how to handle it.  Maybe punt for now.

That's what higher-level APIs are for. ;)

> You mean, the application provides its own HTTP server?  I certainly 
> wouldn't expect that...?

Nor would I; running an HTTP server would be daft.  Running mod_wsgi, 
FastCGI on-disk sockets, or other persistent connector makes far more 
sense, and is what I plan.

Unless you have a very, very specific need (e.g. Tornado), running a 
Python HTTP server in production and HTTP-proxying to it is 
inefficient and a terrible idea.  (Easy deployment model; terrible 
overhead/performance.)

> Anyway, in terms of aggregate, I mean something like a "site" that is 
> made up of many "applications", and maybe those applications are 
> interdependent in some fashion.  That adds lots of complications, and 
> though there's lots of use cases for that I think it's easier to think 
> in terms apps as simpler building blocks for now.

That's not complicated at all; I do those types of aggregate sites 
fairly regularly.  E.g.

/ - CMS
/location - Location & image database.
/resource - Business database.
/admin - Flex administration interface.

That's done at the Nginx/Apache level, where it's most efficient to do 
so, not in Python.

> Sure; these would be tool options, and if you set everything up you are 
> requiring the deployer to invoke the tools correctly to get everything 
> in place.  Which is a fine starting point before formalizing anything.

What?  Not even close—the person deploying an application is relying on 
the application server/service to configure the web server of choice; 
there is no need for deployer action after the initial "Nginx, include 
all .conf files from folder X" where folder X is managed by the app 
server.  (That's one line in /etc/nginx/nginx.conf.)

> Hm... I guess this is an ordering question.  You could import logging 
> and setup defaults, but that doesn't give the container a chance to 
> overwrite those defaults.  You could have the container setup logging, 
> then make sure the app sets defaults only when the container hasn't -- 
> but I'm not sure if it's easy to use the logging module that way.

The logging configuration, in dict form, is passed from the app server 
to the container.  The default logging levels are read by the app 
server from the container.  It's trivially easy, especially when INI 
and YAML files can be programmatically created.
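A sketch of that hand-off using the stdlib dictConfig schema (the merge helper and the override dict are my own illustration, not a proposed protocol):

```python
import logging.config

# Hypothetical sketch: the application ships default logging config as
# a dict; the container merges its own overrides (e.g. levels chosen by
# the deployer) before calling dictConfig.  Keys follow the stdlib
# logging.config schema; the merge helper is illustrative.
app_defaults = {
    'version': 1,
    'root': {'level': 'INFO', 'handlers': ['console']},
    'handlers': {
        'console': {'class': 'logging.StreamHandler', 'level': 'DEBUG'},
    },
}

container_overrides = {'root': {'level': 'WARNING'}}

def merged(defaults, overrides):
    """Shallow per-section merge: container values win."""
    result = {k: dict(v) if isinstance(v, dict) else v
              for k, v in defaults.items()}
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key].update(value)
        else:
            result[key] = value
    return result

logging.config.dictConfig(merged(app_defaults, container_overrides))
```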

> Well, maybe that's not hard -- if you have something like 
> silvercustomize.py that is always imported, and imported fairly early 
> on, then have the container overwrite logging settings before it *does* 
> anything (e.g., sends a request) then you should be okay?

Indeed; container-setup.py or whatever.

> Rich configurations are problematic in their own ways.  While the 
> str-key/str-value of os.environ is somewhat limited, I wouldn't want 
> anything richer than JSON (list, dict, str, numbers, bools).

JSON is a subset of YAML.  I honestly believe YAML meets the 
requirements for richness, simplicity, flexibility, and portability 
that a configuration format really needs.

> And then we have to figure out a place to drop the configuration.  
> Because we are configuring the *process*, not a particular application 
> or request handler, a callable isn't great (unless we expect the 
> callable to drop the config somewhere and other things to pick it up?)

I've already mentioned an environment variable identifying the path to 
the on-disk configuration file—APP_CONFIG_PATH—which would then be read 
in and acted upon by the container-setup.py file which is initially 
imported before the rest of the application.  Also, the application 
factory idea of passing the already read-in configuration dictionary is 
quite handy, here.
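The bootstrap step might look like this (JSON is used here only to keep the sketch stdlib-only -- the thread proposes YAML; the function name is illustrative):

```python
import json
import os

# Hypothetical sketch of the bootstrap described above: the container
# exports APP_CONFIG_PATH, and the early-imported setup module reads
# the file before the rest of the application is loaded.
def load_app_config(environ=os.environ):
    path = environ.get('APP_CONFIG_PATH')
    if path is None:
        return {}              # no container-provided configuration
    with open(path) as handle:
        return json.load(handle)
```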

> I found at least giving one valid hostname (and yes, should include a 
> path) was important for many applications.  E.g., a bunch of apps have 
> tendencies to put hostnames in the database.

Luckily, that's a bad habit we can discourage.  ;)

> I'm not psyched about pointing to a file, though I guess it could work 
> -- it's another kind of peculiar 
> drop-the-config-somewhere-and-wait-for-someone-to-pick-it-up.  At least 
> dropping it directly in os.environ is easy to use directly (many things 
> allow os.environ interpolation already) and doesn't require any 
> temporary files.  Maybe there's a middle ground.

Picked up by the container-setup.py site-customize script.  What's the 
limit on the size of a variable in the environ?  (Also, that memory 
gets permanently allocated for the life of the application; not very 
efficient if we're just going to convert it to a rich internal 
structure.)

> :: Application (package) name.
> 
> This doesn't seem meaningful to me -- there's no need for a one-to-one 
> mapping between these applications and a particular package.  Unless 
> you mean some attempt at a unique name that can be used for indexing?

You're mixing something up, here.  Each application is a single primary 
package with dependencies.  One container per application.

> It would also need a way to specify things like what port to run on

Automatically allocated by the app server.

> public or private interface

Chosen by the deployer during deployment time configuration.

> maybe indicate if something like what proxying is valid (if any)

If it's WSGI, it's irrelevant.  If it's a network service, it shouldn't 
be HTTP.

> maybe process management parameters

For WSGI apps, it's transparent.  Each app server would have its own 
preference (e.g. mine will prefer FastCGI on-disk sockets) and the 
application will be blissfully unaware of that.

> ways to inspect the process itself (since *maybe* you can't send 
> internal HTTP requests into it), etc.

Interesting idea, not sure how that would be implemented or used, though.

> PHP! ;)

PHP can be deployed as a WSGI application.  :P

> I'm not personally that happy with how App Engine does it, as an 
> example -- it requires a regex-based dispatch.

Regex dispatch is terrible.  (I've actually encountered Python's 56KiB 
regular expression size limit on one project!)  Simply exporting 
folders as "top level" webroots would be sufficient, methinks.

> Anything "string-like" or otherwise fancy requires more support 
> libraries for the application to actually be able to make use of the 
> environment.  Maybe necessary, but it should be done with great 
> reluctance IMHO.

I've had great success with string-likes in WebCore/Marrow and 
TurboMail for things like e-mail address lists, e-mail addresses, and 
URLs.
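For anyone unfamiliar with the pattern, a "string-like" is just a str subclass that behaves as a plain string everywhere but exposes parsed structure; a minimal sketch (the Address class is illustrative, not TurboMail's actual API):

```python
# Hypothetical sketch of a "string-like": usable anywhere a str is
# expected (os.environ, logging, templates) while exposing structure.
class Address(str):
    """An e-mail address usable anywhere a str is expected."""

    @property
    def name(self):
        return self.rpartition('@')[0]

    @property
    def domain(self):
        return self.rpartition('@')[2]

addr = Address('alice@gothcandy.com')
```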

	— Alice.



