[Web-SIG] HEAD requests, WSGI gateways, and middleware

Fri Jan 25 03:51:21 CET 2008

Graham Dumpleton wrote:
> The issue here is that Apache has its own output filtering 
> system where filters can set headers based on the actual 
> content. Because of this, any output filter must always 
> receive the response content regardless of whether the 
> request is a GET or HEAD. If an application handler tries to 
> optimise things and not return the content, then these output 
> filters may generate different headers for a HEAD request 
> than a GET request, thereby violating the requirement that 
> they should actually be the same.
> 
> Note that response content is still thrown away for a HEAD 
> request, it is just done at the very last moment after all 
> Apache output filters have processed the data.

Right, that is exactly what I am saying. In Apache's documentation, it
says that every handler should include the response entity for HEAD
requests, so that output filters can process the output. However, there
is nothing in PEP 333 that talks about this behavior. So, the only
reasonable thing to do is to assume that, when environ["REQUEST_METHOD"]
== "HEAD", no response entity should be generated. Do we all agree that
the following application is correct?

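	def application(env, start_response):
		# Advertise the length of the full entity for both GET and HEAD.
		start_response("200 OK",
			[("Content-Length", "10000")])
		if env["REQUEST_METHOD"] == "HEAD":
			# HEAD: headers only, no response entity.
			return []
		else:
			return ["a"*10000]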

Because of web servers' output filters, if the WSGI gateway is a web
server module or a [Fast]CGI script, then it needs to lie and tell the
application that the request is a "GET", not a "HEAD". Otherwise, the
application will see that the request method is "HEAD" and suppress its
own response entity, as the HTTP specification requires, and the output
filters will fail. The only time it is reasonable for the gateway to
pass "HEAD" as the request method is when it knows that there are not
any output filters/middleware that depend on the response entity.
Usually that is only possible in standalone web servers like CherryPy's
or Paste's.

I tested this in mod_wsgi and mod_wsgi gets it wrong. mod_wsgi sets
env["REQUEST_METHOD"] to "HEAD" for HEAD requests. When mod_deflate is
enabled, a HEAD request returns "Content-Length: 20", and a GET request
returns "Content-Length: 46". However, it is supposed to be
"Content-Length: 46" in both cases. The CGI WSGI gateway in PEP 333 gets
it wrong too when mod_deflate is used.
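
The mismatch is easy to reproduce outside Apache. The sketch below is only
an illustration: it uses Python's gzip module as a stand-in for mod_deflate
(an assumption; the real filter's settings may differ, so the exact numbers
need not match the ones above) and compares the compressed size of the empty
HEAD entity with that of the full GET entity from the application above. The
name filtered_length is hypothetical.

```python
import gzip

# Entity the example application returns for GET, and the empty entity
# it returns for HEAD. Bytes are used so the sketch runs on Python 3.
get_body = b"a" * 10000
head_body = b""

def filtered_length(body):
    # Stand-in for a compressing output filter: it derives
    # Content-Length from whatever entity the handler produced.
    return len(gzip.compress(body))

get_length = filtered_length(get_body)
head_length = filtered_length(head_body)

print("GET  Content-Length:", get_length)
print("HEAD Content-Length:", head_length)
```

The filter can only report the GET length for a HEAD request if it is handed
the same entity in both cases.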

Note also that in mod_wsgi, use of wsgi.file_wrapper is a huge
optimization for HEAD requests: if no Apache output filters need the response
entity, and wsgi.file_wrapper is used, then the file will never be read
off the disk. But, if wsgi.file_wrapper is not used, then the entire
file has to be read off the disk through the application's output
iterable for no reason. It would be nice if the non-file_wrapper case
worked as well as the file_wrapper case.
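
For reference, here is a rough sketch of the two cases from the
application's side, following the wsgi.file_wrapper pattern described in
PEP 333. The function name send_file and its parameters are hypothetical,
and bytes are used so that it runs on Python 3.

```python
import io

def send_file(environ, start_response, filelike, size, block_size=8192):
    # `send_file` and its parameters are hypothetical names for this sketch.
    start_response("200 OK", [("Content-Length", str(size))])
    wrapper = environ.get("wsgi.file_wrapper")
    if wrapper is not None:
        # file_wrapper case: the gateway receives the file object itself
        # and decides how (or whether) to read it.
        return wrapper(filelike, block_size)
    # Non-file_wrapper case: the gateway only sees an opaque iterable,
    # so the data can only be obtained by reading it block by block.
    return iter(lambda: filelike.read(block_size), b"")

# Usage without a file wrapper: the whole "file" is read through the iterable.
payload = b"x" * 20000
statuses = []
result = send_file({}, lambda status, headers: statuses.append(status),
                   io.BytesIO(payload), len(payload))
data = b"".join(result)
```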

If you put all this together, you end up with the rules that I outlined
in my previous message:

> 1. WSGI gateways must always set environ["REQUEST_METHOD"] to
>    "GET" for HEAD requests. Middleware and applications will
>    not be able to detect the difference between GET and HEAD 
>    requests.
>
> 2. For a HEAD request, a WSGI gateway must not iterate
>    through the response iterable, but it must call the
>    response iterable's close() method, if any. It must not
>    send any output that was written via
>    start_response(...).write() either. Consequently,
>    WSGI applications must work correctly, and must not
>    leak resources, when their output is not iterated;
>    an application should not signal or log an error if
>    the iterable's close() method is invoked without any
>    iteration taking place.
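
A minimal sketch of a gateway core that follows these two rules might look
like the code below. It is not taken from any existing gateway: the name
run_application and the send_headers/send_body callbacks are hypothetical.
It also assumes that the application calls start_response before returning
its iterable; an application that defers that call until the first iteration
(a generator, for example) would need extra handling, which is not shown.

```python
def run_application(application, environ, send_headers, send_body):
    """Hypothetical gateway core; names and callbacks are assumptions."""
    is_head = environ.get("REQUEST_METHOD") == "HEAD"
    if is_head:
        # Rule 1: the application and middleware only ever see "GET".
        environ = dict(environ, REQUEST_METHOD="GET")

    def write(data):
        # Rule 2: output written via start_response(...).write() is
        # discarded for HEAD requests.
        if not is_head:
            send_body(data)

    def start_response(status, headers, exc_info=None):
        # Simplification: headers are passed on immediately.
        send_headers(status, headers)
        return write

    result = application(environ, start_response)
    try:
        if not is_head:
            for chunk in result:
                send_body(chunk)
        # Rule 2: for HEAD the iterable is never iterated.
    finally:
        # close() is called even when no iteration took place.
        close = getattr(result, "close", None)
        if close is not None:
            close()

# Usage with an application that never checks for HEAD itself.
seen_methods = []

def application(env, start_response):
    seen_methods.append(env["REQUEST_METHOD"])
    start_response("200 OK", [("Content-Length", "10000")])
    return [b"a" * 10000]

headers_sent, body_sent = [], []
run_application(application, {"REQUEST_METHOD": "HEAD"},
                lambda status, headers: headers_sent.append((status, headers)),
                body_sent.append)
```

With this arrangement the application always reports the same headers for
GET and HEAD, and the gateway alone decides whether the entity is sent.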

- Brian