[Web-SIG] Implementing File Upload Size Limits

Fri Nov 28 00:15:17 CET 2008

2008/11/28 Robert Brewer <fumanchu at aminus.org>:
> Brian Smith wrote:
>> Randy Syring wrote:
>> > Hopefully you can clarify something for me.  Let's assume that the
>> > client does not use '100 Continue' but sends data immediately, after
>> > sending the headers.  If the server never reads the request content,
>> > what does that mean exactly?  Does the data get transferred over the
>> > wire but then discarded or does the client not get to send the data
>> > until the server reads the request body?  I.e. the client tries to
>> > "send" it, but the content isn't actually transferred across the
>> > wire until the server reads it.  I am just wondering if there
>> > is a buffer or queue or something between the server and the client
>> > that allows data to be transferred even if the server doesn't
>> > "read" the request body.  Or, is it just like a straight pipe
>> > where one end (the client) can't push data through until the other
>> > end (the server) reads it.
>>
>> Under Apache CGI or mod_wsgi, in many situations you will get a
>> deadlock in this scenario. The input and the output are buffered
>> separately, and both of those buffers can fill up. Neither mod_wsgi
>> nor mod_cgid implements the non-blocking I/O logic needed to prevent
>> deadlocks. I heard (but did not verify) that mod_fastcgi does not
>> have this deadlocking problem. The sizes of the buffers determine
>> the size of the inputs and outputs needed to cause a deadlock. On
>> some platforms (e.g. Mac OS X), they are only 8K by default.
>>
>> Therefore, for maximum portability, a WSGI application should ALWAYS
>> consume the *whole* request body if it wants to avoid the deadlock
>> using the reference WSGI adapter in PEP 333 or mod_wsgi.
>
> Indeed. This is covered in RFC 2616 Section 8.2.3:
>
>    If an origin server receives a request that does not include an
>    Expect request-header field with the "100-continue" expectation,
>    the request includes a request body, and the server responds
>    with a final status code before reading the entire request body
>    from the transport connection, then the server SHOULD NOT close
>    the transport connection until it has read the entire request,
>    or until the client closes the connection. Otherwise, the client
>    might not reliably receive the response message. However, this
>    requirement is not be construed as preventing a server from
>    defending itself against denial-of-service attacks, or from
>    badly broken client implementations.
>
> CherryPy's wsgiserver will read any remaining request body (which the
> application hasn't read) before sending response headers.

A WSGI application could technically want to send response headers and
only then read remaining request content. I don't believe there is
anything in the WSGI specification which prevents that. If you are
discarding the request content as soon as response headers are
generated, that could technically be a problem for some use cases,
even if they may be obscure.
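
To make that use case concrete, here is a minimal sketch of an
application that hands its response headers to the server first and only
reads the request body once the server starts iterating over the result.
The names echo_length_app and run_once are hypothetical, and run_once is
only a toy stand-in for a server, not how any real WSGI server is
written.

```python
import io


def echo_length_app(environ, start_response):
    # Response headers are handed to the server first ...
    start_response("200 OK", [("Content-Type", "text/plain")])

    def body():
        # ... and the request body is only read when the server
        # begins iterating over the returned generator.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        data = environ["wsgi.input"].read(length)
        yield ("received %d bytes\n" % len(data)).encode("ascii")

    return body()


def run_once(app, payload):
    """Toy harness (an assumption for illustration): call the app once."""
    events = []
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(payload)),
        "wsgi.input": io.BytesIO(payload),
    }

    def start_response(status, headers, exc_info=None):
        events.append(("headers", status))

    chunks = list(app(environ, start_response))
    return events, b"".join(chunks)
```

A server that threw away the unread input at the moment start_response()
was called would leave such an application with nothing to read.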

I can't tell from looking at the latest CherryPy WSGI server code, as it
has changed since I last looked at it and I haven't yet had time to
grok it and run some tests. Previously, though, with respect to where
the WSGI specification says:

"""The server is not required to read past the client's specified
Content-Length, and is allowed to simulate an end-of-file condition if
the application attempts to read past that point."""

the CherryPy WSGI server code chose NOT to simulate an end-of-file
condition. This was because the amount of data read from wsgi.input
was never tracked. It meant that if an application did try to read
more content than was available, and request pipelining was occurring,
then the read would hang, as it would not get back the empty string
that a file-like object normally returns at end-of-file.

If the code is still behaving this way, then it wouldn't be possible
for it to discard the remaining input, as how much had been read
wasn't tracked.
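
As a rough sketch of what tracking would involve, the wrapper below
counts how much has been read against the declared Content-Length,
returns an empty result once that is exhausted, and can discard whatever
the application left unread. LimitedInput and its discard() method are
hypothetical names for illustration and are not CherryPy's actual code.
The sketch uses Python 3 bytes, so the "empty string" is b"".

```python
import io


class LimitedInput:
    """Hypothetical wsgi.input wrapper that tracks how much was read."""

    def __init__(self, stream, content_length):
        self.stream = stream
        self.remaining = content_length

    def read(self, size=-1):
        # Simulate end-of-file once Content-Length bytes are consumed,
        # instead of blocking on (or stealing) the next request.
        if self.remaining <= 0:
            return b""
        if size is None or size < 0 or size > self.remaining:
            size = self.remaining
        data = self.stream.read(size)
        self.remaining -= len(data)
        return data

    def discard(self, chunk_size=8192):
        # Throw away whatever the application did not read.
        while self.remaining > 0:
            if not self.read(chunk_size):
                break  # underlying stream ended early
```

With two requests back to back on the same connection, reads through the
wrapper stop at the end of the first body and leave the second request
untouched.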

Looking at the latest code I do note the presence of a wrapper around
the socket used for wsgi.input, but I haven't been able to work out yet
whether it returns a traditional empty string as the end-of-file
condition, or whether it is instead going to raise your
MaxSizeExceeded exception and thus not be file-like in its behaviour.

Can you perhaps explain what is going to happen when an attempt is
made to read more content than is available, and whether it is
actually going to raise an exception rather than just return an empty
string as file-like objects would?

Personally I think that part of the WSGI specification should be
amended so that an end-of-file condition MUST be indicated using an
empty string, just as with normal file-like objects. This one change
would mean that one could call read() with no arguments and have it
return all input, whereas at the moment the WSGI specification does
not allow the argument to read() to be optional.

This would remove the need for applications to check or use
CONTENT_LENGTH at all, except where it matters, such as for a 413
response or where how the input is processed depends on its size.
That is, to get all request content you would just call read() with
no argument. If you wanted to process it in chunks, you would just
loop reading a set chunk size until an empty string is returned,
without needing to track how much had been read or to short-read the
last chunk. If applications worked this way, one could handle
mutating input filters that change the amount of request content,
e.g. decompression of data, and could handle chunked transfer
encoding on request content in a reasonable way, without having to
read it all in and buffer it just to work out CONTENT_LENGTH.
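
The chunked loop described above could look something like the sketch
below, assuming an input stream that reliably returns an empty result at
end-of-file. iter_body and read_with_limit are hypothetical helper names,
and the size cap is only there to show where an upload limit (and a 413
response) might hook in without consulting CONTENT_LENGTH.

```python
import io


def iter_body(stream, chunk_size=8192):
    # Read fixed-size chunks until the stream signals end-of-file with
    # an empty result; no byte counting or short final read is needed.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def read_with_limit(stream, max_bytes, chunk_size=8192):
    """Return the whole body, or None once it exceeds max_bytes.

    A caller might turn None into a 413 response.
    """
    total = 0
    parts = []
    for chunk in iter_body(stream, chunk_size):
        total += len(chunk)
        if total > max_bytes:
            return None
        parts.append(chunk)
    return b"".join(parts)
```

Because the loop counts what actually arrives, it would also work where
an input filter has changed the amount of content or where the request
used chunked transfer encoding.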

Up till now, the only major WSGI server (ignoring wsgiref perhaps)
that I knew of which didn't allow read() with no argument, or which
didn't simulate end-of-file by returning an empty string, was the
CherryPy WSGI server. Now its code has been changed, but I am not
sure whether it still does that or whether it has done something
totally different from everything else by raising an exception
instead.

Graham