[Web-SIG] Implementing File Upload Size Limits

Brian Smith brian at briansmith.org
Tue Nov 25 18:03:22 CET 2008


Randy Syring wrote:
> Hopefully you can clarify something for me.  Let's assume that the
> client does not use '100 Continue' but sends data immediately, after
> sending the headers.  If the server never reads the request content,
> what does that mean exactly?  Does the data get transferred over the
> wire but then discarded or does the client not get to send the data
> until the server reads the request body?  I.e. the client tries to
> "send" it, but the content isn't actually transferred across the
> wire until the server reads it.  I am just wondering if there
> is a buffer or queue or something between the server and the client
> that allows data to be transferred even if the server doesn't
> "read" the request body.  Or, is it just like a straight pipe
> where one end (the client) can't push data through until the other
> end (the server) reads it.

Under Apache CGI or mod_wsgi, in many situations you will get a deadlock in
this scenario. The input and the output are buffered separately, and both of
those buffers can fill up. Neither mod_wsgi nor mod_cgid implements the
non-blocking I/O logic needed to prevent deadlocks. I have heard (but have
not verified) that mod_fastcgi does not have this deadlocking problem. The
sizes of the buffers determine how large the input and output need to be to
cause a deadlock; on some platforms (e.g. Mac OS X) they are only 8K by
default.

Therefore, for maximum portability, a WSGI application should ALWAYS consume
the *whole* request body if it wants to avoid the deadlock when using the
reference WSGI adapter in PEP 333 or mod_wsgi.
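
Something along these lines works with any PEP 333 gateway (a rough sketch;
the helper name and the 64K chunk size are my own choices, not anything the
spec requires):

def drain_request_body(environ):
    # Read and discard whatever the client sent, so the gateway's
    # input buffer cannot fill up and wedge the connection.
    try:
        remaining = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        remaining = 0
    stream = environ['wsgi.input']
    while remaining > 0:
        chunk = stream.read(min(remaining, 65536))
        if not chunk:
            break
        remaining -= len(chunk)

def application(environ, start_response):
    drain_request_body(environ)
    start_response('413 Request Entity Too Large',
                   [('Content-Type', 'text/plain')])
    return ['Upload exceeds the configured size limit.\n']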

Probably other WSGI gateways have similar issues. It would be nice if there
were a standard entry in the WSGI environment (e.g.
"wsgi.may_ignore_request_body") that could be used to safely detect when we
can skip the request body. It would be even nicer if WSGI gateways were
updated to avoid this problem. However, that is easier said than done.
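
If such a key existed, an application could check it before bothering to
drain the body. This is purely hypothetical, since no shipping gateway sets
it, so the safe default has to be to drain anyway:

def refuse_upload(environ, start_response):
    # 'wsgi.may_ignore_request_body' is only the key proposed above;
    # until gateways actually set it, assume the unsafe case and drain.
    if not environ.get('wsgi.may_ignore_request_body', False):
        drain_request_body(environ)   # helper from the previous sketch
    start_response('413 Request Entity Too Large',
                   [('Content-Type', 'text/plain')])
    return ['Upload exceeds the configured size limit.\n']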

If you know C, it is relatively simple to modify mod_wsgi to use a different
Apache<->daemon communication protocol so that daemon mode works as you
would expect (no deadlocks, proper 100-continue support, and the request
body isn't read unless your application asks for it). A long time ago I had
a patch that did this (among other things), but I don't think I have it any
more.

However, once you get to that point, you still run into problems. If your
goal is to avoid reading the request body, then you need to close the
connection in your error response; otherwise, if the request was an HTTP/1.1
request, you still need to read the entire request body in order to process
any requests that follow it in the request pipeline. Unfortunately, a WSGI
application doesn't have any way of signaling that the connection is to be
closed; the WSGI specification forbids the WSGI application from returning
the Connection header since it is hop-by-hop. And even if there were such a
mechanism, a poorly coded client is likely to cause a deadlock anyway if the
server doesn't read its full request. Make sure you test with all your
targeted browsers.
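
To make the constraint concrete: the obvious workaround below is exactly
what the spec rules out, because Connection is a hop-by-hop header
(wsgiref's validator, for example, treats it as a fatal error):

def application(environ, start_response):
    start_response('413 Request Entity Too Large',
                   [('Content-Type', 'text/plain'),
                    ('Connection', 'close')])   # forbidden: hop-by-hop header
    return ['Too large.\n']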

Consequently...

> > If you are using daemon mode however,
> > then the request content would always be read by Apache child worker
> > process, even if client asked for '100 Continue' response. This is
> > because the Apache child worker process will always proxy request
> > content to the daemon process.
> >
> That's good to know.  I think at this point I have talked myself into
> thinking that there is no good reason to handle it at the application
> level, but would appreciate any further feedback you might have.

...if your users will often attempt to upload files that exceed your
limits, it is best to mitigate the problem on the client side. First,
document the file size limit clearly on the page where the upload happens.
Second, implement a Flash-based and/or Java-based file upload control that
is used when the user has Flash (or Java) installed, falling back to the
regular control otherwise. With such an uploader, you can check the file
size on the client and prevent these requests from even being made (in the
typical case). You will still have to implement the validation logic on the
server to guard against malicious use and disabled JavaScript/Flash/Java
(see the sketch below). There are additional benefits to this approach
(better UI, multi-file selection, compression, encryption, it doesn't waste
the user's time, and it saves bandwidth), but it comes with all the
drawbacks inherent in Flash/Java/JavaScript.
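
For the server-side check, something like this is enough (MAX_UPLOAD_BYTES
and real_application stand in for your own limit and handler; refuse_upload
is the sketch from earlier in this message):

MAX_UPLOAD_BYTES = 10 * 1024 * 1024   # illustrative limit, pick your own

def application(environ, start_response):
    try:
        declared = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        declared = 0
    if declared > MAX_UPLOAD_BYTES:
        # Too big: refuse, but still drain the body (see above) so the
        # connection doesn't deadlock under Apache CGI/mod_wsgi.
        return refuse_upload(environ, start_response)
    return real_application(environ, start_response)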

Regards,
Brian



