[Web-SIG] Implementing File Upload Size Limits

Graham Dumpleton graham.dumpleton at gmail.com
Tue Nov 25 23:59:10 CET 2008


2008/11/26 Brian Smith <brian at briansmith.org>:
> Randy Syring wrote:
>> Hopefully you can clarify something for me.  Lets assume that the
>> client does not use '100 Continue' but sends data immediately, after
>> sending the headers.  If the server never reads the request content,
>> what does that mean exactly?  Does the data get transferred over the
>> wire but then discarded or does the client not get to send the data
>> until the server reads the request body?  I.e. the client tries to
>> "send" it, but the content isn't actually transferred across the
>> wire until the server reads it.  I am just wondering if there
>> is a buffer or queue or something between the server and the client
>> that allows data to be transferred even if the server doesn't
>> "read" the request body.  Or, is it just like a straight pipe
>> where one end (the client) can't push data through until the other
>> end (the server) reads it.
>
> Under Apache CGI or mod_wsgi, in many situations you will get a deadlock in
> this scenario. The input and the output are buffered separately and both of
> those buffers can fill up.

It isn't 'many situations', it is a quite specific situation.

The issue applies only to mod_wsgi daemon mode and only occurs where
the size of the request content body is larger than the UNIX socket
buffer size for that platform and the WSGI application doesn't
consume all of the request body. At the same time, the WSGI
application would then have to return a set of response headers and
response body which combined are also larger than the UNIX socket
buffer size for that platform.
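
To illustrate (my own sketch, nothing from mod_wsgi itself), the sort
of application that can hit this under daemon mode is one which never
reads a large request body, yet returns a response that is itself
bigger than the socket buffer:

  def application(environ, start_response):
      # Request body is deliberately never read from environ['wsgi.input'].
      body = 'x' * (1024 * 1024)   # response larger than the socket buffer
      start_response('413 Request Entity Too Large',
                     [('Content-Type', 'text/plain'),
                      ('Content-Length', str(len(body)))])
      return [body]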

> Neither mod_wsgi nor mod_cgid implement the
> non-blocking I/O logic needed to prevent deadlocks.

Both mod_wsgi and mod_cgi do have timeouts, so a permanent deadlock
situation at least doesn't arise. This is based on the standard
Apache Timeout directive. AFAIK mod_cgid still has a bug whereby it
doesn't detect the timeout, so it is possibly an easy way to DOS an
Apache server.
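
For reference, that is just the normal httpd.conf setting, e.g.:

  # Maximum time Apache will wait on a blocked read or write; this is
  # what eventually breaks the deadlock described above.
  Timeout 60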

As far as changing how mod_wsgi works, there exists the issue:

  http://code.google.com/p/modwsgi/issues/detail?id=56

It is low priority though, as no one has been reporting it as a
problem in actual use. Scenarios where it technically might be
triggered would generally be SPAM bots trying to POST large amounts
of data to arbitrary URLs. If an application is functioning as
intended, the situation shouldn't really arise, as POST requests
should be getting directed at URLs which will consume the request
body.
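
As an illustration only (my own sketch, not anything from mod_wsgi),
an application which wants to reject oversize uploads, but still avoid
leaving unread request content on the socket, might do something like:

  LIMIT = 10 * 1024 * 1024   # hypothetical 10MB upload limit

  def application(environ, start_response):
      length = int(environ.get('CONTENT_LENGTH') or 0)
      stream = environ['wsgi.input']
      if length > LIMIT:
          # Consume and discard the rest of the request body so the
          # connection isn't left blocked on unread content.
          remaining = length
          while remaining > 0:
              chunk = stream.read(min(65536, remaining))
              if not chunk:
                  break
              remaining -= len(chunk)
          start_response('413 Request Entity Too Large',
                         [('Content-Type', 'text/plain')])
          return ['Upload too large\n']
      data = stream.read(length)
      start_response('200 OK', [('Content-Type', 'text/plain')])
      return ['Received %d bytes\n' % len(data)]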

That issue also references the IIS+CGI issue someone else mentioned:

  http://www.doxdesk.com/updates/2006.html#u20060416-cgi

FWIW, mod_scgi also has the same problem and it doesn't implement
timeouts, so it can suffer a permanent deadlock.

> I heard (but did not
> verify) that mod_fastcgi does not have this deadlocking problem. The sizes
> of the buffers determine the size of the inputs and outputs needed to cause
> a deadlock. On some platforms (e.g. Mac OS X), they are only 8K by default.

Mac OS X is the only system I know of that has small default UNIX
socket buffer sizes. The small buffer size only applies to UNIX
sockets; for INET sockets it is much larger. Since mod_fastcgi
predominantly uses INET sockets, if there is an issue it may not be
obvious, as you would need to be returning a very large response.
From what I remember when I looked at mod_fastcgi and mod_proxy, for
certain types of operations they both try to force all request
content down the socket before trying to read the response. Thus, I
am not convinced that the problem couldn't actually occur for both of
these as well, but since the INET socket buffer size is much larger,
it isn't generally triggered.
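
If you want to check what the defaults actually are on your platform,
a quick snippet (purely illustrative) is:

  import socket

  # Compare the default send buffer sizes for UNIX vs INET stream sockets.
  for family, name in ((socket.AF_UNIX, 'UNIX'), (socket.AF_INET, 'INET')):
      s = socket.socket(family, socket.SOCK_STREAM)
      print name, s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
      s.close()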

To work around the UNIX socket buffer size with mod_wsgi, there are
options which can be supplied to WSGIDaemonProcess to change the UNIX
socket buffer sizes used to something more sensible.
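
For example, something along these lines (option names as per the
mod_wsgi documentation; check the version you are running):

  # Bump the buffer sizes used for the UNIX socket between the Apache
  # child processes and the mod_wsgi daemon processes.
  WSGIDaemonProcess myapp processes=2 threads=15 \
      send-buffer-size=131072 receive-buffer-size=131072
  WSGIProcessGroup myapp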

> Therefore, for maximum portability, a WSGI application should ALWAYS consume
> the *whole* request body if it wants to avoid the deadlock using the
> reference WSGI adapter in PEP 333 or mod_wsgi.
>
> Probably other WSGI gateways have similar issues. It would be nice if there
> was a standard entry in the WSGI environment (e.g.
> "wsgi.may_ignore_request_body") that could be used to safely detect when we
> can skip the request body. It would be even nicer if WSGI gateways were
> updated to avoid this problem. However, that is easier said than done.
>
> If you know C, it is relatively simple to modify mod_wsgi to use a different
> Apache<->daemon communication protocol so that the daemon mode works as you
> would expect (no deadlocks, proper 100-continue support, request body isn't
> read unless your application asks for it). A long time ago I had a patch
> that did this (among other things) but I don't think I have it any more.

That depends on your definition of simple. It would either be quite
fiddly to do and get right, or one would have to rewrite a large
amount of code. I wouldn't regard either as really that simple.

> However, once you get to that point, you still run into problems. If your
> goal is to avoid reading the request body, then you need to close the
> connection in your error response; Otherwise, if the request was a HTTP/1.1
> request, you still need to read the entire request body in order to process
> any requests that follow it in the request pipeline. Unfortunately, a WSGI
> application doesn't have any way of signaling that the connection is to be
> closed; the WSGI specification forbids the WSGI application from returning
> the Connection header since it is hop-by-hop. And, even if there was such a
> mechanism, a poorly-coded client is likely to still cause a deadlock if the
> server doesn't read its full request. Make sure you test with all your
> targeted browsers.

Apache, and I would expect any sensible web server, always closes the
client connection when an error response is returned. Thus it will
only allow request pipelining so long as a 200 response is returned.
Okay, it isn't quite this simple, as Apache looks at lots of other
things as well, but that is close enough.

The WSGI specification may forbid returning the Connection header,
but if you do so with mod_wsgi, then Apache will note it and close
the connection even if a 200 response is returned.
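
For what it's worth, that looks roughly like the following (my sketch,
and again, strictly speaking the spec says not to do this):

  def application(environ, start_response):
      # Connection is hop-by-hop, so this is non-portable, but under
      # mod_wsgi Apache will notice it and close the connection.
      start_response('413 Request Entity Too Large',
                     [('Content-Type', 'text/plain'),
                      ('Connection', 'close')])
      return ['Upload too large\n']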

Graham

> Consequently...
>
>> > If you are using daemon mode however,
>> > then the request content would always be read by Apache child worker
>> > process, even if client asked for '100 Continue' response. This is
>> > because the Apache child worker process will always proxy request
>> > content to the daemon process.
>> >
>> That's good to know.  I think at this point I have talked myself into
>> thinking that there is no good reason to handle it at the application
>> level, but would appreciate any further feedback you might have.
>
> ...if your users will often attempt to upload large files exceeding your
> limits, it is best to mitigate the problem on the client-side. First,
> document the file size limit clearly on the page where the upload happens.
> Secondly, implement a flash-based and/or java-based file upload control that
> can be used when the user has Flash installed (fall back to the regular
> control otherwise). With such an uploader, you can check the file size on
> the client and prevent these requests from even being made (in the typical
> case). You will still have to implement the validation logic on the server
> to prevent malicious use and/or disabled Javascript/Flash/Java. There are
> additional benefits to this approach (better UI, multi-file selection,
> compression, encryption, doesn't waste the user's time, saves bandwidth) but
> it comes with all the drawbacks inherent with Flash/Java/Javascript.
>
> Regards,
> Brian
>
>
>

