[SciPy-dev] Scipy server suffering again...

Thu Jan 10 15:35:14 EST 2008

Hello,

I believe I have made some good short-term improvements regarding 
the load on the SciPy.org server.  It should be much peppier now, and I 
will continue to monitor the effect of the workaround.  If you don't 
care about the details, stop reading here.  For those who do care, the 
following technical details explain the problem, the short-term 
workaround, and why the long-term fix is best.

The main problem is that the system was starved for memory and swapping 
excessively.  The high load combined with low CPU usage is a result of 
processes waiting on free memory.  The "high memory usage"(1) is a basic 
result of using garbage-collected daemon processes that don't get 
restarted regularly, in our case FastCGI Python daemons.  In particular, 
one virtual host uses "Zope" with a large amount of content and was 
using 45% of the memory alone.  I restarted this FastCGI daemon and it 
went down to 10% memory usage, though as I write this it's just under 
20%.  Over time it'll go back up to around 40%, but I expect it to 
stabilize around that level and go no higher as it reaches a stable 
heap size.
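
For anyone who wants to see which daemons hold the most memory on a 
box like this, here is a minimal sketch.  It assumes a Linux host 
where "ps -eo pid,pmem,rss,args" is available and only parses text in 
that column layout.  The function name, the sample process names, and 
the sample numbers are made up for illustration; they are not taken 
from the actual server.

```python
def top_memory_consumers(ps_output, n=3):
    """Return the n rows with the highest %MEM from ps-style text.

    Assumes the columns are: PID, %MEM, RSS (in KiB), COMMAND, with a
    single header line, as printed by `ps -eo pid,pmem,rss,args`.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header
        parts = line.split(None, 3)  # the command may contain spaces
        if len(parts) < 4:
            continue  # ignore malformed or empty lines
        pid, pmem, rss_kb, command = parts
        rows.append((float(pmem), int(pid), int(rss_kb), command))
    rows.sort(reverse=True)  # highest %MEM first
    return [(pid, pmem, rss_kb, cmd) for pmem, pid, rss_kb, cmd in rows[:n]]


# Hypothetical sample output; the values are invented for the example.
SAMPLE = """\
  PID %MEM    RSS COMMAND
 1201 45.0 921600 python zope-fcgi
 1310  8.5 174080 python trac-fcgi
 1422  6.1 124928 python moin-fcgi
  801  1.2  24576 httpd
"""

if __name__ == "__main__":
    for pid, pmem, rss_kb, cmd in top_memory_consumers(SAMPLE):
        print(f"{pid:>6} {pmem:5.1f}% {rss_kb // 1024:>5} MiB  {cmd}")
```

On a live system the same function could be fed the captured output 
of the ps command instead of the sample string.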

What's happening is largely the result of heap memory 
fragmentation(2).  Garbage-collected languages tend to fragment their 
heap more than non-garbage-collected languages, but in both cases there 
is expected to be a critical heap size threshold that, once reached, 
will satisfy all incoming and outgoing heap requests without the heap 
having to grow further.  This is different from a memory leak, where 
memory consumption grows indefinitely.  Where this threshold lies 
depends on a variety of factors such as typical workload, but it can be 
measured empirically.  The long-term solution is therefore to migrate 
to hardware that has enough memory to hold stable-sized heaps for all 
the Python daemons, but this will take a lot of time, effort, and 
testing, so it's weeks out.  The short-term solution is to periodically 
restart the Python daemon processes before they reach maximum heap 
fragmentation.  However, restarting the daemons severs any existing 
connections users may have and will likely erase any session state that 
isn't stored in their local web browser, so it is not a desirable 
long-term solution.
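
The difference between a heap that settles at a stable size and a 
genuine leak can be checked empirically from periodic memory samples.  
The sketch below is one simple way to do that, not the method actually 
used on the server: it treats a process as "plateaued" when its most 
recent samples vary by less than a small fraction.  The function name, 
the window size, and the tolerance are all assumptions picked for 
illustration.

```python
def has_plateaued(samples_mb, window=5, tolerance=0.02):
    """Return True if the last `window` samples look like a stable heap.

    samples_mb: memory readings (e.g. resident size in MB), oldest first.
    The heap is considered stable when the spread of the most recent
    `window` readings is below `tolerance` as a fraction of the first
    reading in that window.  Both defaults are arbitrary assumptions.
    """
    if len(samples_mb) < window:
        return False  # not enough data to judge
    recent = samples_mb[-window:]
    if recent[0] <= 0:
        return False  # avoid dividing by zero on bogus readings
    spread = max(recent) - min(recent)
    return spread / recent[0] < tolerance


if __name__ == "__main__":
    # Invented readings: one series levels off, the other keeps growing.
    stable = [100, 200, 300, 390, 398, 400, 401, 400, 402]
    leaking = [100, 150, 200, 250, 300, 350]
    print(has_plateaued(stable), has_plateaued(leaking))
```

A series that keeps failing this check as more samples arrive points 
toward a leak rather than fragmentation.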

Right now I'm measuring how fast memory gets fragmented so I can 
determine the maximum interval to use in a script that restarts these 
processes automatically.  For example, I may only need to restart them 
once every few days instead of once per day, which would minimize 
severed connections.
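
As a rough illustration of how such an interval could be derived from 
measurements, the sketch below fits a straight line to (hours since 
restart, memory percentage) samples and extrapolates to a chosen 
limit.  The linear growth model, the function name, and the sample 
numbers are assumptions for the example.  Real growth is expected to 
flatten out as the heap stabilizes, so treat the result as a rough 
estimate only.

```python
def hours_until_limit(samples, limit_pct):
    """Estimate hours after a restart at which memory reaches limit_pct.

    samples: list of (hours_since_restart, memory_pct) pairs.
    Uses an ordinary least-squares line through the samples.  Returns
    None if memory is not growing, since the limit is then never
    reached under this model.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_m - slope * mean_t
    return (limit_pct - intercept) / slope


if __name__ == "__main__":
    # Invented readings taken 0, 2, and 4 hours after a restart.
    readings = [(0, 10.0), (2, 14.0), (4, 18.0)]
    print(hours_until_limit(readings, limit_pct=40.0))
```

The restart interval would then be set somewhat below the estimate to 
leave a safety margin.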

(1) High is a relative term here.  On the scale of modern servers it's 
not that high, but it's high for this particular hardware.
(2) http://en.wikipedia.org/wiki/Heap_fragmentation#External_fragmentation 
    (basic introduction to heap fragmentation)

Continue to let me know if you have problems.  Conversely, let me know 
if you're having fewer problems than you've had recently.  Both are 
good to know.

Cheers,

J. Ryan Earl
IT Administrator
Enthought, Inc.
512.536.1057

Fernando Perez wrote:
> Howdy,
>
> I keep on getting, frequently, the by now familiar
>
> """Internal Server Error
>
> The server encountered an internal error or misconfiguration and was
> unable to complete your request.
> """
>
> so doing anything around the site, using trac, moin, etc, is becoming
> rather difficult.  I just noticed a load average on the box around 16,
> though no process is consuming any significant amount of CPU.
>
> If there's anything on our side (the individual project admins) we can
> do to help, please let us know.
>
> Cheers,
>
> f
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-dev
>
>