tuning our site (was: help! advocacy resources needed fast)

Geoff Gerrietts geoff at gerrietts.net
Tue Mar 11 15:23:51 EST 2003


Quoting Kyler Laird (Kyler at news.Lairds.org):
> Geoff Gerrietts <geoff at gerrietts.net> writes:
> 
> >We have spent a year refactoring key components, and building
> >caching solutions to minimize the impact of load.
> 
> I realize you're trying to solve your problems now, but I'd enjoy
> hearing more about this.

I promised I'd say something more about this when the immediate crisis
passed, and now I will.

Our year has not been spent entirely on performance tweaks. The
developer giveth and the developer taketh away, just as
management before him: we have held to an aggressive schedule of new
features and feature revisions while simultaneously devoting
considerable resources to "optimizing" our site.

Now, there's a lot that's atypical about our architecture, most of
which I won't go into. We're on an old version of Zope with lots of
little patches. We're on an old version of Python with a couple little
patches. And we're using ILU to distribute our app. All of these
things are admittedly suboptimal, but it's been hard to make a
business case for upgrading, because performance gains aren't
expected. (That attitude is changing now that it's getting very
difficult to install the old software on new hardware.)

In the meantime, we've tried a lot of things, and we have other things
on our plates to try. I'll list a few of them now.

- When we passed 12 Zope boxes, we put 2 ZEO servers into play and
  split the load between them. That did mean we had two places to
  update, but it reduced the load on the ZEO server considerably, and
  brought overall performance up, especially right after Zope was
  cycled. (There's a sketch of the client-side config after this
  list.)

- We identified certain static content pages that we could extract
  and cache directly on the filesystem (second sketch after the
  list).

- We pulled all our images out of Zope and put them on separate image
  servers. We rewrote our Image product so that, depending on the
  environment, we retrieve images either directly from Zope (during
  development) or from our image server (third sketch after the
  list). The solution generalizes, but the code does not.

- In some cases, photos were being uploaded into our Oracle database.
  For these cases, we rewrote things so that we maintain an image
  cache on the image servers and use Apache's mod_rewrite to fetch
  the original image from Zope when it isn't in the cache (fourth
  sketch after the list).

- Later, we cooked up a similar solution for dynamically-rendered
  (co-branded) pages with mostly static content. We could configurably
  list base URLs, then cache them on a per-partner basis.

- We replaced the asyncore module with a C extension that provided the
  same functionality.

- We systematically tuned the boxes Zope was running on, paring the
  running processes down to a bare minimum.

- We stopped using Zope's built-in monitoring, because the more often
  we hit the monitoring, the more likely it was Zope would hang or
  crash. This would probably be alleviated by updating Zope, but I'm
  not sure.
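
For the curious, the ZEO split is less exotic than it sounds. Here's
a rough sketch of the client side; the host names, port, and the
ZEO_CLUSTER environment variable are invented for illustration, not
our actual config. In Zope 2.x, a custom_zodb.py in the instance home
that binds a ClientStorage to the name "Storage" is all it takes to
make an instance a ZEO client:

    # custom_zodb.py -- hypothetical sketch: make this Zope instance a
    # client of one of two ZEO servers, chosen by an environment
    # variable. Hosts, port, and ZEO_CLUSTER are invented here.
    import os
    from ZEO.ClientStorage import ClientStorage

    ZEO_SERVERS = {
        'a': ('zeo-a.example.com', 9999),
        'b': ('zeo-b.example.com', 9999),
    }
    addr = ZEO_SERVERS[os.environ.get('ZEO_CLUSTER', 'a')]

    # Zope looks for a module-level name "Storage" in custom_zodb.py.
    Storage = ClientStorage(addr)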
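
The static-page extraction is the simplest of the bunch: render each
page once and write it where Apache can serve it without touching
Zope. A toy version, with invented URLs and paths:

    # Hypothetical sketch of the static-page extraction. The page
    # list, backend host, and cache root are invented for illustration.
    import os
    import urllib

    CACHE_ROOT = '/var/www/static-cache'
    STATIC_PAGES = ['/about/index.html', '/help/faq.html']

    for path in STATIC_PAGES:
        # render once through the Zope backend
        html = urllib.urlopen('http://zope-backend:8080' + path).read()
        target = os.path.join(CACHE_ROOT, path[1:])  # drop leading '/'
        if not os.path.isdir(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        open(target, 'w').write(html)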
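
The Image product rewrite mostly boils down to one URL-generating
method that consults the environment. Another sketch; the
IMAGE_SERVER setting and the paths are made up:

    # Hypothetical sketch of the environment-dependent image URLs.
    # In development there's no image server, so we fall back to Zope.
    import os

    IMAGE_SERVER = os.environ.get('IMAGE_SERVER')  # e.g. 'http://img1.example.com'

    def image_url(image_id, zope_path='/images'):
        """Return the URL clients should use for this image."""
        if IMAGE_SERVER:
            # production: serve from the dedicated image boxes
            return '%s/%s' % (IMAGE_SERVER, image_id)
        # development: serve straight out of Zope
        return '%s/%s' % (zope_path, image_id)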
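
And the photo-cache fallback looks something like this on the Apache
side. The paths and the getPhoto method are invented, and the real
rules are hairier, but the shape is: if the file is in the cache,
serve it; if not, proxy back to Zope, which writes it into the cache
on the way out:

    # Hypothetical Apache sketch of the mod_rewrite photo cache.
    RewriteEngine On
    # Serve straight from the filesystem cache when the file exists.
    RewriteCond /var/www/photo-cache/$1 -f
    RewriteRule ^/photos/(.*)$ /var/www/photo-cache/$1 [L]
    # Otherwise proxy to Zope (needs mod_proxy); Zope fills the cache.
    RewriteRule ^/photos/(.*)$ http://zope-backend:8080/getPhoto/$1 [P,L]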

Things we still have yet to try:

- Put ZEO's cache on a ramdisk instead of the hard disk. Zope
  responds much worse to load late in the day than it does early in
  the day, just after the log roll and cache clear. We suspect this
  is because searching the 20+MB ZEO cache for content hits the disk
  pretty hard. (There's a sketch of the change after this list.)

- Roll logs more often. The logs get big, and appending to big files
  takes longer.

- Replace pieces of DTML with Python. We're considering several ways
  of going at this, from using Cheetah to hand-coding the Python.
  It's a little tricky because we have web-designer types who do all
  the HTML tuning and maintenance, and we have Python guys, and
  there's limited overlap between the two sets of expertise. (A small
  example of the kind of replacement we mean follows the list.)

- Tweak number of Zope threads to see if we can manipulate
  performance. This isn't expected to go much of anywhere.

- Try running multiple Zope instances on boxes with multiple CPUs, to
  see if we can sidestep the GIL limitation on scaling up with
  multiple CPUs.

- Try replacing Apache with Squid and compare performance.

- Rip enough of Zope's guts apart that we can do real profiling
  again, and identify the expensive operations. Rewrite what we can
  in C, and try to find workarounds for what we can't. (A profiling
  sketch follows the list.)
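
On the ramdisk idea: the ClientStorage constructor takes a var
argument saying where to keep its persistent cache files, so in
principle it's a tmpfs mount plus a one-line change. I'm going from
memory on the signature, so treat this as a sketch and check your
ZEO version; the host and mount point are invented:

    # Hypothetical sketch: keep ZEO's client cache files on a ramdisk.
    # Assumes the ZEO 1.x/2.x ClientStorage signature, where 'client'
    # turns on persistent cache files and 'var' says where to put them.
    from ZEO.ClientStorage import ClientStorage

    Storage = ClientStorage(('zeo-a.example.com', 9999),
                            client='zope1',      # persistent cache prefix
                            var='/mnt/ramdisk')  # tmpfs, so lookups skip the disk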
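
To make the DTML-to-Python item concrete: the targets are the
loop-heavy templates, where a Python Script sidesteps the DTML
interpreter entirely. A made-up example (getProducts is an invented
catalog query; absolute_url and title_or_id are stock Zope):

    ## Script (Python) "render_product_list"
    # Hypothetical sketch of replacing a <dtml-in> loop with a Python
    # Script. context.getProducts() is an invented query method.
    rows = []
    for product in context.getProducts():
        rows.append('<li><a href="%s">%s</a></li>'
                    % (product.absolute_url(), product.title_or_id()))
    return '<ul>\n%s\n</ul>' % '\n'.join(rows)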
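
The profiling item is less mysterious than "rip the guts apart"
sounds: once we can isolate a callable code path, the standard
library does the rest. Here handle_request is a stand-in for whatever
entry point we manage to expose:

    # Hypothetical sketch of profiling an isolated code path with the
    # standard library. handle_request() is a stand-in entry point.
    import profile
    import pstats

    profile.run('handle_request()', '/tmp/zope.prof')

    stats = pstats.Stats('/tmp/zope.prof')
    stats.sort_stats('cumulative')
    stats.print_stats(25)   # top 25 calls by cumulative time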

We have other ideas, but I don't remember what they are. Most aren't
terribly involved technically; they're more like tuning knobs. We've
also tried a lot of things to see how they'd play out; some made no
appreciable difference, so we didn't bother rolling them out. And
I'm pretty sure we did a bunch of things that I either didn't hear
about, or have forgotten.

We did make a pretty big dent in our performance problems with some of
these. In particular, caching pages that could be served statically,
and getting a commitment from the product group that the most critical
of those pages could remain static, made the biggest difference.

All simple stuff, the stuff you try first. The hard stuff, that's
what's left.

--G.

-- 
Geoff Gerrietts             "I am always doing that which I can not do, 
<geoff at gerrietts net>     in order that I may learn how to do it." 
http://www.gerrietts.net                    --Pablo Picasso




