[Mailman-Developers] The Hylton Archiver Band-aid

Barry A. Warsaw <bwarsaw@cnri.reston.va.us>
Sat, 21 Aug 1999 01:50:21 -0400 (EDT)


I've just checked in a slew of changes affecting the archiver,
hopefully fixing the more serious of the nasty performance problems
we've been seeing.  Jeremy Hylton deserves a lot of credit for his
excellent analysis and patching of the code.  He called his changes "a 
band-aid" because it's clear the archiver could still use more
improvement, and in fact should go through a simplification and
partial rewrite.  We don't have the time for that now, so we'll go
with the Hylton Band-aid.

I hope I can accurately outline the changes, but it's late and I'm
tired so I might miss something.  Please check the CVS log messages
for details.  I really hope some of you adventurous types will try
running with these new changes.  I intend to put them up on python.org
(perhaps this weekend, or Monday) to see if it fixes the performance
problems I've been seeing there.

First of all, Jeremy noticed that the way the archiver's .txt.gz file
is created is very inefficient.  It essentially reads the .txt and
uses Python's gzip module to write the .txt.gz file -- for /every/
message that gets posted!  This is a lot of work for not much
benefit.  So the first change is that the .gz file is not created by
default (see mm_cfg.GZIP_ARCHIVE_TXT_FILES).  We need to add a
crontab entry that gzips the file nightly.  Yes, this means that the
.txt.gz file will lag behind the on-line archive, but that should be a 
worthwhile tradeoff for better performance.
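To make the tradeoff concrete, here is a minimal sketch of the two approaches (hypothetical function names, not Mailman's actual code):

```python
import gzip
import shutil

def regzip_whole_file(txt_path):
    """Re-read the whole .txt and rewrite the .txt.gz from scratch."""
    with open(txt_path, "rb") as fin, \
            gzip.open(txt_path + ".gz", "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Inefficient pattern (roughly what the archiver was doing): call
# regzip_whole_file() after *every* posted message, so each post pays
# for recompressing the entire archive.

# Cheaper pattern: per message, just append plain text...
def append_message(txt_path, message_text):
    with open(txt_path, "a") as f:
        f.write(message_text)

# ...and let a nightly cron job do one compression pass per day.
def nightly_gzip(txt_path):
    regzip_whole_file(txt_path)
```

With N posts a day, the first pattern compresses the archive N times; the second compresses it once, at the cost of the .txt.gz lagging up to a day behind.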

Jeremy also changed the DumbBTree implementation so that the keys are
not sorted until some client starts to iterate through the data
structure.  This saves work when adding the initial elements.
Remember that a while back I added a clear() method to do a bulk clear
of the items; that saved a lot of work too.

Finally, Jeremy made some observations about the cache in the
HyperDatabase.  He says that since it traverses the elements in linear 
order, the lack of locality of reference means that evicting items
from the cache doesn't really help, and in fact might hurt
performance.  So we now keep all the items in the cache (trading space
for time).  It might be worthwhile to get rid of the cache altogether, 
although it does serve a useful purpose currently.  The DumbBTree is
essentially a dictionary of pickles, and this whole structure is then
marshaled.  The cache holds on to the unpickled objects.  It might
make sense then to make the DumbBTree a simple dictionary and just
pickle it directly.  Then the cache wouldn't be needed.
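In rough outline, the two shapes look like this (hypothetical helper names, not the HyperDatabase code):

```python
import pickle

# Current shape (roughly): each value is stored individually pickled,
# so every read means an unpickle; the cache exists to hold the
# already-unpickled objects.
def get_with_cache(store, cache, key):
    """store maps key -> pickle bytes; cache maps key -> live object."""
    if key not in cache:
        cache[key] = pickle.loads(store[key])
    return cache[key]

# Proposed shape: a plain dict of live objects, pickled as a whole.
# Reads become direct dict lookups, so no cache is needed.
def save(db, path):
    with open(path, "wb") as f:
        pickle.dump(db, f)

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

The proposed shape trades per-item granularity for simplicity: the whole database is read and written in one step, and in-memory access is just dictionary access.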

Jeremy has some other ideas about how to improve the archiver.  I'm
way too tired to outline them here.  Jeremy will be out of the office
for a week, so hopefully he'll be able to restore enough state when he 
gets back to post his ideas.

G'night, :)
-Barry