[spambayes-dev] Re: [Spambayes] fatal error?

Skip Montanaro skip at pobox.com
Tue Aug 26 13:32:38 EDT 2003


    >> I'll check with Sleepycat, but it seems to me that the most expedient
    >> course would be to acquire a lock around database accesses.

    Tim> Brrrr.  Running a Berkeley backend is already soooooo much slower
    Tim> than running from a dict.  I didn't really notice that until the
    Tim> SoBig worm turds started swamping my inbox, but after a few days
    Tim> of that I switched back to using a pickled dict.  Adding a lock
    Tim> around each stinkin' access is a good way to soak up excess cycles,
    Tim> anyway <wink>.

I suspect that the Outlook plugin simply makes it easier to find problems
(more users, more worm mail, more concurrent threads, whatever).  I think
the same (or a similar) problem would exist were two instances of
hammiefilter running at the same time, both trying to update the file.  I'm
just fortunate enough to have never encountered that problem.  Even using a
pickle, you really ought to use some sort of lock protocol when reading or
writing the pickle file if there's any chance of concurrent access by
another process or thread.  That you only read it at the beginning and write
it at the end only limits the opportunity for collision.
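
As a rough illustration of the kind of lock protocol meant here, the sketch
below wraps pickle reads and writes in an advisory fcntl.flock lock on a
sidecar file.  It is not SpamBayes code: the helper names (locked, load_db,
save_db, update_db) and the ".lock" and ".tmp" suffixes are made up for the
example, it is written for Python 3, and it assumes a Unix platform where
every process touching the file goes through the same helpers.

```python
import fcntl
import os
import pickle
from contextlib import contextmanager


@contextmanager
def locked(lock_path, exclusive):
    """Hold an advisory flock on lock_path for the duration of the block.

    Advisory means it only protects against other code that also takes
    this lock; it does not stop an uncooperative process.
    """
    with open(lock_path, "a") as lock_file:
        mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        fcntl.flock(lock_file, mode)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def _read(path):
    # Caller must already hold the lock.
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}


def _write(path, db):
    # Caller must already hold the exclusive lock.
    tmp_path = path + ".tmp"  # hypothetical naming convention
    with open(tmp_path, "wb") as f:
        pickle.dump(db, f, pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_db(path):
    """Read the pickle under a shared lock (many readers may overlap)."""
    with locked(path + ".lock", exclusive=False):
        return _read(path)


def save_db(path, db):
    """Write the pickle under an exclusive lock."""
    with locked(path + ".lock", exclusive=True):
        _write(path, db)


def update_db(path, mutate):
    """Read, modify, and write back while holding one exclusive lock."""
    with locked(path + ".lock", exclusive=True):
        db = _read(path)
        mutate(db)
        _write(path, db)
        return db
```

Locking only the individual read and the individual write still lets two
processes overwrite each other's updates, which is why update_db holds the
exclusive lock across the whole cycle.  The locks are not reentrant, so
these helpers should not be nested inside one another.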

I just (re)ran a little experiment.  (I'm sure we've done this in the past.)
I took my current hammie.db (153685 keys, no hapaxes, the result of
processing 11,000+ hams and 8,000+ spams) and converted it to a pickle using
dbExpImp.  Startup time is dramatically different:

    % time python -c 'import pickle ; db = pickle.load(open("hammie.pck"))'

    real    0m32.193s
    user    0m22.850s
    sys     0m0.430s
    % time python -c 'import cPickle ; db = cPickle.load(open("hammie.pck"))'

    real    0m5.650s
    user    0m3.720s
    sys     0m0.350s
    % time python -c 'import shelve ; db = shelve.open("hammie.db")'

    real    0m0.155s
    user    0m0.050s
    sys     0m0.050s
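
For anyone wanting to repeat this kind of comparison without a real
hammie.db, here is a small sketch that builds a synthetic database and
times the two startup paths.  The file names, key format, and helper names
are invented for the example, it is written for Python 3 (where cPickle no
longer exists as a separate module), and shelve will use whatever dbm
backend the local Python provides, which is not necessarily Berkeley DB.
The absolute numbers will therefore not match the ones above.

```python
import os
import pickle
import shelve
import tempfile
import time


def build_files(n_keys, directory):
    """Write the same synthetic token table as a pickle and as a shelf."""
    data = {"token%d" % i: (i % 7, i % 11) for i in range(n_keys)}
    pck_path = os.path.join(directory, "demo.pck")
    with open(pck_path, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    shelf_path = os.path.join(directory, "demo_shelf")
    with shelve.open(shelf_path) as shelf:
        shelf.update(data)
    return pck_path, shelf_path


def time_startup(pck_path, shelf_path, probe_key):
    """Time loading the whole pickle vs. opening the shelf and one lookup."""
    start = time.perf_counter()
    with open(pck_path, "rb") as f:
        db = pickle.load(f)  # reads every key into memory
    pickle_value = db[probe_key]
    pickle_secs = time.perf_counter() - start

    start = time.perf_counter()
    with shelve.open(shelf_path, flag="r") as shelf:
        shelf_value = shelf[probe_key]  # fetches only this one record
    shelve_secs = time.perf_counter() - start

    return pickle_secs, shelve_secs, pickle_value, shelf_value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = build_files(150000, tmp)
        p_secs, s_secs, _, _ = time_startup(*paths, probe_key="token1")
        print("pickle load: %.3fs  shelve open: %.3fs" % (p_secs, s_secs))
```

The point of the comparison is that the pickle path has to deserialize
every key before the first lookup, while the shelf only opens the file and
fetches the records actually requested.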

This is not to imply that my huge database or my usage of hammiefilter is
typical.  Using pickles for moderately sized training databases would
probably work, regardless of the application.  With long-running SB apps
like the Outlook plugin or pop3proxy, which pay the load cost only once at
startup, pickles are probably the way to go.  (Maybe it's time to give up
on hammiefilter altogether.)

Skip


