[Spambayes] expiration ideas.

Anthony Baxter anthony@interlink.com.au
Mon Oct 21 05:22:47 2002


>>> "Alexander G. M. Smith" wrote
> Anthony Baxter wrote:
> >   Keep the "interim" wordinfo around (gzipped, datestamped) until your
> >   expiration time is up - then undo the earlier merge, subtracting
> >   the spamcount/hamcounts. 
> Sounds reasonable.  But I'd rather keep around the whole messages so
> that I can change tokenizing schemes.  Or perhaps use one of those
> future inter-word relation schemes.
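
To make the "undo the earlier merge" idea concrete, here is a rough
sketch. It is not SpamBayes code: the class name, the method names and
the use of collections.Counter for the spamcount/hamcount tables are
all assumptions for illustration, and the interim batches are held in
memory rather than as gzipped, datestamped files.

```python
from collections import Counter
from datetime import date, timedelta


class ExpiringWordInfo:
    """Hypothetical sketch: merged counts plus datestamped interim batches."""

    def __init__(self, max_age_days):
        self.max_age = timedelta(days=max_age_days)
        self.spam = Counter()   # token -> spamcount (assumed layout)
        self.ham = Counter()    # token -> hamcount (assumed layout)
        self.batches = []       # (day, spam_delta, ham_delta) tuples

    def merge(self, day, spam_tokens, ham_tokens):
        """Add one training batch and remember exactly what it contributed."""
        spam_delta = Counter(spam_tokens)
        ham_delta = Counter(ham_tokens)
        self.spam.update(spam_delta)
        self.ham.update(ham_delta)
        self.batches.append((day, spam_delta, ham_delta))

    def expire(self, today):
        """Undo every batch older than max_age by subtracting its counts."""
        kept = []
        for day, spam_delta, ham_delta in self.batches:
            if today - day > self.max_age:
                self.spam.subtract(spam_delta)
                self.ham.subtract(ham_delta)
            else:
                kept.append((day, spam_delta, ham_delta))
        self.batches = kept
        # Unary plus drops tokens whose count has fallen to zero.
        self.spam = +self.spam
        self.ham = +self.ham


info = ExpiringWordInfo(max_age_days=30)
info.merge(date(2002, 9, 1), ["viagra", "free"], ["meeting"])
info.merge(date(2002, 10, 15), ["free"], ["meeting", "lunch"])
info.expire(today=date(2002, 10, 21))
print(info.spam, info.ham)
```

Only the per-batch token counts are kept, so the original messages are
not needed to expire old training data.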

That's fine, but once this stuff is deployed, how many end-users are
going to want to tweak their tokeniser? I'd suggest approximately
three eighths of one fifth of bugger-all :)

> The total space is roughly ten times that of a word list
> (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB
> bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is
> still almost trivial on today's computers and huge disk drives to
> store the complete messages.  So, you have to ask yourself if a
> 10X space (and tokenizing time) savings is worth it.
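
As a quick check on the "10X" figure, the ratios can be worked out
from the sizes quoted above. This is only a sketch: it assumes
1MB = 1000KB, and the function name is made up for illustration.

```python
# Sizes in KB, taken from the figures quoted above (assuming 1MB = 1000KB).
sizes_kb = {
    "raw": (5900, 660),      # (whole messages, word list)
    "zipped": (2400, 270),
}


def space_ratio(messages_kb, wordlist_kb):
    """How many times larger the message archive is than the word list."""
    return messages_kb / wordlist_kb


for label, (messages_kb, wordlist_kb) in sizes_kb.items():
    print(f"{label}: {space_ratio(messages_kb, wordlist_kb):.1f}x")
```

Both ratios come out a little under nine, so "10X" is a fair round
number.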

For one user, fine - but what about a setting with multiple users,
say, on an IMAP server? You'd want the filtering to happen on the
server, so that end users don't have to run a program to download
the mail, check it, and send commands to the IMAP server to move
the spam out of the way...

I also get enough email that I really don't want to be lugging 
around all of my old email for a couple of months...

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.