[Spambayes] Purging Old Words - Date vs Sequence

Alexander G. M. Smith agmsmith@rogers.com
Sun, 29 Sep 2002 17:26:02 EDT (-0400)


While implementing a BeOS version of Paul Graham's spam detection
algorithm (available at http://www.bebits.com/app/3055 - I'll be
switching to Gary Robinson's algorithm soon), I had a need to purge
old words from the database.  More of a need than usual, since I'm
simplistically treating the whole message, binary attachments
included, as a stream of simple words, and then deleting the unused
binary garbage after a while.  I suppose that technique could even
find spam encoded as pictures.

I thought of using a date stamp, like the spambayes project does,
but that could purge words unevenly, since example messages aren't
added at uniform times.  Instead, I assign a sequentially increasing
serial number to each example of spam or ham, and store that
along with the words and frequency counts.  If the word is in
a later example message, the serial number of the new example
replaces the old one for that word.  Then I can purge words which
last appeared more than N messages before the latest one (usually
in combination with having a low frequency count).  I suppose you
could even factor in the kill count, or keep a "last used to kill
spam" date/serial number too.  Anyway, that's the only novel idea
I've had on the topic; everything else I've thought of has been
covered here on the mailing list (well, except for the pretty
graphics display of the word list).
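
For anyone who wants to try the serial number idea, here is a rough
Python sketch of it.  This is not the BeOS code, just an illustration:
the names (WordInfo, WordDatabase, add_example, purge, max_age,
max_count) are made up, and counting each word once per message is an
assumption rather than something fixed by the scheme.

```python
from dataclasses import dataclass


@dataclass
class WordInfo:
    spam_count: int = 0   # number of spam examples containing the word
    ham_count: int = 0    # number of ham examples containing the word
    last_serial: int = 0  # serial number of the newest example containing it


class WordDatabase:
    """Hypothetical word store that ages words by message sequence, not date."""

    def __init__(self):
        self.words = {}
        self.next_serial = 1

    def add_example(self, tokens, is_spam):
        """Record one spam or ham example and stamp its words with a new serial."""
        serial = self.next_serial
        self.next_serial += 1
        # Assumption: each distinct word is counted once per message.
        for word in set(tokens):
            info = self.words.setdefault(word, WordInfo())
            if is_spam:
                info.spam_count += 1
            else:
                info.ham_count += 1
            # A later example replaces the older serial number for the word.
            info.last_serial = serial
        return serial

    def purge(self, max_age, max_count):
        """Delete words last seen more than max_age examples ago whose
        total frequency count is at most max_count.  Returns the purged words."""
        latest = self.next_serial - 1
        stale = [
            word
            for word, info in self.words.items()
            if latest - info.last_serial > max_age
            and info.spam_count + info.ham_count <= max_count
        ]
        for word in stale:
            del self.words[word]
        return sorted(stale)
```

Since the age is measured in examples rather than seconds, a burst of
training followed by a quiet month purges nothing extra, which is the
point of using a sequence instead of a date.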

Thanks to all for moving spam detection forward so much, and doing
all that tedious experimental testing to find the best settings.

- Alex