[Spambayes] Purging Old Words - Date vs Sequence

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 21:57:18 -0400


[Alexander G. M. Smith]
> While implementing a BeOS version of Paul Graham's spam detection
> algorithm (available at http://www.bebits.com/app/3055 - I'll be
> switching to Gary Robinson's algorithm soon),

I hear that Python runs on BeOS too -- feel free to use our engine.  Feel
even freer to contribute great ideas to it <wink>.

> I had a need to purge old words from the database.  More of a need
> than usual since I'm simplistically considering the whole message and
> breaking it into simple words, even binary attachments.  Then
> deleting the unused binary garbage after a while.  I suppose that
> technique could even find spam encoded as pictures.

Probably not.  Compression is part of the GIF and JPEG schemes, and the
better the compression, the more random the bytes look (that's why your
favorite compression program saves very little when compressing these
guys -- it can't find structure to exploit).  There may be some value in
mining the first few bytes to capture "magic numbers", although I expect we
get much the same thing by generating tokens for the MIME decorations saying
which kind of binary gibberish a thing is, and the intended file name if
it's an attachment.
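
For illustration only (this isn't our tokenizer's actual code, and the
token spellings are invented), that kind of MIME-decoration token
generation amounts to something like:

    import email

    def mime_tokens(msg_text):
        """Yield synthetic tokens describing the MIME structure of a message."""
        msg = email.message_from_string(msg_text)
        for part in msg.walk():
            yield "content-type:" + part.get_content_type()
            fname = part.get_filename()
            if fname:
                yield "filename:" + fname.lower()
            if part.is_multipart():
                continue
            if part.get_content_maintype() != "text":
                payload = part.get_payload(decode=True) or b""
                # Crude "magic number" token: the first four bytes, hex-encoded
                # (GIF, JPEG, and PDF files all start with telltale bytes).
                yield "magic:" + payload[:4].hex()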

> I thought of using a date stamp, like the spambayes project does,
> but that could unevenly remove messages since they aren't added
> at uniform times.

Heh.  AFAIK, nobody has even *tried* calling our clearjunk() method yet.
That's all up in the air, and I'm hoping somebody else will be its savior.
Note, though, that the timestamps are updated by our engine during
*prediction*:  it records the last time a word was accessed.  That's equal
to the creation time if and only if the word has never appeared again in any
message that was scored.  Provided prediction is run routinely, then, the
words actually being used by prediction will have current timestamps.
Consuming an 8-byte float per word for this purpose is absurd, though.
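
For anyone tempted to play savior, the core of a timestamp-based purge could
be as simple as this sketch (illustrative only; the real clearjunk() and
per-word record details may differ):

    import time

    # Illustrative only -- not the actual clearjunk() code.  Assume each
    # record in `wordinfo` (a dict mapping word -> info object) carries an
    # `atime` attribute holding the last time the word was accessed.
    def expire_old_words(wordinfo, max_age_days=120):
        """Drop words whose last access is older than max_age_days."""
        cutoff = time.time() - max_age_days * 24 * 3600
        for word in list(wordinfo):
            if wordinfo[word].atime < cutoff:
                del wordinfo[word]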

> Instead, I assign a sequentially increasing serial number to each
> example of spam or ham, and store that along with the words and
> frequency counts.  If the word is in a later example message, the
> serial number of the new example replaces the old one for that word.
> Then I can purge words which appeared N messages before the latest
> one (usually in combination with having a low frequency count).

All expiration schemes need testing, because they're cheating:  word counts
are added on a per-msg basis.  Expiration that doesn't also work on a
per-msg basis is thus a distortion of real life.  Whether that's harmful in
this scheme isn't known.  For example, msg 1 contains words A and B.  They
each get a count of 1.  Msg 2 contains words B and C.  Now we have

       count  serial
    A    1       1
    B    2       2
    C    1       2

Now we decide message #1 is stale, but the only word still having serial 1
is A.  This leaves us with B=2 and C=1, which doesn't match any previous
real state.  It isn't obvious to me that this won't destroy performance over
time; but it isn't obvious to me that it will hurt it, either.  I find it
easier to test such things than think about them, though <wink>.
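
Spelled out as toy code (my reading of the scheme above, not the actual
BeOS implementation), the bookkeeping and the odd end state look like this:

    # Toy model of the serial-number scheme described above.
    db = {}        # word -> (count, serial of last message containing it)
    serial = 0

    def train(words):
        global serial
        serial += 1
        for w in words:
            count, _ = db.get(w, (0, serial))
            db[w] = (count + 1, serial)

    def purge(oldest_kept_serial, max_count=1):
        """Drop low-count words last seen before oldest_kept_serial."""
        for w in list(db):
            count, last = db[w]
            if last < oldest_kept_serial and count <= max_count:
                del db[w]

    train("A B".split())   # msg 1
    train("B C".split())   # msg 2
    purge(2)               # decide msg 1 is stale
    print(db)              # {'B': (2, 2), 'C': (1, 2)}:  B=2 and C=1, as above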

> I suppose you could even factor in the kill count or have a last used
> to kill spam date/serial number too.

Our killcount was put in to help research more than anything else.  It
turned out to be *great* for identifying changes that gave great results for
bad reasons (e.g., your ham goes back 5 years, but all your spam was
collected in the last month -- then clues about dates, or even the ISP you
were using at various times, become extremely strong and frequent
discriminators, but for bad reasons -- and their high killcounts make this
obvious).

> Anyway, that's the only novel idea I've had on the topic,

Expiration will be very important over time.  Keep thinking!  Even if you
don't tokenize huge masses of random binary gibberish, it's common across
computer indexing applications for databases to grow without bound (typos,
message ids, random bits of fly-by-night URLs, ..., they pile up endlessly).

> everything else I've thought of has been covered here on the mailing
> list (well, except for the pretty graphics display of the word list).
>
> Thanks to all for moving spam detection forward so much, and doing
> all that tedious experimental testing to find the best settings.

The best way to thank the folks here is to contribute your own tedious
experimental testing back to us -- we don't have enough of that, and there's
no end in sight for things that should be tested.  Pull up a comfy chair and
get obsessed <wink>.