[Spambayes] full o' spaces

Sun Mar 9 08:38:23 EST 2003

    Tim> Ok, I train on virtually every piece of mail that comes into my
    Tim> notes inbox.  the ratio is about 10:1 spam:ham.  I currently have
    Tim> about 600 spam trained into the database.  I still get maybe
    Tim> 10%-15% unsure, invariably on spam.  I virtually never have a FP.
    Tim> Maybe I just need to adjust the spam cutoff...  Mainly thinking out
    Tim> loud, and bemoaning the fact that I've annoyed my namesake.

Tim,

I know your Notes environment may not allow this, but I do a couple of things
to keep duplicate postings from ever being considered.  At the very start of
my .procmailrc file I discard any message whose message-id I've seen
recently:

    # make sure we don't get two copies of the same message
    :0 Wh: msgid.lock
    | $FORMAIL -D 16384 $HOME/tmp/msgid.cache
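
For anyone who can't run procmail, the same idea can be sketched in a few
lines of Python.  This is only an illustration, not what formail does
internally: the MessageIdCache class and its max_entries parameter are
made-up names, and the cache here is bounded by number of entries, whereas
the 16384 given to formail -D is a cache size in bytes.

```python
from collections import OrderedDict
from email import message_from_string


class MessageIdCache:
    """Remember recently seen message-ids (hypothetical helper)."""

    def __init__(self, max_entries=1000):
        # max_entries is an assumed bound, counted in ids rather than bytes
        self.max_entries = max_entries
        self._seen = OrderedDict()

    def is_duplicate(self, raw_message):
        """Return True if this message-id was seen recently, else record it."""
        msg = message_from_string(raw_message)
        msgid = (msg.get("Message-ID") or "").strip()
        if not msgid:
            # no message-id to compare against, so let the message through
            return False
        if msgid in self._seen:
            # refresh its position so it stays in the cache longer
            self._seen.move_to_end(msgid)
            return True
        self._seen[msgid] = True
        if len(self._seen) > self.max_entries:
            # evict the least recently seen id
            self._seen.popitem(last=False)
        return False
```

A caller would drop any message for which is_duplicate() returns True before
it ever reaches the classifier.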

Later, after a message has been determined to be spam, I run my loose
checksum script and dump the message if it looks the same as a previous
spam:

    :0
    * ^X-Spambayes-Classification: spam
    {
        ### this recipe gobbles items with matching body checksums (taken
        ### loosely to try and avoid obvious tricks)
        :0 W: cksum.lock
        | $PYCKSUM -v $HOME/tmp/cksum.cache

        :0:
        $SPAM
    }
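
A rough Python sketch of what a loose body checksum could look like follows.
It is not the actual $PYCKSUM script, which isn't reproduced here; the
function names and the normalization steps (lowercasing, and collapsing
everything that isn't a letter into a single space) are assumptions chosen
only to show why trivial variations such as random digits or extra
punctuation need not change the checksum.

```python
import hashlib
import re
from email import message_from_string


def loose_checksum(raw_message):
    """Hash the text of a message after aggressive normalization."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        # only text parts contribute; multipart containers are skipped
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            parts.append(payload.decode("latin-1"))
    body = "\n".join(parts).lower()
    # assumed normalization: anything that isn't a-z becomes one space
    body = re.sub(r"[^a-z]+", " ", body).strip()
    return hashlib.sha256(body.encode("ascii")).hexdigest()


def is_repeat(raw_message, seen):
    """Return True if a message with the same loose checksum was seen."""
    digest = loose_checksum(raw_message)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Here seen is just a set held in memory; a real filter would need to persist
it between runs, as the cksum.cache file does in the recipe above.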

If I didn't take these steps I'm sure I'd get more spam (and probably see
more mistakes).  Since building my initial large training set, I have
generally only trained on mistakes and unsures.  Accordingly, I have about
12,000 saved hams and 7,000 saved spams.  If the code changes I retrain
from scratch; otherwise I generally train only on new messages.

I think either of these techniques (message-id caching and loose checksums)
could be incorporated into pop3proxy without much effort.

Maybe you could use something like the script I posted the other day to
remove duplicates from your collection and bring your spam:ham ratio to
something closer to 1:1.

Skip