[Spambayes] full o' spaces
Skip Montanaro
skip at pobox.com
Sun Mar 9 08:38:23 EST 2003
Tim> Ok, I train on virtually every piece of mail that comes into my
Tim> notes inbox. the ratio is about 10:1 spam:ham. I currently have
Tim> about 600 spam trained into the database. I still get maybe
Tim> 10%-15% unsure, invariably on spam. I virtually never have a FP.
Tim> Maybe I just need to adjust the spam cutoff... Mainly thinking out
Tim> loud, and bemoaning the fact that I've annoyed my namesake.
Tim,
I know your Notes environment may not allow this, but I do a couple of things
to minimize the number of duplicate postings that ever get considered. At
the very start of my .procmailrc file I remove messages with a message-id
I've seen recently:
# make sure we don't get two copies of the same message
:0 Wh: msgid.lock
| $FORMAIL -D 16384 $HOME/tmp/msgid.cache
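For anyone who can't run procmail, the message-id check above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how formail works internally: the class name MessageIdCache and the max_entries parameter are made up for illustration, and the cache is bounded by entry count and held in memory, whereas formail -D bounds an on-disk cache file by size in bytes.

```python
from collections import OrderedDict
from email import message_from_string


class MessageIdCache:
    """Bounded set of recently seen Message-IDs (hypothetical helper)."""

    def __init__(self, max_entries=1000):
        # max_entries is an assumption; formail limits its cache by bytes.
        self.max_entries = max_entries
        self._seen = OrderedDict()

    def is_duplicate(self, raw_message):
        """Return True if the Message-ID was seen recently, else record it."""
        msgid = message_from_string(raw_message).get("Message-ID")
        if msgid is None:
            return False  # nothing to compare against, so let it through
        msgid = msgid.strip()
        if msgid in self._seen:
            self._seen.move_to_end(msgid)  # refresh its position in the cache
            return True
        self._seen[msgid] = True
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return False
```

A caller would create one cache, pass each incoming message through is_duplicate(), and drop the message whenever it returns True.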
Later, after a message has been determined to be spam, I run my loose
checksum script and dump the message if it looks the same as a previous
spam:
:0
* ^X-Spambayes-Classification: spam
{
    ### this recipe gobbles items with matching body checksums (taken
    ### loosely to try and avoid obvious tricks)
    :0 W: cksum.lock
    | $PYCKSUM -v $HOME/tmp/cksum.cache

    :0:
    $SPAM
}
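To give a feel for what a "loose" body checksum might look like, here is a minimal Python sketch. It is not the script behind $PYCKSUM, which isn't shown here. The function names and the normalization rule (lowercase the text parts and keep only the letters a-z) are assumptions chosen for illustration, so that random digits, punctuation and whitespace padding don't change the result.

```python
import hashlib
import re
from email import message_from_string


def loose_checksum(raw_message):
    """Hash the message body after aggressive normalization (illustrative)."""
    msg = message_from_string(raw_message)
    # Collect the raw payload of every text part; headers are ignored.
    body = "".join(
        part.get_payload()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    )
    # Assumed normalization: lowercase, then drop everything except a-z.
    normalized = re.sub(r"[^a-z]+", "", body.lower())
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


def is_repeat(raw_message, seen):
    """Return True if a message with the same loose checksum was seen."""
    digest = loose_checksum(raw_message)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Here seen is just a set that the caller keeps around. A real filter would have to persist it between runs, much as the recipe above keeps its cache in a file.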
If I didn't take these steps I'm sure I'd get more spam (and probably see
more mistakes). Since building my initial large training set, I have
generally only trained on mistakes and unsures. Accordingly, I have about
12,000 saved hams and 7,000 saved spams. If the code changes I retrain
completely, but otherwise I generally only train on new messages.
I think either of these techniques (message-id caching and loose checksums)
could be incorporated into pop3proxy without much effort.
Maybe you could use something like the script I posted the other day to
remove duplicates from your collection and bring your spam:ham ratio to
something closer to 1:1.
Skip