[Spambayes] Is Equal Ham & Spam really the best?

Shawn K. Hall shawn at 12pointdesign.com
Sun Jul 29 21:25:23 CEST 2007


> Even training only on mistakes and unsures, I have had a steadily
> increasing ratio for months.  I almost never see a misclassified
> ham and only very rarely a ham about which the system is unsure.
> It's unsure about spam every day.

I run several servers, and hundreds of domains, so I get quite a bit of
email - even if much of it is just for archival purposes (logs mostly).
I keep MOST of my ham. I no longer keep spam beyond about 2 months or
so, but in that period I can easily collect some 100,000 spam that would
otherwise totally dwarf the amount of ham I receive. I get at least 2k
messages per day, sometimes as many as 5k.

I never exactly plan to rebuild the database, but always do when I make
a big mistake. While I never really had a problem with the effectiveness
of SpamBayes before, a couple of times I've clicked the wrong button in
the 'unsure' folder when I had fifteen+ spam or ham selected, which can
quite effectively destroy the database. So I purge it and retrain on my
current archive of spam and a couple of known good folders under the
inbox in which I have stored a few thousand messages. Having that
archive of known
good messages makes all the difference in the world. I now have a
database of about 90k/80k (the db is about 330 MB) and only receive
about 25 unclassified messages per day on average. About 20% of those
are either gobbledygook or legit messages with a blank subject or no
content except for their attachments; the other 80% are 'trainable'
spam. I
train on all ham and only those spam messages that look like they'll
make a difference to the validity of future checks. If it's a
gobbledygook spam message, I usually just delete it directly from
unsure.

I still use 75%/15% as the spam/ham cutoffs. While I could probably avoid
looking at subject lines for approximately 50-60% of the spam that goes
to unsure by lowering the spam cutoff to 60%, it takes only a few extra
seconds to look through those other subjects or senders once per day to
correct their status. I'd rather not risk having an important message
from a client who is forwarding a spam message they received go
directly to the spam folder. Once a message is in there I don't bother
looking at it more than once per month, when I use the library of spam
I've collected to fine-tune my server-side filters. Legitimate forum
and group messages
can often be flagged higher than 10%, so I don't want to lower my ham
threshold. If anything, it could stand to go up to 20% or so. The
no-subject or attachment-only ham almost always score in the high teens
or low twenties, but if I raise the ham setting I'll get a bit more of
the gobbledygook to my inbox, too.
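
To make the cutoff trade-off above concrete, here is a minimal Python
sketch of a three-way split on a score between 0 and 1. The function
name classify and the exact boundary handling (below the ham cutoff is
ham, at or above the spam cutoff is spam) are assumptions for
illustration only, not SpamBayes' own code.

```python
# Hypothetical helper for illustration; not part of SpamBayes.
HAM_CUTOFF = 0.15   # the 15% ham cutoff mentioned above
SPAM_CUTOFF = 0.75  # the 75% spam cutoff mentioned above


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a score in [0, 1] to 'ham', 'unsure', or 'spam'.

    Boundary handling is an assumption: scores below ham_cutoff are
    ham, scores at or above spam_cutoff are spam, the rest are unsure.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff must not exceed spam_cutoff")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


if __name__ == "__main__":
    # An attachment-only ham scoring in the high teens.
    print(classify(0.18))                   # with the 15% ham cutoff
    print(classify(0.18, ham_cutoff=0.20))  # with the ham cutoff raised
    # A spam that scores between 60% and 75%.
    print(classify(0.65))                   # with the 75% spam cutoff
    print(classify(0.65, spam_cutoff=0.60)) # with the spam cutoff lowered
```

With these assumed rules, raising the ham cutoff to 20% moves the
high-teens ham out of unsure, and lowering the spam cutoff to 60%
moves the mid-scoring spam out of unsure, at the cost of less margin
for error on either side.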

Anyway... 
Just thought a bit more anecdotal evidence might be interesting to some.
;)

Regards,

Shawn K. Hall
http://12PointDesign.com/

'// ========================================================
    "You have to change the map, not the world."
      -- Marcus Kaarto



