[spambayes-dev] I took a big step Tuesday...

Thu Jul 24 11:37:36 EDT 2003

After having used Spambayes since last September and scanning all messages
marked as spam during that time, I made the decision a couple days ago to
simply dump spam which scores 1.00 (or 1,00 if you've been following the
recent locale saga).  I mention it here to suggest that maybe it's
worthwhile to consider creating finer-grained "spam" categories.

Here's some data.  I have been logging the scores of arriving mail since
early June, about 50 days at this point.  Each message that arrives and is
scored gets an entry in a log file like so:

    2003-06-04:15:14 spam; 1.00 <200306040951.CAA02043 at z.ew01.com>

This allows me to go back and look at the distribution of incoming
messages.  In that time I've received nearly 59000 messages.  The
distribution of scores near the extremes looks like this:

    0.00 32854
    0.01 1432
    0.02 453
    0.03 179
    0.04 95
    0.05 73
    0.06 40
    0.07 48
    0.08 102
    0.09 33
    0.10 32
    ...
    0.90 64
    0.91 86
    0.92 106
    0.93 103
    0.94 123
    0.95 183
    0.96 244
    0.97 381
    0.98 638
    0.99 1406
    1.00 19083
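
A tally like the one above could be produced from such a log with a few
lines of Python.  This is only a sketch: the field layout is inferred from
the single sample line shown earlier, and the names LOG_LINE, tally_scores
and format_distribution are made up for illustration, not part of Spambayes.

```python
import re
from collections import Counter

# Assumed layout, inferred from the one sample log line:
#   <timestamp> <category>; <score> <message-id>
LOG_LINE = re.compile(
    r"^\S+\s+(?P<category>\w+);\s+(?P<score>\d[.,]\d{2})(?:\s|$)"
)


def tally_scores(lines):
    """Count log entries per two-decimal score."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a log entry
        # Treat "1,00" the same as "1.00" in case of locale-formatted scores.
        counts[match.group("score").replace(",", ".")] += 1
    return counts


def format_distribution(counts):
    """Return 'score count' strings sorted by score."""
    return [f"{score} {counts[score]}" for score in sorted(counts)]
```

Feeding it an open log file, e.g. tally_scores(open(path)) with path being
wherever the log lives, yields the counts; format_distribution turns them
into lines shaped like the listing above.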

I currently have my ham and spam cutoffs set at 0.15 and 0.80, respectively.
As I mentioned in a recent message, I consider 0.80 to 0.90 to be "low spam"
and 0.91 to 1.00 to be "high spam".  The step I took Tuesday was to simply
dump mail which scores 1.00.  That eliminates roughly 85% of the spam from
consideration, I guess around 200 messages per day (not the 380+ messages
per day the numbers above suggest, as my procmailrc file already has a
couple of rules to filter out a lot of spam duplicates).
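
The 85% and 380+ figures can be checked against the listing above.  The
sketch below uses only the buckets that are actually shown (0.90 through
1.00); the 0.80 to 0.89 buckets are elided in the listing, so the true share
of 1.00 scores among all spam would be somewhat lower than what this
computes.  The variable names are illustrative.

```python
# Counts copied from the listing above (scores 0.90 through 1.00).
high_buckets = {
    "0.90": 64, "0.91": 86, "0.92": 106, "0.93": 103, "0.94": 123,
    "0.95": 183, "0.96": 244, "0.97": 381, "0.98": 638, "0.99": 1406,
    "1.00": 19083,
}
days_logged = 50  # "about 50 days" per the message

listed_total = sum(high_buckets.values())
share_at_one = high_buckets["1.00"] / listed_total
per_day_at_one = high_buckets["1.00"] / days_logged

print(f"messages listed at 0.90 or above: {listed_total}")
print(f"share scoring 1.00: {share_at_one:.1%}")
print(f"1.00 messages per day (before dedup): {per_day_at_one:.0f}")
```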

I don't recall the last time I saw a false positive, and the place where
mistakes are most likely to be made is in the lower-scoring spams.  I
figure that with the size of my training set (21000+ messages) and the lack
of false positives, the risk of deleting a valid message is low enough.

Relating that to spambayes-dev subject matter, perhaps a "super-spam" cutoff
could be created which would automatically delete messages which score at or
above that value if the user's training set was "large enough".  Thus, if
they started training from scratch it would have no effect.  By default, it
would be set to something > 1.0 to prevent it from coming into play
unexpectedly.  I don't know what "large enough" is though.
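
To make the proposal concrete, here is a rough sketch of how such a cutoff
might behave.  Nothing here is existing Spambayes code: the function
classify, its parameter names, and the "delete" label are all hypothetical,
and the comparison directions at the cutoffs are an assumption.  Since
"large enough" is an open question, min_training_size has no default and
must be supplied by the caller.

```python
def classify(score, training_size, *, min_training_size,
             ham_cutoff=0.15, spam_cutoff=0.80, super_spam_cutoff=1.01):
    """Map a score to 'ham', 'unsure', 'spam' or 'delete'.

    super_spam_cutoff defaults to a value above 1.0, so no score can
    reach it and nothing is deleted unless the user lowers it.
    """
    # Auto-delete only when the score is high enough AND the classifier
    # has been trained on enough messages to be trusted.
    if score >= super_spam_cutoff and training_size >= min_training_size:
        return "delete"
    if score >= spam_cutoff:
        return "spam"
    if score < ham_cutoff:
        return "ham"
    return "unsure"
```

With the defaults, classify(1.00, 21000, min_training_size=20000) is just
"spam".  Lowering super_spam_cutoff to 1.00 turns that into "delete", while
the same call with a training_size of 100 stays "spam".  The 20000 here is
a placeholder for illustration, not a recommendation.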

Skip