spam,Re: [Spambayes] tons of false positives after upgrading

Tue Jan 11 06:44:02 CET 2005

OK, I went to re-train from scratch. I removed hammie.db,
message_info_database.db, and statistics_database.db from Documents and
Settings/Owner/Application Data/SpamBayes/Proxy.  I was going along fine all
day, training new messages, then I went to review some additional messages
(by right clicking on the tray icon). It pulled up a ton more messages than
I was expecting, so I discarded all except 8 of them.  Then I went to the
home page and it says:
Database only has 7 good and 1 spam - you should consider performing
additional training.

Apparently it pulled up a ton of messages because, all of a
sudden, it decided that it hadn't trained on them already, even though it
had.  So the question is: what did I do before the error occurred that
might have caused SpamBayes to suddenly forget all previous training?
The answer: the only thing I did was modify the configuration so that it
would put the string "spam," in the "To:" and "Subject:" headers.

So is modifying the configuration supposed to undo all the prior training?
If not, any guesses on why this happened?

Thanks for your assistance.

----- Original Message ----- 
From: "Tim Peters" <tim.peters at gmail.com>
To: <spam>; "Nate Tanner" <n.tanner at lunchclub.net>
Cc: <spambayes at python.org>
Sent: Sunday, January 09, 2005 10:35 PM
Subject: spam,Re: [Spambayes] tons of false positives after upgrading

[Nate Tanner]
> i had been using version 0.3 of spambayes for a long time (XP/outlook
> express) and it was working fairly well.  i recently upgraded to 1.0.1, and
> now i get a ton of false positives (including the confirmation and welcome
> messages from this mailing list !!)  probably close to 20% of my valid
> emails are being marked as spam.
>
> does anyone have any ideas about how to fix this problem?  it's worse now
> than if i had no filter, because i have to comb through every spam looking
> for non-spams!  please help!

As Tony suggested, retrain from scratch.  Some of the stuff in your
data really doesn't make sense.  For example,

> ...
> Total emails trained: Spam: 1299 Ham: 3644
...
> header:Subject:1  0.673037 1135 833
> header:From:1     0.675923 1139 847
> header:To:1       0.67644  1139 849
> header:Date:1     0.676889 1138 850

That says, for example, that 3644-1135=2509 of the ham messages you
trained on didn't have a Subject line.  That's unbelievable -- or you
have very weird ham <wink>.  Similarly, about 2,500 of your ham
messages didn't have a To line, From line, or Date line in the
headers.  Those are equally incredible.  These kinds of header lines
should appear in virtually all email, whether ham or spam, and then
they're judged as neutral.  Instead the presence of a Subject line
"looks spammy" to your database, and that's nuts.

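The subtraction above can be repeated for all four header clues. This is only a sketch using the figures quoted in this message; the variable names are made up for illustration, and it assumes (as the explanation above does) that the first count on each clue line is the ham count.

```python
# Figures quoted above: 3644 ham trained, and per-clue ham counts.
total_ham = 3644
ham_with_header = {"Subject": 1135, "From": 1139, "To": 1139, "Date": 1138}

# Ham messages that were trained without each header clue being seen.
missing = {name: total_ham - seen for name, seen in ham_with_header.items()}

for name, count in missing.items():
    share = count / total_ham
    print(f"{name}: {count} ham messages lacked this header ({share:.0%})")
```

Each header comes out missing from roughly two thirds of the trained ham, which is the implausible part.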
This is also incredible:

> sender:no real name:2**0 0.004644 48 0

That says you've trained on no spam at all where the From line didn't
contain a real name -- yet that's very common in spam, and moderately
unusual in ham.  You even have ubiquitous words like "the" and "and"
scoring as spammy!  Something is seriously messed up with the training
here -- start over.
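
To see where scores like 0.673037 and 0.004644 come from, here is a rough sketch of the per-clue spam probability. It assumes the classifier's usual Robinson-style adjustment with an unknown-word probability of 0.5 and a strength of 0.45; those two values and the function name are assumptions, not something stated in this thread, though they do reproduce the numbers quoted above.

```python
def clue_spamprob(ham_count, spam_count, n_ham, n_spam, x=0.5, s=0.45):
    """Estimate how spammy a clue looks, given training counts.

    x (prior for unseen clues) and s (strength of that prior) are
    assumed defaults, not values taken from this thread.
    """
    n = ham_count + spam_count
    if n == 0:
        return x  # never seen: fall back to the prior
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Pull p toward the prior x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# Counts quoted above: 3644 ham and 1299 spam trained in total.
print(clue_spamprob(1135, 833, 3644, 1299))   # header:Subject:1
print(clue_spamprob(48, 0, 3644, 1299))       # sender:no real name:2**0
# A header present in every trained message should come out neutral.
print(clue_spamprob(3644, 1299, 3644, 1299))
```

With sane training data the Subject clue would sit near 0.5, as in the last call, rather than leaning spammy.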