[Spambayes] (non-)spam count would go negative!
Harald Hanche-Olsen
hanche at math.ntnu.no
Thu May 8 00:59:14 EDT 2003
I am just getting started playing around with spambayes. I have a
folder with about 16000 spam, plus various folders of non-spam (er,
ham you call it, about 20000 in all). So I ran mboxtrain.py over the
whole collection. Then I checked the result using hammie.py -u. Of
course, in my spam folder I found a small handful of nonspam messages,
and in my nonspam folders I found a somewhat larger handful of spams.
So I moved these messages to their rightful folders, deleted the
database, and retrained. But this time, mboxtrain.py dies with a
message
    (non-)spam count would go negative!
when it comes across one of the reclassified messages.
After a look in the source code and a visit to the mailing list
archives, I guess I understand how this happened. So now I am
steering away from mboxtrain and using hammie.py for the training
instead. Is the notion of marking messages used for training with a
header line really well thought out? From my limited experience with
this "feature", I would suggest not.
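As a rough illustration of the failure described above, here is a toy
Python sketch. It is not the Spambayes code: the class ToyCounts, the
function train_with_marker and the header name X-Toy-Trained are all
hypothetical, and the sketch only assumes that the trainer marks each
message with a header recording how it was trained, and tries to undo
that training when the message later turns up in the other category.

```python
class ToyCounts:
    """Hypothetical stand-in for a classifier database; tracks only message counts."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, is_spam):
        # Refuse to undo training that this database never recorded.
        if is_spam:
            if self.nspam < 1:
                raise ValueError("spam count would go negative!")
            self.nspam -= 1
        else:
            if self.nham < 1:
                raise ValueError("non-spam count would go negative!")
            self.nham -= 1


def train_with_marker(db, msg, is_spam):
    """Train on msg (a dict of headers), using a marker header to skip or undo earlier training."""
    previous = msg.get("X-Toy-Trained")  # hypothetical header name
    wanted = "spam" if is_spam else "ham"
    if previous == wanted:
        return False  # already trained this way, nothing to do
    if previous is not None:
        # The marker says the message was trained the other way: undo that first.
        db.unlearn(previous == "spam")
    db.learn(is_spam)
    msg["X-Toy-Trained"] = wanted
    return True


def demo():
    db = ToyCounts()
    msg = {"Subject": "hello"}
    train_with_marker(db, msg, is_spam=True)  # first run: message sits in the spam folder
    db = ToyCounts()  # database deleted, but the marker survives in the mailbox
    try:
        train_with_marker(db, msg, is_spam=False)  # message has been moved to a ham folder
    except ValueError as exc:
        return str(exc)
    return None
```

In this sketch the marker lives in the mailbox while the counts live in
the database, so deleting only the database leaves the two out of step,
and the first reclassified message triggers the error.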
Otherwise I like what I see so far: I see 0.01% of hams incorrectly
labeled as spam, 0.3% of spams labeled ham, and somewhat less than 1%
of either category marked as unsure. Not bad at all, I think.
- Harald