[Spambayes] (non-)spam count would go negative!

Harald Hanche-Olsen hanche at math.ntnu.no
Thu May 8 00:59:14 EDT 2003


I am just getting started playing around with spambayes.  I have a
folder with about 16000 spam, plus various folders of non-spam (er,
ham you call it, about 20000 in all).  So I ran mboxtrain.py over the
whole collection.  Then I checked the result using hammie.py -u.  Of
course, in my spam folder I found a small handful of nonspam messages,
and in my nonspam folders I found a somewhat larger handful of spams.
So I moved these messages to their rightful folders, deleted the
database, and retrained.  But this time, mboxtrain.py dies with a
message

  (non-)spam count would go negative!

when it comes across one of the reclassified messages.

After a look in the source code and a visit to the mailing list
archives, I guess I understand how this happened.  So now I am
steering away from mboxtrain and using hammie.py for the training
instead.  Is the notion of marking messages used for training with a
header line really well thought out?  From my limited experience with
this "feature", I would suggest not.
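
For readers who have not followed the archives, here is a rough sketch of
how this kind of failure can arise. It is a simplified model under stated
assumptions, not the actual spambayes source: the header name
X-Trained-As, the ToyCounts class and the train_message function are all
made up for illustration. The assumption is that the trainer records the
previous classification in a header on the message itself, and that on a
changed classification it first tries to undo the old training.

```python
# Simplified model only; NOT the actual spambayes code.
TRAINED_HEADER = "X-Trained-As"  # hypothetical header name


class ToyCounts:
    """Tracks only the number of trained spam and ham messages."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, is_spam):
        # Undoing training that this database never saw is an error.
        if is_spam:
            if self.nspam <= 0:
                raise ValueError("spam count would go negative!")
            self.nspam -= 1
        else:
            if self.nham <= 0:
                raise ValueError("non-spam count would go negative!")
            self.nham -= 1


def train_message(db, headers, is_spam):
    """Train one message; headers is a dict standing in for the mail headers."""
    label = "spam" if is_spam else "ham"
    previous = headers.get(TRAINED_HEADER)
    if previous == label:
        return False  # already trained this way, nothing to do
    if previous is not None:
        # The header claims earlier training, so try to undo it first.
        db.unlearn(previous == "spam")
    db.learn(is_spam)
    headers[TRAINED_HEADER] = label
    return True


# First run: the message sits in the spam folder and gets marked.
db = ToyCounts()
msg = {}
train_message(db, msg, is_spam=True)

# The message is moved to a ham folder and the database is deleted,
# but the marker header stays on the message.
db = ToyCounts()
try:
    train_message(db, msg, is_spam=False)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

In this model the trouble is that the training state lives in two places,
the database and the messages. Deleting only the database leaves stale
markers behind, and the fresh database is asked to forget something it
never learned.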

Otherwise I like what I see so far: I see 0.01% of hams incorrectly
labeled as spam, 0.3% of spams labeled ham, and somewhat less than 1%
of either category marked as unsure.  Not bad at all, I think.
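
To put those percentages in terms of messages, here is a small
back-of-the-envelope calculation. It assumes the rounded folder sizes
given above (about 20000 ham, about 16000 spam), so the results are
approximate; the helper name count_for is made up for this sketch.

```python
# Approximate folder sizes from the message above.
ham_total = 20000
spam_total = 16000


def count_for(percent, total):
    """Convert a percentage of a folder into a whole number of messages."""
    return round(total * percent / 100)


false_positives = count_for(0.01, ham_total)   # hams labeled as spam
false_negatives = count_for(0.3, spam_total)   # spams labeled as ham
unsure_ham_max = count_for(1, ham_total)       # upper bound, "less than 1%"
unsure_spam_max = count_for(1, spam_total)     # upper bound, "less than 1%"

print(false_positives, false_negatives, unsure_ham_max, unsure_spam_max)
```

That comes to roughly 2 hams labeled as spam, roughly 48 spams labeled as
ham, and fewer than about 200 hams and 160 spams marked unsure.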

- Harald


