[spambayes-dev] Another piece of anecdotal evidence

Wed Jan 14 14:36:04 EST 2004

[Skip]
>>> How do you plan to find those mistrained messages?

[Alex]
>> As part of my nightly retrain, I'm going to make it score each
>> message (with the fully trained DB) and sort them into directories
>> for each month:
>>     {ham,spam}{positive,unsure,negative}
>> Flipping through the hampositive directory for each month should
>> make it fairly easy to spot the problems...

[Skip]
> I'm still confused.  You've got a spam mistrained as ham.  Are you
> suggesting that you expect that scoring that message against your
> training database (which includes features gleaned from that message)
> will reveal that it is something other than ham?  I have a very small
> training database (microscopic compared to yours) and I generally
> find it easier to just start from scratch when I reach the conclusion
> that I have some errors in my database.

I'd suggest running a cross-validation test, with any n >= 2, and setting
the testing options to show FP and FN.  This is extremely effective (IME) at
finding misclassified messages, particularly since a CV run never tests a
message against a classifier that's been trained on that message (unless
you've got duplicates of a message, yadda yadda).
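
To make the idea concrete, here is a minimal sketch of an n-fold run that
reports FP and FN.  It is not the SpamBayes classifier or its test driver:
the toy token-counting scorer and every name in it (tokenize, train,
spam_log_odds, cross_validate) are made up for illustration.  The only point
is that each message is scored by a model that never saw it in training, so
a spam filed as ham tends to surface in the FP list.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase whitespace split, one count per message.
    return set(text.lower().split())


def train(messages):
    # messages: list of (text, is_spam) pairs.
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(tokenize(text))
        totals[is_spam] += 1
    return counts, totals


def spam_log_odds(model, text):
    # Toy naive-Bayes log-odds with add-one smoothing (an assumption,
    # not the chi-squared combining that SpamBayes uses).
    counts, totals = model
    log_odds = math.log((totals[True] + 1) / (totals[False] + 1))
    for tok in tokenize(text):
        p_spam = (counts[True][tok] + 1) / (totals[True] + 2)
        p_ham = (counts[False][tok] + 1) / (totals[False] + 2)
        log_odds += math.log(p_spam / p_ham)
    return log_odds


def cross_validate(messages, n=2):
    # Split into n folds; score each fold with a model trained on the others.
    folds = [messages[i::n] for i in range(n)]
    false_pos, false_neg = [], []
    for i, held_out in enumerate(folds):
        training = [m for j, fold in enumerate(folds) if j != i for m in fold]
        model = train(training)
        for text, is_spam in held_out:
            predicted_spam = spam_log_odds(model, text) > 0
            if predicted_spam and not is_spam:
                false_pos.append(text)   # filed as ham, scored as spam
            elif is_spam and not predicted_spam:
                false_neg.append(text)   # filed as spam, scored as ham
    return false_pos, false_neg


messages = [
    ("cheap pills online now", True),
    ("meeting agenda for monday", False),
    ("buy cheap pills now", True),
    ("monday meeting notes attached", False),
    ("cheap pills buy online", True),
    ("agenda notes for meeting", False),
    ("online pills cheap buy", True),
    ("notes for monday agenda", False),
    ("buy cheap pills online", False),  # a spam mistrained as ham
]

false_pos, false_neg = cross_validate(messages, n=3)
print("FP:", false_pos)
print("FN:", false_neg)
```

On this made-up corpus the mistrained message is the only thing that lands
in the FP list.  With real mail you'd eyeball the FP and FN lists by hand,
since some of them will be honest classifier mistakes rather than training
errors.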