[Spambayes] Is Equal Ham & Spam really the best?

David Abrahams dave at boost-consulting.com
Sun Jul 29 20:32:09 CEST 2007


on Fri Jul 27 2007, "Mark Hammond" <mhammond-AT-skippinet.com.au> wrote:

>> That is high relative to the conventional wisdom, but I'm questioning
>> the correctness of that wisdom.
>
> Check out this thread, which should give you a reasonable idea:
>
> http://mail.python.org/pipermail/spambayes-dev/2003-November/001578.html
>
>> Perhaps it's time to re-evaluate that statement?
>
> Google also shows anecdotal reports of poor results after an imbalance as
> low as 2:1, so I don't think it would be responsible to re-evaluate that
> statement until clear evidence was presented to the contrary.

Because those tests weren't run under the same real-world constraints
I have, I'm still trying to figure out whether they answer my question:

   Is it better to withhold data (some previously-misclassified spams)
   from the system when training in order to keep ham and spam
   balanced, or will I get better results if I let it see all the
   previously-misclassified spam despite the imbalance?

In my admittedly not-rigorously-tested experience, it's generally
better to let the system see more data (at least with
train-to-exhaustion).
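
To make the two options concrete, here is a minimal sketch in plain
Python. It does not touch the SpamBayes API; the function names
(balanced_subset, full_set, imbalance_ratio) are hypothetical helpers
made up for illustration, and messages are assumed to be held in
ordinary lists.

```python
import random


def balanced_subset(ham, spam, seed=0):
    """Option 1: withhold messages from the larger class so that
    ham and spam end up the same size (hypothetical helper)."""
    rng = random.Random(seed)
    n = min(len(ham), len(spam))
    # random.sample draws n distinct items without replacement
    return rng.sample(ham, n), rng.sample(spam, n)


def full_set(ham, spam):
    """Option 2: train on everything and accept the imbalance."""
    return list(ham), list(spam)


def imbalance_ratio(ham, spam):
    """Larger class size divided by smaller (1.0 means balanced)."""
    small, large = sorted((len(ham), len(spam)))
    if small == 0:
        raise ValueError("both classes need at least one message")
    return large / small
```

Which of the two training sets scores better would still have to be
measured on one's own mail; the sketch only shows what is being
compared.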

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com


