[Spambayes] Is Equal Ham & Spam really the best?
David Abrahams
dave at boost-consulting.com
Sun Jul 29 20:32:09 CEST 2007
on Fri Jul 27 2007, "Mark Hammond" <mhammond-AT-skippinet.com.au> wrote:
>> That is high relative to the conventional wisdom, but I'm questioning
>> the correctness of that wisdom.
>
> Check out this thread, which should give you a reasonable idea:
>
> http://mail.python.org/pipermail/spambayes-dev/2003-November/001578.html
>
>> Perhaps it's time to re-evaluate that statement?
>
> Google also shows anecdotal reports of poor results after an imbalance as
> low as 2:1, so I don't think it would be responsible to re-evaluate that
> statement until clear evidence was presented to the contrary.

Because those tests don't have all the same real-world constraints as
I do, I'm still trying to figure out whether they answer my question:
Is it better to withhold data (some previously-misclassified spams)
from the system when training in order to keep ham and spam
balanced, or will I get better results if I let it see all the
previously-misclassified spam despite the imbalance?

In my admittedly not-rigorously-tested experience, it's generally
better to let the system see more data (at least with
train-to-exhaustion).
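To make the two options concrete, here is a rough sketch of the choice being
discussed. It is illustration only: the function names are hypothetical, the
messages are stand-in values, and none of this is SpamBayes code or its API.

```python
import random


def balanced_training_set(ham, spam, seed=0):
    """Option 1: withhold surplus messages so both classes are the same size."""
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    n = min(len(ham), len(spam))
    return rng.sample(ham, n), rng.sample(spam, n)


def full_training_set(ham, spam):
    """Option 2: train on every message, accepting the imbalance."""
    return list(ham), list(spam)


def imbalance_ratio(ham, spam):
    """Size of the larger class divided by the smaller (1.0 means balanced)."""
    small, large = sorted((len(ham), len(spam)))
    if small == 0:
        raise ValueError("both classes need at least one message")
    return large / small


# Stand-in corpora: 10 ham messages and 30 previously-misclassified spams.
ham = [f"ham-{i}" for i in range(10)]
spam = [f"spam-{i}" for i in range(30)]

balanced_ham, balanced_spam = balanced_training_set(ham, spam)
all_ham, all_spam = full_training_set(ham, spam)
```

Scoring the same held-out mail after training on each of the two sets would
answer the question more directly than the anecdotes do.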
--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com