[Spambayes] cmp.py with mean and dev comparison

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 15:19:45 -0400


[Brad Clements]
> But then I'd have to change timcv.py

Why?  The means and sdevs are computed and displayed by the Hist(ogram)
class in TestDriver.py.  timcv.py is just a tiny wrapper that adapts
TestDriver to "the standard" Ham/Spam directory structure, and drives
TestDriver in a cross-validation way.  timcv.py doesn't compute or print
anything directly.
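
As a rough illustration of that division of labor, here is a minimal sketch of a histogram-style accumulator that collects scores and reports their mean and standard deviation. The name ScoreHist and its methods are hypothetical and are not the actual Hist class in TestDriver.py; the sketch also assumes the population (divide-by-n) standard deviation, which may differ from what the real code uses.

```python
import math


class ScoreHist:
    """Hypothetical stand-in for a histogram class; not the real Hist."""

    def __init__(self):
        self.scores = []

    def add(self, score):
        # Record one classifier score.
        self.scores.append(score)

    def mean_sdev(self):
        # Return (mean, population standard deviation) of recorded scores.
        n = len(self.scores)
        if n == 0:
            return 0.0, 0.0
        mean = sum(self.scores) / n
        variance = sum((s - mean) ** 2 for s in self.scores) / n
        return mean, math.sqrt(variance)
```

A wrapper in the spirit of timcv.py would only feed scores into such an object for each cross-validation run and leave the arithmetic and the display to it.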

> and re-run my tests.. :-(
>
> PIII-933 isn't very fast..

When you run 50x as many tests on a slower machine, I'll have some sympathy
based on comparable experience <wink>.

> But .. since I'll be spelunking the timcv.py code, I think I'll
> also work on an "end-user emulation" module. Where I can simulate
> receiving messages and training as I go.

Such stuff probably belongs in new classes.

> I want to determine the daily error rate as I train the system. I
> suppose I'll have to be able to specify a ratio of ham/spam (< 1 in my
> case) and a "daily message count".  What are typical numbers to use?

For whom?  I get about 600 emails per day.  Is that typical?  It certainly
is for me <wink>.  Some people get very little spam, while John Draper
reported something like 70% spam.
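
One way such an emulation might be structured, as a sketch only: score each day's mail with whatever has been learned so far, record that day's error rate, then train on the day's mail. Everything here is hypothetical (simulate_days, ToyClassifier, and the classify/train callables are illustration names, not part of the spambayes code), and the toy classifier is deliberately trivial so the example stands alone. The ham/spam ratio is assumed to be baked into how the caller builds the message list.

```python
def simulate_days(messages, per_day, classify, train):
    """Return a list of daily error rates.

    messages: list of (text, is_spam) pairs in arrival order.
    per_day:  the "daily message count".
    classify: callable text -> bool (True means spam).
    train:    callable (text, is_spam) -> None.
    """
    rates = []
    for start in range(0, len(messages), per_day):
        day = messages[start:start + per_day]
        # Judge today's mail using only what was learned on earlier days.
        errors = sum(1 for text, is_spam in day if classify(text) != is_spam)
        rates.append(errors / len(day))
        # Then train on today's mail, as a diligent user would.
        for text, is_spam in day:
            train(text, is_spam)
    return rates


class ToyClassifier:
    """Trivial word-overlap classifier, for illustration only."""

    def __init__(self):
        self.spam_words = set()
        self.ham_words = set()

    def train(self, text, is_spam):
        target = self.spam_words if is_spam else self.ham_words
        target.update(text.split())

    def classify(self, text):
        words = set(text.split())
        return len(words & self.spam_words) > len(words & self.ham_words)
```

To model a user who is lazy about feeding ham, the train callable could skip some fraction of the ham messages; that variation is left out here.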

> ...
> I suspect most users will be diligent about feeding spam into the
> trainer, but will be lazy about feeding it ham.

Me too.  Gary fiddled the formulas in his approach to try to be robust in
the face of this.  No experiments have been run to test it, though, under
either Gary's or Paul's scheme.