[Spambayes] how spambayes handles image-only spams

Tim Peters tim.one at comcast.net
Thu Sep 11 13:41:45 EDT 2003


[Bill Yerazunis]
>> What's your test protocol?  I did "shuffle messages randomly,
>> but preserve knowledge of which class they were in, then
>> train with the first 90% and then test with the last 10%".
>> Repeat as needed...
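
(For concreteness, that protocol amounts to roughly the sketch below; the
list-of-messages interface and the exact 90/10 cutoff are my guesses at
Bill's setup, not code he posted.)

import random

def shuffle_90_10(hams, spams, seed=None):
    """Sketch of the protocol Bill describes: shuffle each class
    separately (so class membership is preserved), train on the first
    90% and test on the last 10%.  The (label, message) pairs and the
    90/10 split point are assumptions, not his actual code."""
    rng = random.Random(seed)
    train, test = [], []
    for label, msgs in (("ham", list(hams)), ("spam", list(spams))):
        rng.shuffle(msgs)                           # shuffle messages randomly
        cut = int(len(msgs) * 0.9)
        train += [(label, m) for m in msgs[:cut]]   # first 90% -> training
        test += [(label, m) for m in msgs[cut:]]    # last 10% -> testing
    return train, test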

[Tony Meyer]
> I did "rebal.py -n5", which IIRC is roughly equivalent to "shuffle
> messages randomly, but preserve knowledge of which class they were
> in". I then did "timtest.py -n5".
>
> I'm happy to admit I understand little of what the testing code does,
> just how to interpret (most of) the results that it gives me.  This is
> one of the strengths of the spambayes testing suite, IMO (not that I
> have tried any other testing suites).
>
> The readme says that it does this:
> """
> Runs an NxN test grid, skipping the diagonal:
>     N classifiers are built.
>     N-1 runs are done with each classifier.
>     Each classifier is trained on 1 set, and predicts against each of
>         the N-1 remaining sets (those not used to train the classifier).
> """
>
> So in my case, I think this means that I train with the first 20%,
> then test with each of the remaining 20%s (and repeat).  I may be
> wrong <wink>.

That's a good description of what it does.  It's not the preferred way to
test:  the results are hard to interpret, it's slow (N**2-N test runs are
made), and it's brutal (in your case, using -n5, each classifier built is
tested against 4x as many messages as it was trained on).
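
In outline, the grid does something like this sketch (the three callables
here are stand-ins for whatever the real driver plugs in, not its actual
API):

def nxn_grid(sets, make_classifier, train_on, score):
    """Sketch of the timtest.py scheme:  N classifiers are built, each
    trained on one set and then run against the other N-1 sets, for
    N**2 - N test runs in total."""
    n = len(sets)
    for i in range(n):
        clf = make_classifier()      # one classifier per training set
        train_on(clf, sets[i])       # trained on a single set
        for j in range(n):
            if j == i:
                continue             # skip the diagonal
            score(clf, sets[j])      # predict against a set it never saw
    # total test runs:  n * (n - 1) == n**2 - n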

timcv.py is a more traditional cross-validation test driver, probably much
closer to what Bill is doing.  Its results are easier to interpret, it runs
faster, and it will almost always deliver "better-looking results" than
timtest.py delivers, because the cross-validation driver trains on many more
messages than it tries to classify (the opposite is true of timtest.py).
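
The cross-validation scheme looks roughly like this (again just a sketch,
with the same stand-in callables):

def cross_validate(sets, make_classifier, train_on, score):
    """Sketch of the timcv.py style:  for each of the N sets, train one
    classifier on the other N-1 sets and predict against the held-out
    set -- only N runs, each training on far more messages than it
    scores."""
    n = len(sets)
    for i in range(n):
        clf = make_classifier()
        for j in range(n):
            if j != i:
                train_on(clf, sets[j])   # train on everything but set i
        score(clf, sets[i])              # predict against the held-out set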



