[spambayes-dev] Evaluating a training corpus

Meyer, Tony T.A.Meyer at massey.ac.nz
Mon Jun 9 10:26:40 EDT 2003


> mboxtest.py is probably the easiest to get going.  I think 
> timcv.py gives better results but it's a little more trouble 
> to set up your test data.  See README.txt for a short 
> explanation of the tools.  If you want to use timcv.py, you 
> can use splitndirs.py to create the test data.

Which is preferred, timtest or timcv?  The readme has:
    [timcv] is the preferred way to test when possible:  it
    makes best use of limited data, and interpreting results is
    straightforward.
But also:
    [timtest] is a much harder test than timcv, because it trains on
    N-1 times less data, and makes each classifier predict against
    N-1 times more data than it's been taught about.
And I would have thought that a harder test was a better test.  (I
presume that if I understood more statistics I could answer this
myself...).
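
To make the two README descriptions concrete, here is a minimal sketch of
how I read them.  It is not the real timcv.py or timtest.py code: the
function names are made up for illustration, and the assumption that the
data is split into N equally sized sets numbered 1..N is mine.

```python
def timcv_style_runs(n):
    """Cross-validation: each run trains on n-1 sets and predicts the held-out set."""
    sets = list(range(1, n + 1))
    return [([s for s in sets if s != held_out], [held_out]) for held_out in sets]


def timtest_style_runs(n):
    """Grid style: each run trains on one set and predicts the other n-1 sets."""
    sets = list(range(1, n + 1))
    return [([trained], [s for s in sets if s != trained]) for trained in sets]


if __name__ == "__main__":
    # Hypothetical example with N = 5 sets.
    for name, runs in (("timcv-style", timcv_style_runs(5)),
                       ("timtest-style", timtest_style_runs(5))):
        for train, test in runs:
            print(name, "train on", train, "predict on", test)
```

With N = 5, each timcv-style run trains on 4 sets and predicts 1, while
each timtest-style run trains on 1 set and predicts 4, which is where the
"N-1 times less data" and "N-1 times more data" figures come from.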

=Tony Meyer


