[Spambayes] More ratio experiments

T. Alexander Popiel popiel@wolfskeep.com
Wed, 09 Oct 2002 08:37:43 -0700


In message:  <LNBBLJKPBEHFEDALKOLCEEGOBIAB.tim_one@email.msn.com>
             "Tim Peters" <tim_one@email.msn.com> writes:
>
>Alex left some of the test driver output intact

All of the test driver output is available at

  http://www.wolfskeep.com/~popiel/spambayes/ratio2

just in case someone wants to look at it.  Histograms, more
verbose indications of the training and testing cycles, false
positive excerpts, and everything.


After sleeping on the data (yes, my bedroom is over the computer
rooms ;-) ), some more things are niggling at me... like the
error rates (specifically fn) going _UP_ as more training data
is added for the very low ham:spam ratios.  My guess is that the
classifier is discovering that yes, there _is_ ham in the universe,
and so maybe more stuff should be classified as ham.

I'm also wondering if there's a point at which dropping the
ham:spam ratio starts increasing the fn rate, holding the training
set size constant (this I can test), and if there's an amount of
training data above which low ham:spam is no longer good, or is even
bad (this I don't have enough data to test).
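
To make the constant-size test concrete, here is a rough sketch of how a
fixed training-set size might be split for a given ham:spam ratio.  The
function names and the total of 1000 messages are made up for
illustration; this is not the actual test driver.

```python
from fractions import Fraction

def split_training_set(total, ham_to_spam):
    """Split `total` messages into (n_ham, n_spam) for a ham:spam ratio.

    `ham_to_spam` is anything Fraction accepts, e.g. "1/4" for 1 ham
    per 4 spam.  Hypothetical helper, not part of the test driver.
    """
    ratio = Fraction(ham_to_spam)
    # ham share of the whole set is r / (1 + r)
    n_ham = round(total * ratio / (1 + ratio))
    return n_ham, total - n_ham

def false_negative_rate(n_spam_tested, n_spam_called_ham):
    """fn rate: fraction of tested spam that was classified as ham."""
    return n_spam_called_ham / n_spam_tested

if __name__ == "__main__":
    # Same total every time; only the mix changes.
    for r in ("4", "1", "1/4", "1/9"):
        print(r, split_training_set(1000, r))
```

Running the fn rate against each of these mixes, with the total held
fixed, would show whether there's a ratio below which fn starts
climbing.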

Lastly, I'm wondering if I should even bother with the non-central-limit
stuff anymore, since the central-limit stuff seems from other reports
to be more interesting.  (I really ought to do comparisons among the
7 extant classifiers (default, clt[123] x {cl,rms}pik) on my data...
heck, it might even be getting close to shootout time again...)
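
Spelling out that shorthand, the seven variants could be enumerated
like this.  The label strings are just my expansion of the notation
above for illustration; they are not option names or script
invocations.

```python
from itertools import product

def classifier_variants():
    """Expand 'default, clt[123] x {cl,rms}pik' into 7 labels.

    The labels are illustrative only, not real configuration names.
    """
    variants = ["default"]
    for clt, pik in product(("clt1", "clt2", "clt3"), ("clpik", "rmspik")):
        variants.append(f"{clt}+{pik}")
    return variants

if __name__ == "__main__":
    for name in classifier_variants():
        print(name)
```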

- Alex