[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Tue, 24 Sep 2002 23:54:50 -0400


[Tim reports his usual pathetic results]
> total unique fp went from 1 to 1 tied
> mean fp % went from 0.05 to 0.05 tied
>
> total unique fn went from 1 to 0 won   -100.00%
> mean fn % went from 0.05 to 0.0 won   -100.00%

[Neil Schemenauer]
> You're a victim of your own success. :-)  Could you try sabotaging the
> classifier to make it work harder and produce higher error rates?  For
> example, don't let it look at any header information.

I'm not sure it would help enough (meaning I'm not sure it would hurt enough
<wink>).  Recall that I started by ignoring the headers completely, and
adding in the few that I did made only fraction-of-a-percent differences
for me.
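
For concreteness, Neil's sabotage amounts to nothing more than controlling
which message parts the tokenizer is allowed to mine.  A minimal sketch in
plain Python (not our actual tokenizer.py -- the mine_headers flag and the
particular header names are made up purely for illustration):

    import email
    import re

    _word = re.compile(r"\S+")

    def tokenize(raw_message, mine_headers=False):
        # Yield tokens from one raw message.  With mine_headers=False the
        # classifier never sees header clues at all.
        msg = email.message_from_string(raw_message)
        if mine_headers:
            # Just a few headers -- the sort of thing that made only a
            # fraction-of-a-percent difference on my data.
            for name in ("subject", "from", "content-type"):
                for value in msg.get_all(name, []):
                    for w in _word.findall(value.lower()):
                        yield "%s:%s" % (name, w)
        for part in msg.walk():
            payload = part.get_payload(decode=False)
            if isinstance(payload, str):
                for w in _word.findall(payload.lower()):
                    yield w

Running the same cross-validation grid with mine_headers on and off would
show how much damage losing the headers really does.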

I've said before that I'm suspicious the *tokenizer* is overtuned to my test
data:  I pretty much tokenized all and only the things that made a
significant difference for it (without picking up trivially strong clues
about newsgroup postings vs bruceg), and likewise tokenized different parts
in different ways only where that paid off.  As TESTING.txt has
prophetically said all along,

+ Any sufficiently general scheme with enough free parameters can
  eventually be trained to recognize any specific dataset exactly.
  It's wonderful if other people test your changes against other
  datasets too.  That's hard to arrange, so at least change your own
  data periodically.  I'm suspicious that some of the weirder "proven
  winner" changes I've made are really specific to statistical
  anomalies in my test data; and as the error rates get closer to 0%,
  the chance that a winning change helped only a few specific msgs
  zooms (of course sometimes that's intentional!  I haven't been shy
  about adding changes specifically geared toward squashing very
  narrow classes of false positives).

What would really be interesting, then, is if Skip gave me his corpus --
he's reported by far the worst error rates of anyone, and I have an amply
demonstrated ability to change 1000 things at once and end up fitting the
data exactly <0.8 wink>.

I definitely need *some* new corpus to strain against.  GregW has made great
progress on collecting a new python.org corpus, with oodles of Asian spam,
and ham msgs other than just c.l.py traffic.  I look forward to watching my
error rates go down the toilet again.

But, for purposes of the test at hand, it's actually helpful that Guido's
rates are much worse than mine:  the question is which scheme does better,
and it's important to know whether one scheme dominates consistently, or
whether it's a mixed bag.
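
Pinning down what "dominates" means:  run both schemes over the same
cross-validation splits and count, run by run, which one ends up with the
lower error rate.  A sketch, with per-run fn rates invented purely for
illustration (not real results):

    def compare(label, rates_a, rates_b):
        # Count, run by run, which scheme had the lower error rate.
        a_wins = sum(a < b for a, b in zip(rates_a, rates_b))
        b_wins = sum(b < a for a, b in zip(rates_a, rates_b))
        ties = len(rates_a) - a_wins - b_wins
        print("%s: scheme A won %d, scheme B won %d, tied %d"
              % (label, a_wins, b_wins, ties))

    # Invented per-run fn rates (%), one entry per cross-validation run.
    compare("fn", [0.05, 0.00, 0.10, 0.05, 0.05],
                  [0.10, 0.05, 0.10, 0.15, 0.00])
    # -> fn: scheme A won 3, scheme B won 1, tied 1

If one scheme wins nearly every run, it dominates; if the wins bounce back
and forth, it's a mixed bag.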