[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Fri, 27 Sep 2002 15:52:49 -0400


The result:  Gary's f(w) scheme wins.

After looking at all the reports once, I was *going* to say that Graham's
scheme had only one clear advantage:  a single spam_cutoff value that works
well for everyone.

But that wasn't so:  although it was outside the test protocol (which is why
I ignored it the first time around), on Anthony's full and very difficult
11,000 spam and 19,000 ham corpus, Graham's scheme got 135 f-p to his
tuned-Robinson scheme's 10 f-p.

The value I initially suggested for spam_cutoff was generally too small, and
the value for robinson_probability_a too large.  The people who worked
hardest at helping to tune these (Neil and Anthony) converged on an "a"
close to 0.2, which also worked well for me.  The other parameters (i.e.,
apart from "a" and spam_cutoff) were close to optimal from the start, and
didn't show signs of corpus dependence.  Everyone who worked at tuning got
results at least as good as with Graham, several slightly better than with
Graham, and some much better.
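For concreteness, here's a minimal sketch of the f(w) adjustment being tuned
above, assuming the form from Gary's writeup, f(w) = (a*x + n*p(w)) / (a + n);
the parameter names below (and the mapping of "a" to robinson_probability_a)
are my gloss, not necessarily the exact code in the tree:

```python
def robinson_fw(p, n, a=0.2, x=0.5):
    """Robinson-style adjusted word probability (a sketch).

    p: the word's raw spam probability estimate
    n: how many training messages the word appeared in
    a: strength of the prior -- the "a" that converged near 0.2 here
    x: assumed probability for a word we've never seen
    """
    return (a * x + n * p) / (a + n)

# A never-seen word (n=0) scores exactly x; as n grows, f(w)
# approaches the raw estimate p, so the prior fades out gracefully.
```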

Apart from all that, as the guy who has written all of the core scoring code
so far, I'm *much* more comfortable with Gary's formulation:  it has fewer
tuning parameters than Graham's, and they *make sense*, both from
theoretical and practical POVs.  I said in the first msg of this thread that
Gary's scheme would win for that reason alone if it merely did no worse than
Graham's.  That it's often a small win and sometimes a large win nails it.

spam_cutoff remains "a problem" or "an opportunity", depending on how you
look at it.  I think it's some of each:

+ Against it, the best value does vary by corpus, there's no
  way evident to guess the best value a priori, and it's not
  trivial to figure out the best value a posteriori either.

+ For it, there's a real and valuable "middle ground" here:  moving
  spam_cutoff a few points allows people to favor f-n over f-p (or
  vice versa) easily, effectively, and gradually.
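The middle ground amounts to a single threshold comparison; the scores below
are made up purely for illustration:

```python
def classify(score, spam_cutoff):
    """Label a message by comparing its combined score to spam_cutoff."""
    return "spam" if score >= spam_cutoff else "ham"

# Hypothetical message scores.  Raising the cutoff trades f-p for f-n
# (fewer hams called spam, more spams called ham); lowering it does
# the reverse -- and the trade is gradual, not all-or-nothing.
scores = [0.35, 0.52, 0.58, 0.71]
strict = [classify(s, 0.65) for s in scores]   # favors fewer f-p
lenient = [classify(s, 0.50) for s in scores]  # favors fewer f-n
```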

The "cosmetics" are better under f(w) too.  By that I mean things that
aren't really technical advantages either way, but that people *like*
better.  For example, I expect everyone hated seeing f-p under Graham's
scheme that came out with a score of 1.0 ("that's certainly spam!"), and
seeing f-n come out with a score of 1e-29 ("that's certainly ham!").  f(w)
appears never to do that:  the bulk of f-n and f-p scores are a short
distance from the corpus's best spam_cutoff, and even in Skip's untuned f(w)
run on his very hard corpus, no f-p scored above 0.60, and no f-n below
0.20.
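That clustering falls out of the combining step.  Here's a sketch assuming
the geometric-mean P/Q combination from Gary's published description; the
code in the tree may differ in detail:

```python
import math

def combined_score(fws):
    """Combine per-word f(w) values, Robinson-style (a sketch).

    P measures evidence for spam, Q evidence for ham, each via a
    geometric mean; S rescales (P-Q)/(P+Q) into [0.0, 1.0].
    """
    n = len(fws)
    P = 1.0 - math.exp(sum(math.log(1.0 - f) for f in fws) / n)
    Q = 1.0 - math.exp(sum(math.log(f) for f in fws) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

# Conflicting evidence pulls the score toward the middle instead of
# pegging it at 0.0 or 1.0, which is why mistakes land near the cutoff.
```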

Thanks for playing!  Unless someone has a killer objection, I'm going to
purge the stuff unique to the Graham scheme from the codebase, and switch
the defaults to Gary's f(w) scheme.

If you're dying to test something else, here are two open questions of real
practical importance:

1. How does the ham/spam training ratio versus real-life ham/spam
   ratio affect error rates?

2. How does the absolute ham+spam training size affect error rates?
   (i.e., how many messages do we need to train on to get a desired
   accuracy level?)

A third:

3. Is it possible to "seed" a database with somebody else's data
   and get decent results out of the box?

If anyone would care to champion one of these, grab it, dream up a test
strategy, announce it here and solicit results.  I'm least interested in #3
right now because I've got the slimmest of evidence that a key part of the
central-limit approaches may indeed be sharable for both ham and spam, and
it's premature to test anything with those approaches yet (we don't yet have
a handle on how to produce a reasonable score from these -- the scores they
produce right now are absurd).