[Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Wed Dec 10 15:56:27 EST 2003


[Skip Montanaro]
> No starting corpus other than mail as it arrived and the two initial pump
> primers.  They were recently received messages as well though.  I just
> wanted something to keep the initial scores from all being 0.50.

This is great.  I think it's the ultimate case of incremental training.
Don't you think the result would be nearly the same whether you used your
incoming mail stream or a saved corpus?  The only random part is the one
message of each type that you pick first.  That causes a particular sort
order for everything else, and that guides you as to what to train on next.
Depending on which two messages you started with, you might wind up with a
different training set, though not necessarily better or worse.  But doing
it one at a time tends to give you the least duplication in messages that
you select for training.  I think an interesting variation on this would be
to start with one message of each type, score a small corpus of equal
numbers of ham and spam (say 100-150 of each), and always add one spam plus
one ham to the training set each time.  That way the sets will stay
balanced.  As you suggest, hams are probably easier to classify, so without
forcing balance you would tend to train on fewer hams and end up with an
imbalanced set.  I suppose that's OK up to a point, wherever that is.

It should be possible to automate training on the single worst ham + the
single worst spam on each pass, guaranteeing a balanced training set and
the least duplication; a sketch of such a loop is below.  However, with a
small training set, each message you add could skew things quite a bit,
since every new message can shift the estimated classifier substantially.
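
Something like this would do it.  This is only a minimal sketch: the
learn(tokens, is_spam) and spamprob(tokens) interface is modeled loosely
on SpamBayes' classifier, and train_balanced() and the pass count are my
own inventions, so treat it as pseudocode that happens to run:

    def train_balanced(classifier, hams, spams, passes=20):
        """Each pass, train the one untrained ham that scores most
        spam-like and the one untrained spam that scores most
        ham-like, so the trained sets stay exactly balanced."""
        for _ in range(passes):
            if not hams or not spams:
                break                    # a pool ran dry; stop to keep balance
            worst_ham = max(hams, key=classifier.spamprob)
            worst_spam = min(spams, key=classifier.spamprob)
            classifier.learn(worst_ham, False)   # train as ham
            classifier.learn(worst_spam, True)   # train as spam
            hams.remove(worst_ham)               # never retrain the same message
            spams.remove(worst_spam)

The remove() calls keep a message from being selected twice, which is
where the "least duplication" property comes from.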

I'm still amazed that it can classify at all based on only 34 messages.  I
wonder if it would do better on small training sets if it were allowed to
use more than 150 tokens when scoring (the max_discriminators limit, I
believe)?  I'm assuming that every message token is put into the database
when a message is trained.  If that's so, there's information we're not
using while the token counts haven't yet settled down near their expected
values.
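
For what it's worth, here's a toy version of the "combine the N most
extreme tokens" step, just to make the knob concrete.  The chi-squared
combining is a simplified Fisher-style sketch, not SpamBayes' exact
formula, and the function names and fake token probabilities are all
made up:

    import math
    import random

    def chi2Q(x2, v):
        """Survival function of a chi-squared variate at x2, for even
        degrees of freedom v (standard closed form)."""
        m = x2 / 2.0
        term = total = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def score(probs):
        """Fisher-style combining of per-token spam probabilities:
        S gauges spam evidence, H gauges ham evidence."""
        n = len(probs)
        S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
        return (S - H + 1.0) / 2.0      # 1.0 = certain spam, 0.0 = certain ham

    def extremes(probs, n):
        """Keep the n probabilities farthest from the neutral 0.5."""
        return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

    random.seed(1)
    # 400 fake token probabilities for one mildly spammy message.
    probs = [min(max(random.gauss(0.6, 0.2), 0.01), 0.99) for _ in range(400)]
    for n in (15, 150, 400):
        print(n, round(score(extremes(probs, n)), 3))

Raising n pulls in progressively weaker evidence; whether that lowers the
score variance on a tiny database is exactly the open question.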

That's an interesting question:  if the token count mean-squared error is
large, does including more tokens reduce the variance of the message score?
If the token count error is zero mean (a big if) with a reasonable
distribution (another big if), I'd have to guess that it would.  Otherwise,
it wouldn't help and could even make things worse, though I doubt it's that
badly behaved.  I could see why it wouldn't make much difference with a large
training set, since the errors on the individual token counts are smaller
and we're combining (though not linearly) 150 of them, so the total
estimation error is down in the noise.  But with only a few trained
messages, the token counts are all small enough that the errors are
necessarily larger, if for no other reason than quantization.  For
example, suppose that for a given number of messages, the expected value
of a particular token count is 1.33.  The best an integer count can do
is 1, which is still a 25% relative error; a count of 2 would be off by
50%.  That's not so good for the best case.  With ten times as many
messages containing that token, the quantization error drops quickly
(about 2% at an expected count of 13.3).  At some point, other error
sources will dominate, but for very small training sets, this one might
be important.
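
A couple of lines of throwaway Python put numbers on the quantization
point (the expected counts are just the ones from the example above):

    # Best-case relative error when an expected token count c is
    # forced to the nearest integer.
    for c in (1.33, 13.3, 133.0):
        nearest = round(c)
        print(f"expected {c:6.2f} -> count {nearest:3d}, "
              f"error {abs(nearest - c) / c:.1%}")
    # expected   1.33 -> count   1, error 24.8%
    # expected  13.30 -> count  13, error 2.3%
    # expected 133.00 -> count 133, error 0.0%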

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



